
Eduovisual

Biostatistics & Population Health

Effect size measures: Cohen's d and clinical interpretation

Clinical Overview and When to Suspect Effect Size Inadequacy

Effect size quantifies the magnitude of a difference or association, independent of sample size, answering "how big?" rather than "is it real?"

P-values tell you whether an effect likely exists; effect sizes tell you whether it matters clinically
— A trial with n=50,000 can produce p<0.001 for a 1 mmHg BP reduction: statistically significant, clinically trivial
— A small pilot study may show a large effect that fails to reach significance because it is underpowered

Cohen's d is the standardized mean difference between two groups, expressed in standard deviation units
— Formula: d = (M₁ − M₂) / SD_pooled
— Unitless, allowing comparison across studies measuring different scales (e.g., HAM-D vs PHQ-9 for depression)

When to invoke effect size on Step 3 stems:
— Interpreting trial results where only a p-value is given but clinical meaningfulness is questioned
— Meta-analyses pooling outcomes across heterogeneous instruments
— Comparing two interventions that are both "statistically significant" but differ in magnitude
— Sample size / power calculations during study design

Common effect size families:
— Mean differences: Cohen's d, Hedges' g (small-sample correction), Glass's Δ
— Correlations: Pearson r, R²
— Categorical/risk: odds ratio, risk ratio, NNT, phi coefficient

Board pearl: A statistically significant result with a tiny Cohen's d (e.g., 0.1) in a megatrial often reflects sample-size–driven precision, not clinical importance. Always ask: would this change my management?
Effect sizes anchor minimal clinically important difference (MCID) discussions; the MCID is the smallest change a patient perceives as beneficial
Regulatory bodies (FDA) increasingly require effect size reporting alongside p-values for drug approvals, particularly in psychiatric and pain trials where placebo effects are large
On Step 3, expect a stem that pairs a "significant" p-value with a small d and asks whether to adopt the intervention. The answer is usually no, citing lack of clinical meaningfulness
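The megatrial point above can be checked numerically. The sketch below computes both Cohen's d and a two-sided p-value for a 1 mmHg difference in a 50,000-patient trial. The 12 mmHg common SD, the even 25,000-per-arm split, and the function name `two_sample_z` are assumptions made for illustration; none of them come from the text.

```python
import math

def two_sample_z(mean_diff, sd, n_per_group):
    """Return (d, two-sided p) for a difference in means.

    Assumes a common SD, equal group sizes, and a normal approximation.
    """
    d = mean_diff / sd                           # standardized mean difference
    se = sd * math.sqrt(2.0 / n_per_group)       # standard error of the difference
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal p-value
    return d, p

# Assumed inputs: 1 mmHg difference, SD 12 mmHg, 25,000 patients per arm
d, p = two_sample_z(mean_diff=1.0, sd=12.0, n_per_group=25_000)
print(f"d = {d:.3f}, p = {p:.1e}")
```

Under these assumptions d stays below 0.1 while p falls far below 0.001, which is the "significant but trivial" pattern the stems are built around.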
Presentation Patterns and Key History — How Effect Size Appears in Stems

Typical Step 3 stem framings:
— "A trial of drug X vs placebo for depression shows mean HAM-D reduction of 2 points (p=0.01, Cohen's d=0.18)..."
— "A new antihypertensive lowers SBP by 1.5 mmHg compared to standard care in 30,000 patients (p<0.001)..."
— "Cognitive behavioral therapy vs waitlist for anxiety: d=0.85, p=0.04, n=40..."
— Meta-analysis forest plots with pooled standardized mean differences (SMDs)

Key history elements the stem provides:
— Sample size (n): a large n can make even trivial effects statistically significant
— Standard deviation of the outcome: needed to contextualize raw differences
— Outcome scale (HAM-D, MMSE, VAS pain, BP, HbA1c): different scales require standardization
— MCID: sometimes given explicitly ("a 3-point change on HAM-D is clinically meaningful")
— Confidence interval around the effect size

Cohen's conventional benchmarks (memorize):
— d ≈ 0.2: small effect
— d ≈ 0.5: medium effect
— d ≈ 0.8: large effect
— d > 1.2: very large; d < 0.1: trivial

Key distinction: These thresholds are conventions, not laws. A d of 0.2 for a cancer mortality intervention may be huge; a d of 0.8 on a subjective symptom scale prone to measurement bias may overstate the real benefit

Red flags suggesting the answer involves effect size reasoning:
— Huge sample sizes with tiny absolute differences
— Significant p-values where the raw difference is below the MCID
— Multiple "positive" trials with conflicting magnitudes

Board pearl: When a stem provides both p-value and effect size, the test writer almost always wants you to weight the effect size more heavily in the clinical decision. P-values are gatekeepers; effect sizes are the actual signal
History should also probe outcome relevance: surrogate (LDL, HbA1c) vs patient-important (MI, mortality, QoL). A large d on a surrogate may not translate to a meaningful patient benefit
Physical Exam Findings — Visualizing and "Examining" Effect Sizes

Effect sizes don't have a physical exam, but they have visual and graphical correlates that Step 3 expects you to interpret

Overlap of distributions is the intuitive anchor for Cohen's d (overlap here = 1 − Cohen's U₁, assuming normal distributions):
— d = 0.2: ~85% overlap between group distributions; most patients in the treatment group look like most in control
— d = 0.5: ~67% overlap; noticeable shift but substantial overlap
— d = 0.8: ~53% overlap; distributions clearly separated but still with meaningful overlap
— d = 2.0: ~19% overlap; minimal overlap, dramatic separation

Probability of superiority (Common Language Effect Size, CLES):
— d = 0.2 → 56% chance a random treated patient does better than a random control
— d = 0.5 → 64%
— d = 0.8 → 71%
— Useful for explaining trial results to patients in plain language

Forest plot interpretation:
— X-axis often labeled SMD or Hedges' g
— Diamond crossing zero = no significant pooled effect
— Width of diamond = 95% CI; narrow = precise
— Heterogeneity (I²) >50% means pooled d may mask varying true effects

Funnel plot asymmetry suggests publication bias: small studies with small effects go unreported, inflating pooled d
Step 3 management: When shown a forest plot with pooled d=0.3 and tight CI, ask: (1) is 0.3 above the MCID for this outcome? (2) is heterogeneity acceptable? (3) is the outcome patient-centered? Only "yes" to all supports practice change
Board pearl: A statistically significant pooled effect with d<0.2 in a meta-analysis of >10,000 patients usually means precision has outpaced clinical meaningfulness; counsel patients honestly that benefit is real but small
Box plots and violin plots showing nearly identical medians with significant p-values are the visual signature of large-n, small-d studies; recognize this pattern instantly
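The overlap and probability-of-superiority figures for d = 0.2, 0.5, and 0.8 can be reproduced from the normal distribution. This is a minimal sketch assuming normally distributed outcomes with equal variances, with overlap defined as 1 − Cohen's U₁; the function names are illustrative.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cles(d):
    """Probability that a random treated patient outscores a random control."""
    return phi(d / math.sqrt(2.0))

def overlap_u1(d):
    """Distribution overlap, defined here as 1 - Cohen's U1."""
    u2 = phi(abs(d) / 2.0)
    u1 = (2.0 * u2 - 1.0) / u2
    return 1.0 - u1

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: CLES = {cles(d):.0%}, overlap = {overlap_u1(d):.0%}")
```

Other overlap definitions exist (for example, the overlapping coefficient 2Φ(−d/2)) and give different percentages for the same d, so it is worth checking which one a source uses before comparing numbers.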
Diagnostic Workup — Calculating Cohen's d and Related Measures

Cohen's d formula:
— d = (M₁ − M₂) / SD_pooled
— SD_pooled = √[((n₁−1)SD₁² + (n₂−1)SD₂²) / (n₁+n₂−2)]

Worked example: New SSRI vs placebo for MDD
— Treatment: mean HAM-D reduction = 10, SD = 6, n=100
— Placebo: mean HAM-D reduction = 7, SD = 5, n=100
— SD_pooled = √[(99×36 + 99×25)/198] = √30.5 ≈ 5.52
— d = (10−7)/5.52 ≈ 0.54 → medium effect

Hedges' g: small-sample-corrected version of d
— g = d × [1 − 3/(4(n₁+n₂)−9)]
— Use when total n < 50; otherwise g ≈ d

Glass's Δ: uses only control group SD in denominator
— Preferred when treatment alters variability (e.g., reduces variance) or when groups have unequal SDs

Standardized mean difference (SMD): umbrella term used in meta-analyses; often reported as Hedges' g

Effect sizes for other data types:
— Two proportions: risk difference, RR, OR, NNT, phi (φ)
— Correlations: Pearson r (r=0.1 small, 0.3 medium, 0.5 large) or R²
— ANOVA: eta-squared (η²) or omega-squared (ω²)

Confidence interval around d is essential: a d of 0.5 with CI [0.1, 0.9] is far less convincing than d=0.5 [0.4, 0.6]
Board pearl: Memorize that d = 0.2 / 0.5 / 0.8 = small / medium / large; this is the single most testable Cohen's d fact on Step 3

Converting between metrics:
— d = 2r / √(1−r²) and r = d / √(d² + 4), assuming equal group sizes
— d to OR (Chinn formula): OR ≈ exp(d × π/√3) ≈ exp(1.81d)

Step 3 management: When a stem gives you means, SDs, and ns, you should be able to estimate d within 0.1. Calculate SD_pooled crudely as the average SD when group sizes are similar; that is close enough for ranking answer choices
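The formulas in this section translate directly into code. The sketch below reruns the SSRI worked example and adds the Hedges' g, Glass's Δ, and d-to-r conversions; the function names are illustrative, and the d-to-r conversions assume equal group sizes.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled SD, weighting each group's variance by its degrees of freedom."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using the pooled SD."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)

def hedges_g(d, n1, n2):
    """Small-sample correction applied to Cohen's d."""
    return d * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))

def glass_delta(m_treat, m_control, sd_control):
    """Mean difference standardized by the control-group SD only."""
    return (m_treat - m_control) / sd_control

def d_to_r(d):
    """Convert d to a correlation (equal group sizes assumed)."""
    return d / math.sqrt(d**2 + 4.0)

def r_to_d(r):
    """Convert a correlation to d (equal group sizes assumed)."""
    return 2.0 * r / math.sqrt(1.0 - r**2)

# Worked example from the text: treatment 10 (SD 6, n=100) vs placebo 7 (SD 5, n=100)
d = cohens_d(10, 6, 100, 7, 5, 100)
g = hedges_g(d, 100, 100)
print(f"SD_pooled = {pooled_sd(6, 100, 5, 100):.2f}")  # 5.52
print(f"d = {d:.2f}, g = {g:.2f}")                      # d = 0.54, g = 0.54
```

With 100 patients per arm the correction factor is about 0.996, which is why g and d agree to two decimals here.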
Advanced Workup — MCID, NNT, and Anchoring Effect Sizes to Clinical Meaning

Minimal Clinically Important Difference (MCID): smallest change in an outcome patients perceive as beneficial
— HAM-D: ~3 points
— PHQ-9: ~5 points
— MMSE: ~1.4 points (for Alzheimer's trials)
— VAS pain (0–100): ~10–20 mm
— 6-minute walk test: ~30 m

Effect size meets MCID via the anchor-based and distribution-based methods
— Anchor-based: tie change scores to a patient-reported global rating
— Distribution-based: half a SD (≈ d of 0.5) is often the MCID (Norman's "magical half SD")

Number Needed to Treat (NNT) complements Cohen's d for binary outcomes
— NNT = 1 / absolute risk reduction (ARR)
— Cohen's d can be transformed to NNT via Kraemer's formulas; rough mapping: d = 0.2 → NNT ≈ 9; d = 0.5 → NNT ≈ 4; d = 0.8 → NNT ≈ 3

Fragility index: how many event reversals would flip a "significant" result to non-significant; small fragility + small d = weak evidence

Confidence intervals trump point estimates:
— A d of 0.6 [0.55, 0.65] from a megatrial may be more actionable than d of 1.2 [0.3, 2.1] from a pilot

Bayesian effect size posteriors: increasingly seen in Step 3 prep; these give credible intervals around d, with prior distributions explicitly stated
Key distinction: Statistical significance depends on n and variance; clinical significance depends on d crossing MCID. They are distinct questions: a result can be one without the other
Board pearl: When a question asks "is this finding clinically meaningful?", compare the point estimate and lower 95% CI bound of the effect to the MCID. If the lower CI bound exceeds MCID, the result is robustly meaningful
Surrogate outcomes (LDL, HbA1c, BP) require a demonstrated link to patient-important outcomes before a large d justifies practice change; recall torcetrapib (huge HDL effect, increased mortality)
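The rough d-to-NNT mapping in this section can be reproduced with the Kraemer and Kupfer conversion, NNT = 1 / (2Φ(d/√2) − 1), rounding up to a whole patient. The sketch also applies the "lower CI bound vs MCID" rule to the two intervals quoted above. Using half an SD (d = 0.5) as the MCID is an assumption borrowed from the distribution-based rule, and the helper names are hypothetical.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def d_to_nnt(d):
    """Kraemer-Kupfer conversion: NNT = 1 / (2 * Phi(d / sqrt(2)) - 1)."""
    if d <= 0:
        raise ValueError("d must be positive to give a finite NNT")
    return 1.0 / (2.0 * phi(d / math.sqrt(2.0)) - 1.0)

def robustly_meaningful(ci_lower, mcid):
    """True when even the lower CI bound exceeds the MCID."""
    return ci_lower > mcid

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: NNT ~ {math.ceil(d_to_nnt(d))}")

HALF_SD_MCID = 0.5  # assumption: MCID expressed in d units via the half-SD rule
print(robustly_meaningful(0.55, HALF_SD_MCID))  # megatrial, d = 0.6 [0.55, 0.65]
print(robustly_meaningful(0.30, HALF_SD_MCID))  # pilot, d = 1.2 [0.3, 2.1]
```

Under that assumed MCID, the megatrial clears the threshold at its lower bound while the pilot does not, despite the pilot's larger point estimate.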
Risk Stratification — Interpreting Effect Sizes in Clinical Decision-Making

Framework for weighting effect size in clinical decisions:
— Magnitude: Is d at least small (≥0.2)? Does it cross MCID?
— Precision: Is the CI narrow? Does the lower bound remain clinically meaningful?
— Outcome importance: Mortality > major morbidity > QoL > surrogate
— Harm profile: Even d=0.8 fails if NNH < NNT
— Cost and access: Real-world feasibility

Hierarchy of evidence quality affecting effect size trust:
— RCT with low risk of bias and pre-registered analysis: high trust
— Open-label trial with subjective outcome: inflated d likely
— Observational study: residual confounding may inflate d
— Meta-analysis: trustworthy only if heterogeneity is low and publication bias addressed

Special situations where small d is acceptable:
— Public health interventions reaching millions (smoking cessation, vaccination)
— Cheap, safe interventions (e.g., daily aspirin in narrow indications)
— Prevention of catastrophic but rare outcomes

Situations where large d is required:
— Expensive therapies (e.g., $100K/year biologics)
— Treatments with serious adverse effects
— Invasive procedures

Step 3 management: For a new drug with d=0.3 on PHQ-9, narrow CI, but cost of $500/month and 15% GI side effects → recommend against routine use; consider only if first-line therapies fail

Population vs individual effect: d reflects average treatment effect (ATE); individual patients vary
— Use shared decision-making, especially when d is small and harms are non-trivial

Board pearl: A trial showing d=0.2 in 20,000 patients is not more convincing than d=0.6 in 200 patients for an individual patient; population precision ≠ individual benefit. Translate to NNT and discuss
Subgroup analyses: effect sizes within subgroups (age, sex, severity) can guide personalization, but treat as hypothesis-generating unless pre-specified with adequate power
CCS pearl: When the simulation asks "discuss treatment options," and the trial evidence shows small d on a surrogate, document shared decision-making and offer behavioral/lifestyle alternatives first
Pharmacotherapy of Misinterpretation — Common Errors with Effect Sizes

Error 1: Conflating statistical and clinical significance
— p<0.05 with d=0.05 in n=100,000 → trivial clinically
— Fix: always pair p with effect size and CI

Error 2: Ignoring confidence intervals
— d=0.8 with CI [0.1, 1.5] is unstable; d=0.4 with CI [0.35, 0.45] is robust

Error 3: Comparing d across non-comparable populations
— d for antidepressants in inpatients with severe MDD ≠ d in primary care mild MDD
— Heterogeneity in baseline severity inflates or deflates d

Error 4: Using Cohen's d benchmarks blindly
— Field-specific norms exist: educational interventions often have d ~ 0.4; pharmacotherapy oncology endpoints may be huge (d > 1.5 for targeted therapies)

Error 5: Reporting only point estimate, omitting variability
— Means without SDs are uninterpretable

Error 6: Assuming larger d means more important outcome
— d on a 100-item symptom checklist may be inflated; d on hard mortality may seem small but reflect lives saved

Error 7: Confusing effect size with effect modification
— d describes magnitude; interaction terms describe modification

Error 8: Pooling heterogeneous effect sizes in meta-analysis
— High I² (>50%) means a single pooled d misleads; use random-effects and explore moderators

Error 9: Ignoring publication bias inflating pooled d
— Funnel plot asymmetry, Egger's test → consider trim-and-fill correction

Error 10: Overinterpreting surrogate effect sizes
— Niacin improved lipid surrogates (raised HDL, lowered LDL and triglycerides) but failed to reduce CV events when added to statin therapy

Board pearl: The most common Step 3 trap is a megatrial with p<0.001 and d=0.1; the correct answer is "result is statistically but not clinically significant; do not change practice"
Key distinction: "Significant" is a statistical term; "meaningful" is a clinical one. Step 3 expects you to use both deliberately
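Errors 2 and 8 both come down to how a pooled d and its heterogeneity are computed. The following is a minimal sketch of inverse-variance fixed-effect pooling with Cochran's Q and I², using the common large-sample approximation for the variance of d. The three trials are invented purely for illustration, and a real meta-analysis with high I² would move to a random-effects model, as the text notes.

```python
import math

def var_d(d, n1, n2):
    """Approximate sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2.0 * (n1 + n2))

def pool_fixed(studies):
    """Inverse-variance fixed-effect pooling.

    studies: list of (d, n1, n2) tuples.
    Returns (pooled_d, ci_low, ci_high, i_squared).
    """
    weights = [1.0 / var_d(d, n1, n2) for d, n1, n2 in studies]
    total_w = sum(weights)
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / total_w
    se = math.sqrt(1.0 / total_w)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (s[0] - pooled) ** 2 for w, s in zip(weights, studies))
    df = len(studies) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, i_squared

# Hypothetical trials (d, n1, n2): a small pilot, a large trial, a mid-sized trial
trials = [(0.80, 20, 20), (0.10, 200, 200), (0.45, 60, 60)]
pooled, ci_low, ci_high, i2 = pool_fixed(trials)
print(f"pooled d = {pooled:.2f} [{ci_low:.2f}, {ci_high:.2f}], I^2 = {i2:.0%}")
```

Because weights are the inverse of each study's variance, the large trial dominates the pooled estimate and the small pilot's large d contributes little.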
Procedures — Power Analysis and A Priori Effect Size Estimation

Sample size calculations require an anticipated effect size; Cohen's d is central

Formula (simplified, two-group comparison, α=0.05, power=0.80):
— n per group ≈ 16 / d²
— d=0.2 → ~400/group; d=0.5 → ~64/group; d=0.8 → ~25/group

Underpowered studies (small n, small expected d):
— Risk of type II error (false negative)
— When they do find significance, effect size is often inflated (winner's curse)

Overpowered studies (very large n):
— Detect trivial differences as "significant"
— Solution: pre-specify MCID and design for clinical meaningfulness, not just p<0.05

Sources of a priori d estimates:
— Prior trials of similar interventions
— Pilot studies (use cautiously; wide CIs)
— Clinically meaningful difference / SD ratio from established MCIDs

Equivalence and non-inferiority trials:
— Pre-specify a non-inferiority margin (Δ), often expressed as a fraction of historical effect size
— Margin must be smaller than MCID

Adaptive designs: sample size re-estimation based on interim effect size; must be pre-registered to avoid bias
Multiple comparisons inflate type I error; Bonferroni or FDR correction adjusts α, indirectly raising the effect size threshold for significance
Step 3 management: When designing a QI project or local trial, calculate required n using expected d; under-resourced studies should focus on outcomes with larger anticipated effects or use within-subject designs (paired t-test) where d is typically larger
Within-subject (paired) d is calculated using SD of differences, not SD of raw scores; it typically yields larger d than between-subjects designs for the same intervention
Board pearl: If a stem says "the study was powered to detect a clinically meaningful difference," verify whether that "meaningful" threshold equals or exceeds MCID. Studies powered for trivial effects waste resources and risk false positives
CCS pearl: Quality improvement orders in CCS-style stems should specify measurable outcomes with pre-defined effect targets, not vague "improvement"
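The 16 / d² shortcut comes from the normal-approximation formula n = 2(z₁₋α/₂ + z₁₋β)² / d², where the bracketed constant is about 15.7 for α = 0.05 and 80% power. The sketch below shows both side by side; the function names are illustrative, and an exact t-test calculation would add a patient or two per group.

```python
import math
from statistics import NormalDist

def n_rule_of_thumb(d):
    """Per-group n from the 16 / d^2 shortcut (alpha=0.05 two-sided, power=0.80)."""
    return round(16.0 / d**2)

def n_normal_approx(d, alpha=0.05, power=0.80):
    """Per-group n from the two-sample normal approximation."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / d**2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: shortcut {n_rule_of_thumb(d)}, normal approx {n_normal_approx(d)}")
```

Halving the expected d roughly quadruples the required sample, which is why an optimistic a priori d leaves a study underpowered.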
Special Populations — Effect Sizes in Elderly and Comorbid Patients

Effect sizes derived from RCTs often do not generalize to elderly and multimorbid patients excluded from those trials

External validity gap:
— Mean RCT age in cardiovascular trials: ~62; mean clinic patient: ~75
— Effect size measured in trial may shrink or invert in older patients
— Statin primary prevention trials show d declines with age; absolute benefit may persist due to higher baseline risk

Competing risks dilute effect size in elderly:
— A treatment with d=0.6 for CV mortality may show smaller observed benefit if patients die of other causes first

Baseline severity affects d:
— Floor effects in mild disease shrink d (no room to improve)
— Ceiling effects in severe disease also shrink d (irreversible damage)

Renal/hepatic impairment:
— Drug exposure increases → effect size on efficacy AND adverse events both rise
— Risk-benefit calculation requires comparing NNT vs NNH at adjusted exposures

Polypharmacy increases variability (larger SD), shrinking standardized effect size even if raw mean difference is preserved

Frailty index as effect modifier:
— Frail patients often show smaller benefit and larger harm
— Subgroup-specific effect sizes recommended for treatments in elderly

Step 3 management: For an 82-year-old with multiple comorbidities considering a therapy with trial d=0.4, shift the conversation to:
— Time-to-benefit vs life expectancy
— Patient priorities (function, independence vs longevity)
— De-prescribing when d is small and harm is incremental

Geriatric assessment before initiating therapies, even those with proven d in general populations
Board pearl: "Number needed to treat in 5 years" is more informative for elderly than relative effect size; if time-to-benefit exceeds life expectancy, withhold even high-d interventions
Key distinction: Average treatment effect (population d) ≠ conditional average treatment effect (CATE) for individuals. Heterogeneous treatment effect analyses are increasingly required by FDA for geriatric labeling
Special Populations — Pediatrics, Pregnancy, and Underrepresented Groups

Pediatric trials often report effect sizes on developmental and behavioral scales (e.g., ADHD Rating Scale, Vineland)
— d benchmarks similar (0.2/0.5/0.8) but absolute interpretation differs: a small d on developmental trajectory can have lifelong consequences

Pregnancy:
— Most therapeutic trials exclude pregnant patients → effect sizes typically extrapolated from non-pregnant populations
— Pharmacokinetic changes in pregnancy (volume of distribution, GFR) may alter effect size; dosing studies use pregnancy-specific d estimates

Underrepresented groups:
— RCT effect sizes often derived from predominantly white, male populations
— Generalizability to Black, Hispanic, Asian populations should be questioned
— BiDil (isosorbide dinitrate/hydralazine) is a historical example: d for HF mortality was large in self-identified Black patients, smaller in overall population

Sex- and gender-specific effect sizes:
— Aspirin primary prevention: stronger d for stroke prevention in women, MI prevention in men
— Antidepressants: some evidence of differential d by sex/menstrual cycle phase

Race and ancestry:
— Effect modification by genetic ancestry (e.g., warfarin dosing in CYP2C9/VKORC1 variants); pharmacogenomic d varies

Pediatric MCID:
— Differs from adult MCID; PedsQL, CHAQ scales have validated thresholds
— Effect sizes on growth/development outcomes weighted heavily

Step 3 management: When applying adult trial effect sizes to children, pregnant patients, or underrepresented groups:
— Acknowledge uncertainty explicitly
— Engage pediatric specialty consult or MFM
— Document shared decision-making

Board pearl: When a stem features a Hispanic woman considering a drug with a trial population that was 92% white men, the correct concern is limited external validity; recommend personalized risk discussion, not blanket extrapolation
Diversity, equity, and inclusion in trials (DEI) is now an FDA expectation; Diversity Plans are required for pivotal trials of new drugs
Complications — Misuse of Effect Sizes and Adverse Outcomes

Inflated effect sizes in early studies ("decline effect"):
— First trial of a new intervention often shows large d
— Replication studies show smaller d as bias and selection effects wash out
— Implication: do not adopt practice based on single large-d study

Publication bias systematically inflates pooled effect sizes
— Small negative studies remain unpublished
— Funnel plot asymmetry and Egger's regression detect this
— Trim-and-fill estimates the "true" d after correcting for the bias

P-hacking and selective outcome reporting:
— Multiple outcomes tested, only the largest d reported
— Pre-registration of analysis plans mitigates this

HARKing (Hypothesizing After Results are Known):
— Post hoc subgroup with large d reframed as primary hypothesis

Spin in trial reporting:
— Emphasizing relative risk reduction while obscuring a small absolute risk reduction
— A 50% RRR sounds dramatic; if risk falls from 0.2% to 0.1%, the absolute risk reduction is only 0.1% (NNT = 1,000)

Adoption of low-d interventions harms patients via:
— Cost and access displacement of higher-value care
— Adverse effects without commensurate benefit
— Overdiagnosis and overtreatment cascades

Heterogeneous treatment effects:
— Pooled d=0.3 may hide d=0.8 in responders and d=−0.2 in non-responders
— Without subgroup identification, all patients are treated, but only some benefit

Step 3 management: When considering adopting a new therapy:
— Demand replication
— Examine effect size CIs across multiple trials
— Verify outcomes are patient-centered, not surrogate
— Wait for guideline endorsement when d is modest

Board pearl: A d of 1.5 from a single industry-sponsored open-label trial should raise more suspicion than confidence; extreme effects in initial studies regress toward the mean upon replication
Patient safety harm: widespread adoption of interventions with overstated effect sizes is a documented quality and safety failure (e.g., niacin added to statin therapy, intensive glucose control in critically ill adults)
When to Escalate — Statistical Consultation and Methodologic Review

Indications to involve a biostatistician or methodologist:
— Designing a study with non-standard outcome (multilevel, longitudinal, time-to-event)
— Interpreting conflicting effect sizes across trials
— Conducting or appraising a meta-analysis with significant heterogeneity
— Sample size calculation for rare outcomes or composite endpoints
— Adaptive trial design, Bayesian analyses

Critical appraisal checklist for effect size reporting:
— Is effect size reported alongside p-value?
— Is the CI provided?
— Is the MCID stated and justified?
— Is the outcome patient-important?
— Are subgroup effect sizes pre-specified?
— Is heterogeneity (I²) reported in meta-analyses?
— Is publication bias assessed?

Institutional review for QI projects and local protocols:
— Pre-register hypotheses and analysis plans
— Define MCID a priori
— Avoid post hoc outcome shopping

Journal club framework for Step 3-relevant article appraisal:
— PICO question → study design → effect size + CI → MCID → applicability → harms → decision

Guideline incorporation:
— GRADE methodology rates evidence quality partly on effect size magnitude and precision
— Strong recommendations require larger, more precise effects

Step 3 management: When a colleague proposes adopting a new therapy based on a trial with d=0.3 and wide CI, recommend:
— Awaiting meta-analytic confirmation
— Reviewing professional society guidance
— Engaging local pharmacy and therapeutics committee

Board pearl: Effect sizes interact with GRADE quality of evidence: large effects (RR <0.5 or >2) can upgrade observational evidence to moderate quality; small effects in low-quality studies remain low quality
CCS pearl: In simulated practice, ordering "consult biostatistics" is reasonable when designing a research protocol or interpreting ambiguous trial evidence affecting management
Recognize the difference between clinical and research consultation: a methodologist informs decisions but does not make them
Key Differentials — Cohen's d vs Other Mean-Difference Effect Sizes

Cohen's d: standardized mean difference using pooled SD; assumes equal variances; large-sample
Hedges' g: small-sample correction to Cohen's d; preferred in meta-analyses (n<50 per group); g < d slightly
Glass's Δ: uses control group SD only; preferred when intervention changes variability or when groups differ in SD
Cohen's d_av (averaged): for within-subjects designs, uses average of pre/post SDs
Cohen's d_rm (repeated measures): corrects for correlation between paired measurements; close to d_av when pre and post SDs are similar
Cohen's d_z: based on SD of difference scores; common in paired t-test reports; exceeds d_av when the pre/post correlation is above 0.5
Common Language Effect Size (CLES) / Probability of Superiority: converts d to probability that a random treated person outperforms a random control
Cliff's delta: non-parametric analogue of Cohen's d; suitable for ordinal data
Mahalanobis distance: multivariate generalization of d; effect size for multiple outcomes simultaneously
Cohen's f: for ANOVA with multiple groups; f=0.1/0.25/0.4 = small/medium/large
Cohen's f²: for multiple regression; f²=0.02/0.15/0.35
Eta-squared (η²): proportion of variance explained by a factor in ANOVA; biased upward
Omega-squared (ω²): less biased variance-explained metric; preferred over η²
Partial eta-squared: controlling for other factors in ANOVA

Key distinction:
— Between-groups d ≠ within-subjects d for the same data
— Within-subjects designs typically yield larger d because they remove inter-individual variability
— Always check which version is reported before comparing across studies

Board pearl: When comparing trials of the same intervention, ensure effect sizes use the same metric. Mixing Hedges' g from one trial with Cohen's d_rm from another is apples-to-oranges
Meta-analyses standardize all studies to a common metric (typically Hedges' g for continuous outcomes) before pooling; Step 3 stems may test recognition of this conversion step
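To see why the paired variants can differ so much on the same data, the sketch below computes d_av (mean change over the average of the pre and post SDs) and d_z (mean change over the SD of the difference scores). The six pre/post scores are hypothetical and chosen so that the measurements are highly correlated; the function name is illustrative.

```python
from statistics import mean, stdev

def paired_effect_sizes(pre, post):
    """Return (d_av, d_z) for paired scores, with improvement = pre - post."""
    diffs = [a - b for a, b in zip(pre, post)]
    mean_diff = mean(diffs)
    d_av = mean_diff / ((stdev(pre) + stdev(post)) / 2.0)  # average of the two SDs
    d_z = mean_diff / stdev(diffs)                          # SD of difference scores
    return d_av, d_z

# Hypothetical symptom scores for six patients (lower is better)
pre = [20, 24, 18, 30, 26, 22]
post = [16, 21, 15, 25, 23, 18]
d_av, d_z = paired_effect_sizes(pre, post)
print(f"d_av = {d_av:.2f}, d_z = {d_z:.2f}")
```

Every patient here improves by a similar amount, so the difference scores have a small SD and d_z comes out several times larger than d_av, which is the apples-to-oranges risk the pearl describes.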
Key Differentials — Effect Sizes for Categorical and Risk Outcomes

Risk Difference (Absolute Risk Reduction, ARR):
— Difference in event proportions between groups
— Directly interpretable; basis for NNT (= 1/ARR)

Risk Ratio / Relative Risk (RR):
— Ratio of event probabilities
— Common in cohort studies and RCTs
— RR=1 → no effect; RR<1 → protective; RR>1 → harmful

Odds Ratio (OR):
— Ratio of odds; used in case-control studies and logistic regression
— Approximates RR only when outcome is rare (<10%)
— Overestimates effect when outcome is common

Hazard Ratio (HR):
— Effect size for time-to-event data (Cox regression)
— Assumes proportional hazards; check assumption

Number Needed to Treat (NNT):
— Clinically intuitive; depends on baseline risk
— NNT < 10 typically actionable; NNT > 100 may not justify cost/harm

Number Needed to Harm (NNH):
— Must be greater than NNT for therapy to be net beneficial
— Likelihood of Being Helped vs Harmed (LHH) = NNH/NNT

Phi (φ) coefficient: effect size for 2×2 tables; equivalent to Pearson r for binary data
Cramer's V: generalization of phi for larger contingency tables
Cohen's h: effect size for difference between two proportions; benchmarks 0.2/0.5/0.8
Cohen's w: effect size for chi-square goodness-of-fit

Key distinction:
— OR ≠ RR when outcomes are common; Step 3 frequently tests recognition that OR exaggerates effect for prevalent outcomes
— HR ≠ RR; HR is an instantaneous rate ratio, RR is a cumulative risk ratio

Board pearl: A trial reporting RR=0.5 but ARR=0.001 has a misleading-sounding "50% reduction" with NNT=1000; always demand the absolute effect size to assess clinical importance

Converting between effect size families:
— Cohen's d ↔ OR (Chinn): OR ≈ exp(1.81 × d)
— d=0.2 → OR≈1.44; d=0.5 → OR≈2.48; d=0.8 → OR≈4.27
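The relationships in this section can be checked from a 2×2 table. The sketch below computes ARR, RR, OR, and NNT for a rare and a common outcome, then applies the Chinn conversion from d to OR. The event counts are hypothetical, chosen so the rare-outcome case matches the RR = 0.5, ARR = 0.001 pearl, and the function names are illustrative.

```python
import math

def risk_measures(events_tx, n_tx, events_ctrl, n_ctrl):
    """ARR, RR, OR, and NNT from event counts in treatment and control arms."""
    risk_tx = events_tx / n_tx
    risk_ctrl = events_ctrl / n_ctrl
    arr = risk_ctrl - risk_tx
    return {
        "ARR": arr,
        "RR": risk_tx / risk_ctrl,
        "OR": (risk_tx / (1 - risk_tx)) / (risk_ctrl / (1 - risk_ctrl)),
        "NNT": 1 / arr if arr > 0 else math.inf,
    }

def d_to_or(d):
    """Chinn conversion: OR = exp(d * pi / sqrt(3))."""
    return math.exp(d * math.pi / math.sqrt(3))

# Hypothetical counts: rare outcome (0.1% vs 0.2%) and common outcome (40% vs 60%)
rare = risk_measures(10, 10_000, 20, 10_000)
common = risk_measures(400, 1_000, 600, 1_000)

print(f"rare:   RR = {rare['RR']:.2f}, OR = {rare['OR']:.2f}, NNT = {rare['NNT']:.0f}")
print(f"common: RR = {common['RR']:.2f}, OR = {common['OR']:.2f}, NNT = {common['NNT']:.0f}")
print(f"d = 0.5 -> OR = {d_to_or(0.5):.2f}")
```

For the rare outcome the OR and RR nearly coincide, while for the common outcome the OR sits further from 1 than the RR, so the same data look more impressive when reported as an odds ratio.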
Secondary Prevention — Applying Effect Sizes to Long-Term Management

Effect sizes guide long-term therapy decisions where benefit accrues over years

Time-to-benefit analyses:
— Statins for primary prevention: meaningful d emerges after ~2.5 years
— Tight glycemic control: microvascular benefit at 5–10 years; macrovascular even later
— Withhold long-term therapies when life expectancy is shorter than time-to-benefit

Cumulative effect size over time:
— Hazard ratios assume a constant relative effect; check proportional hazards
— Some interventions have early benefit that wanes; others show a legacy effect, where benefit persists or emerges after the intervention period ends (e.g., intensive early DM control)

Combined effect sizes in polypharmacy:
— Cardiovascular polypill components have individual effect sizes; combined effect is often less than additive due to overlapping pathways and adherence challenges

Adherence dilutes observed effect size:
— Trial efficacy (per-protocol d) > real-world effectiveness (intention-to-treat d)
— Discharge medications with poor adherence profiles lose effect size in practice

Discharge planning for therapies with modest d:
— Confirm patient understanding of expected benefit magnitude
— Set realistic expectations (e.g., "this lowers your risk from 12% to 9% over 5 years")
— Discuss when to stop if benefit unrealized or harms emerge

De-prescribing when effect size diminishes:
— Antihypertensives in patients with frailty and orthostasis (harm becomes more likely)
— Bisphosphonates after 5 years (drug holiday; the effect is sustained)
— Statins in last year of life (no time to realize benefit)

Step 3 management: On every discharge, document:
— Indication for each medication
— Expected benefit (NNT or absolute risk reduction)
— Duration of therapy
— Plan for reassessment

Board pearl: Long-term medications justified by small d on surrogate outcomes should be periodically re-evaluated; patients change, evidence evolves, and what was justified at 60 may not be at 85
CCS pearl: Add follow-up appointments specifically to reassess therapies with modest effect sizes; "lifelong" is a clinical decision, not a default
Follow-Up and Monitoring — Tracking Outcomes Against Effect Size Expectations

Patient-reported outcomes (PROs) should be tracked at intervals matched to expected effect emergence
— Antidepressants: PHQ-9 at 2, 4, 6 weeks
— Antihypertensives: BP at 2–4 weeks
— Statins: lipid panel at 4–12 weeks

MCID-based monitoring:
— Set targets based on MCID (e.g., ≥5 point PHQ-9 reduction)
— If patient does not meet MCID by expected timepoint, reassess diagnosis, adherence, dose, alternatives

Population-level monitoring:
— Practice-level effect sizes (panel HbA1c reduction, vaccination rates) for QI
— Compare local d to benchmarks; investigate gaps

Counseling patients about effect size:
— Use plain language and CLES: "Out of every 100 patients like you, this medicine helps about 15 more compared to no treatment"
— Visual aids (icon arrays) communicate absolute effects better than relative

Setting realistic expectations:
— Antidepressants for mild depression: d≈0.1–0.2 vs placebo → emphasize lifestyle, therapy
— Antidepressants for severe depression: d≈0.5+ → pharmacotherapy clearly indicated

Rehabilitation outcomes:
— Pulmonary rehab in COPD: d≈0.6 on dyspnea, QoL; clinically robust
— Cardiac rehab post-MI: d≈0.4–0.5 on exercise capacity, mortality benefit ~20% RRR

When to stop a therapy:
— Failure to achieve MCID after adequate trial duration
— Adverse effects exceed anticipated benefit
— Patient preference shift

Step 3 management: When monitoring response, document:
— Quantitative scale score (PHQ-9, BP, HbA1c)
— Comparison to baseline and to MCID
— Decision rationale

Board pearl: "Treatment failure" is best defined as inability to achieve the MCID within the expected timeframe, not absence of statistical change. Use clinically anchored thresholds in your follow-up notes
Telehealth and remote monitoring increasingly capture continuous outcome data; effect sizes can be tracked in near-real time at population scale
Ethical, Legal, and Patient Safety Considerations

Informed consent must include effect size in interpretable terms:
— Disclose absolute benefit (ARR, NNT), not just relative effects
— Quote specific numbers when available: "About 3 in 100 patients avoid a heart attack over 5 years"
— Failure to disclose a known magnitude of benefit is an ethical breach; courts have treated it as a failure of material disclosure
Misleading communication of effect sizes is a patient safety hazard:
— "50% reduction" framing without absolute context misleads patients into overestimating benefit
— Direct-to-consumer ads regulated by FDA but loopholes persist
Surrogate endpoint controversies:
— Accelerated FDA approval based on surrogate effect sizes (e.g., aducanumab for Alzheimer's) raises ethical debate about premature adoption when patient-important effect sizes are unproven
Research ethics:
— Trials must be adequately powered to detect clinically meaningful effects; underpowered studies expose subjects to risk without scientific yield
— Ethically required to publish negative results to prevent publication bias and inflated meta-analytic effect sizes
Equipoise and effect size:
— Clinical equipoise required for RCT enrollment; if prior evidence shows large d favoring one arm, randomization becomes unethical
— Stopping rules based on interim effect size estimates (DSMB review)
Transitions of care risk:
— Patients discharged on medications with small effect sizes may be confused about purpose
— Reconciliation should include indication and expected benefit; de-prescribe when effect is uncertain
Mandatory reporting:
— Adverse events meaningfully larger than trial-reported harms (i.e., real-world d for harm exceeds expected) should be reported to FDA MedWatch (voluntary for clinicians, mandatory for manufacturers)
Step 3 management: When obtaining consent for a procedure or therapy with modest d:
— Use plain-language absolute numbers
— Document patient understanding
— Offer alternatives including no treatment
Board pearl: Patient autonomy is honored only when patients receive quantitatively honest effect size information — vague "this will help you" violates informed consent standards
Equity: subgroup effect sizes informing decisions must not disadvantage groups underrepresented in trials
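The contrast between relative and absolute framing in the consent bullets above can be made concrete with a short Python sketch. The event rates (6% on control, 3% on treatment), the helper names, and the wording of the generated sentence are assumptions chosen for illustration; they are not trial data.

```python
import math


def absolute_benefit(control_risk: float, treated_risk: float) -> dict:
    """Return ARR, RRR, NNT, and patients helped per 100 (hypothetical helper)."""
    arr = control_risk - treated_risk  # absolute risk reduction
    if arr <= 0:
        raise ValueError("treatment does not lower risk; NNT is undefined")
    return {
        "arr": arr,
        "rrr": arr / control_risk,  # relative risk reduction
        # Round before ceil so floating-point noise cannot add one patient.
        "nnt": math.ceil(round(1 / arr, 6)),
        "helped_per_100": round(arr * 100, 1),
    }


def plain_language(control_risk: float, treated_risk: float,
                   outcome: str, horizon: str) -> str:
    """Phrase the absolute benefit first, with the relative figure as context."""
    b = absolute_benefit(control_risk, treated_risk)
    return (f"About {b['helped_per_100']:g} in 100 patients avoid {outcome} "
            f"over {horizon} (relative reduction {b['rrr']:.0%}, NNT {b['nnt']}).")


# Assumed rates: 6% event risk untreated, 3% treated.
print(plain_language(0.06, 0.03, "a heart attack", "5 years"))
```

With these assumed rates, the same data support both "50% reduction" and "3 in 100", which is why the notes insist on leading with the absolute figure.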
High-Yield Associations and Rapid-Fire Clinical Facts
Cohen's d benchmarks: 0.2 / 0.5 / 0.8 = small / medium / large
d = (M₁ − M₂) / SD_pooled
Hedges' g: small-sample-corrected d; use when n<50
Glass's Δ: uses control SD; preferred with unequal variances
d → NNT rough mapping: 0.2 → 9; 0.5 → 4; 0.8 → 3
d → CLES: 0.2 → 56%; 0.5 → 64%; 0.8 → 71% probability of superiority
d → OR (Chinn): OR ≈ exp(1.81 × d)
Sample size shortcut: n/group ≈ 16/d² (α=0.05, power=0.80)
MCID examples: HAM-D 3 pts, PHQ-9 5 pts, MMSE 1.4 pts, VAS pain 10–20 mm, 6MWT 30 m
"Half SD rule": MCID ≈ 0.5 SD (Norman) — d≈0.5 is a common MCID anchor
Statistical vs clinical significance: independent dimensions
OR exaggerates RR when outcome >10% prevalence
HR for time-to-event; assumes proportional hazards
Phi (φ) for 2×2 tables = Pearson r for binary data
ANOVA: Cohen's f benchmarks 0.1/0.25/0.4
Regression: Cohen's f² benchmarks 0.02/0.15/0.35
Pearson r benchmarks: 0.1/0.3/0.5 = small/medium/large
Within-subjects d typically > between-subjects d
Random-effects meta-analysis when I² >50%
Publication bias inflates pooled d; funnel plot + Egger's
Decline effect: replication shrinks initial large d
Spin: emphasize RRR, hide trivial ARR
Time-to-benefit > life expectancy → withhold even high-d therapy
Pre-registration prevents HARKing and selective outcome reporting
GRADE: large effect (RR <0.5 or >2) can upgrade observational evidence
Surrogate vs patient-important outcome: high d on surrogate ≠ benefit
Equivalence margin must be < MCID
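Fragility index: number of outcome events that would have to change to lose statistical significance; a low index means a fragile result, a high index a more robust one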
Board pearl: If you remember only one thing — pair every p-value with an effect size and a CI; never let either stand alone
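The conversions in the list above (d to CLES, NNT, and OR, plus the 16/d² sample-size shortcut) can be checked with a short Python sketch. It assumes normally distributed outcomes with equal variances. The NNT function uses the Kraemer-Kupfer conversion, which is an assumption about where the 9/4/3 values come from; it reproduces them when rounded up. Function names are illustrative.

```python
import math
from statistics import NormalDist

_phi = NormalDist().cdf  # standard normal cumulative distribution function


def d_to_cles(d: float) -> float:
    """Probability that a random treated patient does better than a random control."""
    return _phi(d / math.sqrt(2))


def d_to_nnt(d: float) -> float:
    """Kraemer-Kupfer style conversion: NNT = 1 / (2 * CLES - 1). Assumed source."""
    return 1 / (2 * d_to_cles(d) - 1)


def d_to_odds_ratio(d: float) -> float:
    """Chinn's approximation: ln(OR) = d * pi / sqrt(3), i.e. about 1.81 * d."""
    return math.exp(d * math.pi / math.sqrt(3))


def n_per_group(d: float) -> int:
    """Rule of thumb for two-sided alpha = 0.05 and power = 0.80."""
    return math.ceil(16 / d ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d={d}: CLES={d_to_cles(d):.0%}, NNT~{math.ceil(d_to_nnt(d))}, "
          f"OR~{d_to_odds_ratio(d):.2f}, n/group~{n_per_group(d)}")
```

The NNT values depend on the conversion chosen and on the control-group response rate, so they are best read as rough anchors rather than exact figures.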
Board Question Stem Patterns

Pattern 1: Megatrial, tiny effect
— Stem: "30,000 patients, SBP reduction 1.2 mmHg, p<0.001, Cohen's d=0.08"
— Question: "Most appropriate interpretation?"
— Answer: "Statistically significant but not clinically meaningful; do not adopt"
Pattern 2: Small pilot, large effect, wide CI
— Stem: "40 patients, d=1.2, 95% CI [0.2, 2.2], p=0.04"
— Question: "Best next step?"
— Answer: "Await replication in larger trials; results imprecise"
Pattern 3: OR vs RR confusion
— Stem: Common outcome (~30%), OR=2.5 reported
— Question: "What is the true relative risk?"
— Answer: RR < OR because outcome is common; OR overestimates (RR ≈ 1.7 if 30% is the control-group risk)
Pattern 4: RRR without ARR
— Stem: "Drug reduces risk of stroke by 40%" but baseline is 0.5%
— Question: "NNT?"
— Answer: ARR = 0.2%, NNT = 500; likely not cost-effective for low-risk patients
Pattern 5: Meta-analysis with high heterogeneity
— Stem: Pooled d=0.4, I²=75%
— Question: "Best interpretation?"
— Answer: Pooled estimate masks variability; investigate moderators
Pattern 6: Effect size meets MCID
— Stem: d=0.6, MCID=0.5 SD, narrow CI
— Answer: Clinically meaningful; appropriate to consider
Pattern 7: Power calculation
— Stem: "Expecting d=0.5, want power 0.80, α=0.05"
— Answer: ~64 per group (n ≈ 16/d²)
Pattern 8: Surrogate vs hard outcome
— Stem: Drug lowers LDL by huge d, mortality unchanged
— Answer: Do not assume CV benefit from LDL effect
Pattern 9: Subgroup analysis
— Stem: Overall d=0.2, in women d=0.7
— Answer: Hypothesis-generating unless pre-specified
Pattern 10: Underpowered negative trial
— Stem: n=50, observed d=0.3, p=0.12
— Answer: Failed to reach significance, but the effect size is compatible with a real effect; a nonsignificant result in a small study is not evidence of no effect; need larger study
Board pearl: Step 3 biostatistics questions often include an answer choice that says "statistically significant but not clinically meaningful"; in megatrial stems with a tiny effect size, this is usually the correct choice
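The arithmetic behind Patterns 3 and 4 can be worked through in Python. This is a sketch under two assumptions: the OR-to-RR step uses the Zhang and Yu approximation (a technique named here, not one the notes cite), and the "~30%" in the Pattern 3 stem is read as the risk in the unexposed group. Function names are illustrative.

```python
import math


def or_to_rr(odds_ratio: float, control_risk: float) -> float:
    """Approximate RR from an OR and the control-group risk (Zhang-Yu formula)."""
    return odds_ratio / (1 - control_risk + control_risk * odds_ratio)


def nnt_from_rrr(baseline_risk: float, rrr: float) -> tuple[float, int]:
    """Return (ARR, NNT) given a baseline risk and a relative risk reduction."""
    arr = baseline_risk * rrr
    # Round before ceil so floating-point noise cannot add one patient.
    return arr, math.ceil(round(1 / arr, 6))


# Pattern 3: OR = 2.5 with an assumed 30% risk in the unexposed group.
print(f"RR ~ {or_to_rr(2.5, 0.30):.2f}")

# Pattern 4: 40% relative reduction from a 0.5% baseline risk.
arr, nnt = nnt_from_rrr(0.005, 0.40)
print(f"ARR = {arr:.1%}, NNT = {nnt}")
```

The same OR of 2.5 would sit much closer to the RR if the outcome were rare, which is why the stem has to state that the outcome is common.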
One-Line Recap

Cohen's d standardizes the magnitude of a between-group mean difference in SD units (small 0.2 / medium 0.5 / large 0.8), and must be paired with confidence intervals, the minimal clinically important difference, and patient-centered outcomes to translate statistical results into sound clinical decisions.

Core formula: d = (M₁ − M₂) / SD_pooled; Hedges' g for small samples; Glass's Δ for unequal variances; within-subjects designs typically yield larger d than between-subjects
Statistical vs clinical significance are independent: p-values reflect sample size and precision; effect sizes reflect magnitude. A megatrial can produce p<0.001 with d=0.05 (trivial); a pilot can show d=1.0 with p=0.10 (promising but imprecise). Always demand both
Translate to patient terms: convert d to NNT (≈9/4/3 for d=0.2/0.5/0.8) and to common-language probability of superiority (56%/64%/71%) for shared decision-making; use n per group ≈ 16/d² for power calculations, not for NNT; compare point estimate and lower CI bound to the MCID before changing practice
Step 3 traps to avoid: mistaking RRR for absolute benefit, applying OR as if it were RR for common outcomes, adopting therapies based on single inflated initial trials (decline effect), ignoring heterogeneity in meta-analyses (I²>50%), and extending trial effect sizes to elderly/pregnant/underrepresented patients without acknowledging external validity limits
Board pearl: When a stem provides both p-value and effect size, the test writer wants you to weight the effect size — and when it provides a megatrial with a tiny d, the right answer is almost always "statistically significant but not clinically meaningful; do not change management"
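The core formula in the recap can be sketched from raw scores in Python. The two small samples are made-up numbers for illustration, and the Hedges' g factor shown is the common approximation 1 − 3/(4·df − 1), which is an assumption about which correction a given paper uses.

```python
import math
from statistics import mean, stdev


def cohens_d(treated: list[float], control: list[float]) -> float:
    """Standardized mean difference using the pooled SD."""
    n1, n2 = len(treated), len(control)
    s1, s2 = stdev(treated), stdev(control)  # sample SDs (n - 1 denominator)
    pooled_sd = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(treated) - mean(control)) / pooled_sd


def hedges_g(treated: list[float], control: list[float]) -> float:
    """Cohen's d shrunk by the approximate small-sample correction factor."""
    df = len(treated) + len(control) - 2
    return cohens_d(treated, control) * (1 - 3 / (4 * df - 1))


def glass_delta(treated: list[float], control: list[float]) -> float:
    """Mean difference scaled by the control-group SD only."""
    return (mean(treated) - mean(control)) / stdev(control)


# Made-up scores, five patients per arm.
treated = [12, 14, 16, 18, 20]
control = [10, 12, 14, 16, 18]
print(f"d = {cohens_d(treated, control):.2f}")
print(f"g = {hedges_g(treated, control):.2f}")
print(f"delta = {glass_delta(treated, control):.2f}")
```

With five patients per arm the correction pulls g noticeably below d, which illustrates why the notes recommend Hedges' g for small samples.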