Biostatistics & Population Health
Effect size measures: Cohen's d and clinical interpretation
— A trial with n=50,000 can produce p<0.001 for a 1 mmHg BP reduction — statistically significant, clinically trivial
— A small pilot study may show a large effect that fails to reach significance due to underpowering
— Formula: d = (M₁ − M₂) / SD_pooled
— Unitless, allowing comparison across studies measuring different scales (e.g., HAM-D vs PHQ-9 for depression)
— Interpreting trial results where p-value alone is given but clinical meaningfulness is questioned
— Meta-analyses pooling outcomes across heterogeneous instruments
— Comparing two interventions both "statistically significant" but with different magnitudes
— Sample size / power calculations during study design
— Mean differences: Cohen's d, Hedges' g (small-sample correction), Glass's Δ
— Correlations: Pearson r, R²
— Categorical/risk: odds ratio, risk ratio, NNT, phi coefficient
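
The gap between statistical and clinical significance described above can be checked numerically. This is a minimal sketch, not a full analysis: it assumes an outcome SD of 15 mmHg, equal arms, and a two-sample z-test as a large-sample approximation. None of those figures come from the notes, and the helper name `z_test_two_means` is illustrative.

```python
import math
from statistics import NormalDist

def z_test_two_means(diff, sd, n_per_group):
    """Two-sided z-test for a difference in means (equal SD, equal n).

    Returns (Cohen's d, two-sided p-value). Uses a normal approximation,
    which is rough for very small samples.
    """
    d = diff / sd                               # standardized effect size
    se = sd * math.sqrt(2.0 / n_per_group)      # standard error of the difference
    z = diff / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return d, p

# Assumed megatrial: 1 mmHg difference, SD 15 mmHg, 25,000 per arm (n = 50,000).
d_big, p_big = z_test_two_means(1.0, 15.0, 25_000)

# Assumed pilot: 8 mmHg difference, same SD, 10 per arm.
d_small, p_small = z_test_two_means(8.0, 15.0, 10)

print(f"megatrial: d={d_big:.3f}, p={p_big:.2g}")
print(f"pilot:     d={d_small:.3f}, p={p_small:.2g}")
```

Under these assumptions the megatrial pairs a trivial d with a very small p, while the pilot pairs a medium d with a nonsignificant p.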

— "A trial of drug X vs placebo for depression shows mean HAM-D reduction of 2 points (p=0.01, Cohen's d=0.18)..."
— "A new antihypertensive lowers SBP by 1.5 mmHg compared to standard care in 30,000 patients (p<0.001)..."
— "Cognitive behavioral therapy vs waitlist for anxiety: d=0.85, p=0.04, n=40..."
— Meta-analysis forest plots with pooled standardized mean differences (SMDs)
— Sample size (n) — large n inflates statistical significance even for trivial effects
— Standard deviation of the outcome — needed to contextualize raw differences
— Outcome scale (HAM-D, MMSE, VAS pain, BP, HbA1c) — different scales require standardization
— MCID — sometimes given explicitly ("a 3-point change on HAM-D is clinically meaningful")
— Confidence interval around the effect size
— d ≈ 0.2: small effect
— d ≈ 0.5: medium effect
— d ≈ 0.8: large effect
— d > 1.2: very large; d < 0.1: trivial
— Huge sample sizes with tiny absolute differences
— Significant p-values where the raw difference is below the MCID
— Multiple "positive" trials with conflicting magnitudes

— d = 0.2: ~85% overlap between group distributions — most patients in treatment group look like most in control
— d = 0.5: ~67% overlap — noticeable but substantial overlap
— d = 0.8: ~53% overlap — distributions clearly separated but still meaningful overlap
— d = 2.0: ~19% overlap; minimal overlap, dramatic separation (same 1 − U₁ convention as the rows above; the overlapping coefficient would instead give ~32%)
— d = 0.2 → 56% chance a random treated patient does better than a random control
— d = 0.5 → 64%
— d = 0.8 → 71%
— Useful for explaining trial results to patients in plain language
— X-axis often labeled SMD or Hedges' g
— Diamond crossing zero = no significant pooled effect
— Width of diamond = 95% CI; narrow = precise
— Heterogeneity (I²) >50% means pooled d may mask varying true effects
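
The overlap and "probability of superiority" figures above follow from the normal curve. Two definitions of overlap are in common use and give different percentages, so this sketch computes both, assuming normally distributed outcomes with equal variance in the two groups. The function names are illustrative.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def overlap_u1(d):
    """Overlap defined as 1 - U1, where U1 is Cohen's non-overlap measure."""
    p = PHI(abs(d) / 2.0)
    u1 = (2.0 * p - 1.0) / p
    return 1.0 - u1

def overlap_ovl(d):
    """Overlapping coefficient: shared area under two equal-variance normal curves."""
    return 2.0 * PHI(-abs(d) / 2.0)

def cles(d):
    """Common language effect size: P(random treated patient beats random control)."""
    return PHI(d / 2.0 ** 0.5)

for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d={d}: 1-U1={overlap_u1(d):.0%}, "
          f"OVL={overlap_ovl(d):.0%}, CLES={cles(d):.0%}")
```

When quoting an overlap percentage, it helps to state which definition was used, since the two diverge as d grows.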

— d = (M₁ − M₂) / SD_pooled
— SD_pooled = √[((n₁−1)SD₁² + (n₂−1)SD₂²) / (n₁+n₂−2)]
— Treatment: mean HAM-D reduction = 10, SD = 6, n=100
— Placebo: mean HAM-D reduction = 7, SD = 5, n=100
— SD_pooled ≈ √[(99×36 + 99×25)/198] = √30.5 ≈ 5.52
— d = (10−7)/5.52 ≈ 0.54 → medium effect
— Hedges' g = d × [1 − 3/(4(n₁+n₂)−9)]
— Use when total n < 50; otherwise g ≈ d
— Glass's Δ = (M₁ − M₂) / SD_control: preferred when treatment alters variability (e.g., reduces variance) or when groups have unequal SDs
— Two proportions: risk difference, RR, OR, NNT, phi (φ)
— Correlations: Pearson r (r=0.1 small, 0.3 medium, 0.5 large) or R²
— ANOVA: eta-squared (η²) or omega-squared (ω²)
— d ≈ 2r / √(1−r²) and r ≈ d / √(d² + 4)
— d to OR (Chinn formula): OR ≈ exp(d × π/√3) ≈ exp(1.81d)
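
The worked HAM-D example and the conversion formulas above can be reproduced with a few small functions. This is a sketch using only the formulas stated in the list; the function names are illustrative, and the d-to-r conversion assumes equal group sizes.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d: mean difference in pooled-SD units."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)

def hedges_g(d, n1, n2):
    """Small-sample correction of d (approximate correction factor)."""
    return d * (1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0))

def d_to_r(d):
    """Convert d to a correlation, assuming equal group sizes."""
    return d / math.sqrt(d**2 + 4.0)

def d_to_odds_ratio(d):
    """Chinn conversion: ln(OR) = d * pi / sqrt(3)."""
    return math.exp(d * math.pi / math.sqrt(3.0))

# Treatment: mean 10, SD 6, n=100. Placebo: mean 7, SD 5, n=100.
d = cohens_d(10, 6, 100, 7, 5, 100)
g = hedges_g(d, 100, 100)
print(f"SD_pooled={pooled_sd(6, 100, 5, 100):.2f}, d={d:.2f}, g={g:.2f}")
print(f"r={d_to_r(d):.2f}, OR={d_to_odds_ratio(d):.2f}")
```

With 100 patients per arm the correction factor is close to 1, so g and d agree to two decimals.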

— HAM-D: ~3 points
— PHQ-9: ~5 points
— MMSE: ~1.4 points (for Alzheimer's trials)
— VAS pain (0–100): ~10–20 mm
— 6-minute walk test: ~30 m
— Anchor-based: tie change scores to a patient-reported global rating
— Distribution-based: half a SD (≈ d of 0.5) is often the MCID — Norman's "magical half SD"
— NNT = 1 / absolute risk reduction (ARR)
— Cohen's d can be transformed to NNT via Kraemer's formulas; rough mapping:
— d = 0.2 → NNT ≈ 9
— d = 0.5 → NNT ≈ 4
— d = 0.8 → NNT ≈ 3
— A d of 0.6 [0.55, 0.65] from a megatrial may be more actionable than d of 1.2 [0.3, 2.1] from a pilot
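
The rough d-to-NNT mapping above can be reproduced with the conversion usually attributed to Kraemer and Kupfer, NNT = 1 / (2·AUC − 1) with AUC = Φ(d/√2). This sketch assumes normally distributed outcomes with equal variances; the function name is illustrative, and the result is rounded up, as is conventional for NNT.

```python
import math
from statistics import NormalDist

def nnt_from_d(d):
    """Approximate NNT from Cohen's d via the area under the curve (AUC).

    AUC is the probability that a random treated patient has a better
    outcome than a random control; 2*AUC - 1 is treated as the
    "success rate difference".
    """
    auc = NormalDist().cdf(d / math.sqrt(2.0))
    return 1.0 / (2.0 * auc - 1.0)

for d in (0.2, 0.5, 0.8):
    print(f"d={d}: NNT ~ {math.ceil(nnt_from_d(d))}")
```

An NNT derived this way is not interchangeable with an NNT computed from observed event rates (1/ARR), because it depends on the distributional assumptions.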

— Magnitude: Is d at least small (≥0.2)? Does it cross MCID?
— Precision: Is the CI narrow? Does the lower bound remain clinically meaningful?
— Outcome importance: Mortality > major morbidity > QoL > surrogate
— Harm profile: Even d=0.8 fails if NNH < NNT
— Cost and access: Real-world feasibility
— RCT with low risk of bias and pre-registered analysis: high trust
— Open-label trial with subjective outcome: inflated d likely
— Observational study: residual confounding may inflate d
— Meta-analysis: trustworthy only if heterogeneity is low and publication bias addressed
— Public health interventions reaching millions (smoking cessation, vaccination)
— Cheap, safe interventions (e.g., daily aspirin in narrow indications)
— Prevention of catastrophic but rare outcomes
— Expensive therapies (e.g., $100K/year biologics)
— Treatments with serious adverse effects
— Invasive procedures
— Use shared decision-making, especially when d is small and harms are non-trivial
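
For the precision question in the list above, an approximate confidence interval for d can be computed from the group sizes alone. This sketch uses a common large-sample standard error for d. The sample sizes in the example are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate CI for Cohen's d using a large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2.0 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return d - z * se, d + z * se

# Assumed pilot: d = 0.8 with 20 patients per arm.
lo_pilot, hi_pilot = d_confidence_interval(0.8, 20, 20)

# Assumed large trial: d = 0.4 with 4,000 patients per arm.
lo_large, hi_large = d_confidence_interval(0.4, 4000, 4000)

print(f"pilot: [{lo_pilot:.2f}, {hi_pilot:.2f}]")
print(f"large: [{lo_large:.2f}, {hi_large:.2f}]")
```

Under these assumptions the pilot interval includes effects smaller than the small benchmark of 0.2, while the large-trial interval stays close to its point estimate.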

— p<0.05 with d=0.05 in n=100,000 → trivial clinically
— Fix: always pair p with effect size and CI
— d=0.8 with CI [0.1, 1.5] is unstable; d=0.4 with CI [0.35, 0.45] is robust
— d for antidepressants in inpatients with severe MDD ≠ d in primary care mild MDD
— Heterogeneity in baseline severity inflates or deflates d
— Field-specific norms exist: educational interventions often have d ~ 0.4; oncology pharmacotherapy endpoints can be much larger (d > 1.5 for some targeted therapies)
— Means without SDs are uninterpretable
— d on a 100-item symptom checklist may be inflated; d on hard mortality may seem small but reflect lives saved
— d describes magnitude; interaction terms describe modification
— High I² (>50%) means a single pooled d misleads; use random-effects and explore moderators
— Funnel plot asymmetry, Egger's test → consider trim-and-fill correction
— Niacin improved lipid surrogates (raised HDL, lowered LDL and triglycerides; large d on surrogate) but failed to reduce CV events when added to statin therapy
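
The heterogeneity point above (a high I² undermining a single pooled d) can be illustrated with a fixed-effect inverse-variance pool, Cochran's Q, and I² = (Q − df)/Q. The study effects and variances below are hypothetical, and the function name is illustrative. A real meta-analysis with high I² would normally move on to a random-effects model.

```python
def pool_fixed_effect(effects, variances):
    """Inverse-variance pooled effect, Cochran's Q, and I-squared."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0  # truncated at zero
    return pooled, q, i2

# Hypothetical standardized mean differences and their variances.
effects = [0.1, 0.3, 0.8, 0.6]
variances = [0.01, 0.02, 0.02, 0.04]

pooled, q, i2 = pool_fixed_effect(effects, variances)
print(f"pooled d={pooled:.2f}, Q={q:.1f}, I2={i2:.0%}")
```

In this made-up data set the pooled d looks like a small-to-medium effect, yet most of the observed variation reflects differences between studies, not chance.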

— n per group ≈ 16 / d²
— d=0.2 → ~400/group; d=0.5 → ~64/group; d=0.8 → ~25/group
— Risk of type II error (false negative)
— When they do find significance, effect size is often inflated (winner's curse)
— Detect trivial differences as "significant"
— Solution: pre-specify MCID and design for clinical meaningfulness, not just p<0.05
— Prior trials of similar interventions
— Pilot studies (cautious — wide CIs)
— Clinically meaningful difference / SD ratio from established MCIDs
— Pre-specify a non-inferiority margin (Δ) — often expressed as a fraction of historical effect size
— Margin must be smaller than MCID
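
The n ≈ 16/d² shortcut above is a rounding of the normal-approximation formula n = 2(z₁₋α/₂ + z₁₋β)² / d² for two-sided α = 0.05 and 80% power. This sketch compares the two; it ignores the small t-distribution correction that dedicated power software applies, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sample comparison (normal approximation)."""
    z = NormalDist().inv_cdf
    n = 2.0 * (z(1.0 - alpha / 2.0) + z(power)) ** 2 / d**2
    return math.ceil(n)

def n_rule_of_thumb(d):
    """Shortcut: n per group is roughly 16 / d^2."""
    return round(16.0 / d**2)

for d in (0.2, 0.5, 0.8):
    print(f"d={d}: formula={n_per_group(d)}, shortcut={n_rule_of_thumb(d)}")
```

The shortcut slightly overestimates the normal-approximation result, which makes it a conservative figure for quick planning.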

— Mean participant age in cardiovascular RCTs: ~62; mean age of clinic patients: ~75
— Effect size measured in trial may shrink or invert in older patients
— Statin primary prevention trials show d declines with age; absolute benefit may persist due to higher baseline risk
— A treatment with d=0.6 for CV mortality may show smaller observed benefit if patients die of other causes first
— Floor effects in mild disease shrink d (no room to improve)
— Ceiling effects in severe disease also shrink d (irreversible damage)
— Drug exposure increases → effect size on efficacy AND adverse events both rise
— Risk-benefit calculation requires comparing NNT vs NNH at adjusted exposures
— Frail patients often show smaller benefit and larger harm
— Subgroup-specific effect sizes recommended for treatments in elderly
— Time-to-benefit vs life expectancy
— Patient priorities (function, independence vs longevity)
— De-prescribing when d is small and harm is incremental

— d benchmarks similar (0.2/0.5/0.8) but absolute interpretation differs — small d on developmental trajectory can have lifelong consequences
— Most therapeutic trials exclude pregnant patients → effect sizes typically extrapolated from non-pregnant populations
— Pharmacokinetic changes in pregnancy (volume of distribution, GFR) may alter effect size; dosing studies use pregnancy-specific d estimates
— RCT effect sizes often derived from predominantly white, male populations
— Generalizability to Black, Hispanic, Asian populations should be questioned
— BiDil (isosorbide dinitrate/hydralazine) is a historical example: d for HF mortality was large in self-identified Black patients, smaller in overall population
— Aspirin primary prevention: stronger d for stroke prevention in women, MI prevention in men
— Antidepressants: some evidence of differential d by sex/menstrual cycle phase
— Effect modification by genetic ancestry (e.g., warfarin dosing in CYP2C9/VKORC1 variants) — pharmacogenomic d varies
— Differs from adult MCID; PedsQL, CHAQ scales have validated thresholds
— Effect sizes on growth/development outcomes weighted heavily
— Acknowledge uncertainty explicitly
— Engage pediatric specialty consult or MFM
— Document shared decision-making

— First trial of a new intervention often shows large d
— Replication studies show smaller d as bias and selection effects wash out
— Implication: do not adopt practice based on single large-d study
— Small negative studies remain unpublished
— Funnel plot asymmetry, Egger's regression detect this
— Trim-and-fill estimates what the pooled d would be after correcting for the bias
— Multiple outcomes tested, only the largest d reported
— Pre-registration of analysis plans mitigates this
— Post hoc subgroup with large d reframed as primary hypothesis
— Emphasizing relative risk reduction while obscuring a small absolute effect
— A 50% RRR sounds dramatic; if risk falls from 0.2% to 0.1%, the ARR is only 0.1% (NNT = 1,000)
— Cost and access displacement of higher-value care
— Adverse effects without commensurate benefit
— Overdiagnosis and overtreatment cascades
— Pooled d=0.3 may hide d=0.8 in responders and d=−0.2 in non-responders
— Without subgroup identification, all patients are treated but only some benefit
— Demand replication
— Examine effect size CIs across multiple trials
— Verify outcomes are patient-centered, not surrogate
— Wait for guideline endorsement when d is modest
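
The relative-versus-absolute framing problem above is simple arithmetic: the same relative risk reduction yields very different absolute benefit depending on baseline risk. In this sketch the 20% high-risk baseline is an assumed contrast case, and the function name is illustrative.

```python
def absolute_effects(baseline_risk, rrr):
    """Return (treated risk, absolute risk reduction, NNT) for a given RRR."""
    treated_risk = baseline_risk * (1.0 - rrr)
    arr = baseline_risk - treated_risk
    nnt = 1.0 / arr
    return treated_risk, arr, nnt

# Low-risk population: baseline 0.2%, 50% relative risk reduction.
_, arr_low, nnt_low = absolute_effects(0.002, 0.50)

# Assumed high-risk population: baseline 20%, same relative risk reduction.
_, arr_high, nnt_high = absolute_effects(0.20, 0.50)

print(f"low risk:  ARR={arr_low:.2%}, NNT={nnt_low:.0f}")
print(f"high risk: ARR={arr_high:.2%}, NNT={nnt_high:.0f}")
```

Reporting the ARR or NNT alongside the RRR makes this dependence on baseline risk visible.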

— Designing a study with non-standard outcome (multilevel, longitudinal, time-to-event)
— Interpreting conflicting effect sizes across trials
— Conducting or appraising a meta-analysis with significant heterogeneity
— Sample size calculation for rare outcomes or composite endpoints
— Adaptive trial design, Bayesian analyses
— Is effect size reported alongside p-value?
— Is the CI provided?
— Is the MCID stated and justified?
— Is the outcome patient-important?
— Are subgroup effect sizes pre-specified?
— Is heterogeneity (I²) reported in meta-analyses?
— Is publication bias assessed?
— Pre-register hypotheses and analysis plans
— Define MCID a priori
— Avoid post hoc outcome shopping
— PICO question → study design → effect size + CI → MCID → applicability → harms → decision
— GRADE methodology rates evidence quality partly on effect size magnitude and precision
— Strong recommendations require larger, more precise effects
— Awaiting meta-analytic confirmation
— Reviewing professional society guidance
— Engaging local pharmacy and therapeutics committee

— Between-groups d ≠ within-subjects d for the same data
— Within-subjects designs typically yield larger d due to removing inter-individual variability
— Always check which version is reported before comparing across studies
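
Why the within-subjects version is usually larger can be seen from the SD of paired differences, SD_diff = SD × √(2(1 − r)), where r is the correlation between paired measurements. This sketch assumes equal SDs at both time points; the numbers and function names are illustrative.

```python
import math

def between_style_d(mean_change, sd):
    """Mean change divided by the SD of raw scores."""
    return mean_change / sd

def within_subject_d(mean_change, sd, r):
    """Mean change divided by the SD of paired differences.

    Assumes equal SDs at both time points and a pre-post correlation r.
    """
    sd_diff = sd * math.sqrt(2.0 * (1.0 - r))
    return mean_change / sd_diff

# Assumed data: mean change of 3 points, SD 6 at each time point.
print(f"raw-SD d:           {between_style_d(3, 6):.2f}")
print(f"paired d (r = 0.5): {within_subject_d(3, 6, 0.5):.2f}")
print(f"paired d (r = 0.8): {within_subject_d(3, 6, 0.8):.2f}")
```

The two versions coincide only at r = 0.5; with the higher correlations typical of repeated measures, the paired version is larger for the same raw change.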

— Difference in event proportions between groups
— Directly interpretable; basis for NNT (= 1/ARR)
— Ratio of event probabilities
— Common in cohort studies and RCTs
— RR=1 → no effect; RR<1 → protective; RR>1 → harmful
— Ratio of odds; used in case-control studies and logistic regression
— Approximates RR only when outcome is rare (<10%)
— Overestimates effect when outcome is common
— Effect size for time-to-event data (Cox regression)
— Assumes proportional hazards; check assumption
— Clinically intuitive; depends on baseline risk
— NNT < 10 typically actionable; NNT > 100 may not justify cost/harm
— NNH must be greater than NNT for therapy to be net beneficial
— Likelihood of Being Helped vs Harmed (LHH) = NNH/NNT
— OR ≠ RR when outcomes are common — Step 3 frequently tests recognition that OR exaggerates effect for prevalent outcomes
— HR ≠ RR — HR is instantaneous rate ratio, RR is cumulative risk ratio
— Cohen's d ↔ OR (Chinn): OR ≈ exp(1.81 × d)
— d=0.2 → OR≈1.44; d=0.5 → OR≈2.48; d=0.8 → OR≈4.27
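
The measures in this list can all be derived from the same two-by-two counts, which also shows how the OR drifts away from the RR as the outcome becomes common. The trial counts below are hypothetical and the function names are illustrative.

```python
def risk_measures(events_tx, n_tx, events_ctl, n_ctl):
    """ARR, RR, OR, and NNT from event counts in treatment and control arms."""
    risk_tx = events_tx / n_tx
    risk_ctl = events_ctl / n_ctl
    arr = risk_ctl - risk_tx
    rr = risk_tx / risk_ctl
    odds_ratio = (risk_tx / (1.0 - risk_tx)) / (risk_ctl / (1.0 - risk_ctl))
    nnt = 1.0 / arr if arr != 0 else float("inf")
    return {"ARR": arr, "RR": rr, "OR": odds_ratio, "NNT": nnt}

def lhh(nnt, nnh):
    """Likelihood of being helped vs harmed."""
    return nnh / nnt

# Hypothetical rare outcome: 1% vs 2% event rate.
rare = risk_measures(10, 1000, 20, 1000)

# Hypothetical common outcome: 30% vs 60% event rate.
common = risk_measures(300, 1000, 600, 1000)

print(f"rare:   RR={rare['RR']:.2f}, OR={rare['OR']:.2f}, NNT={rare['NNT']:.0f}")
print(f"common: RR={common['RR']:.2f}, OR={common['OR']:.2f}, NNT={common['NNT']:.1f}")
print(f"LHH with NNT=20, NNH=100: {lhh(20, 100):.0f}")
```

Both scenarios share the same RR, but only in the rare-outcome case does the OR sit close to it.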

— Statins for primary prevention: meaningful d emerges after ~2.5 years
— Tight glycemic control: microvascular benefit at 5–10 years; macrovascular even later
— Withhold long-term therapies when life expectancy is shorter than time-to-benefit
— Hazard ratios assume constant effect — check proportional hazards
— Some interventions have early benefit that wanes; others show a legacy effect, with benefit persisting or emerging after the intervention ends (e.g., intensive early DM control)
— Cardiovascular polypill components have individual effect sizes; combined effect is often less than additive due to overlapping pathways and adherence challenges
— Trial efficacy (ideal conditions, high adherence) usually exceeds real-world effectiveness; within a trial, per-protocol d is typically larger than intention-to-treat d
— Discharge medications with poor adherence profiles lose effect size in practice
— Confirm patient understanding of expected benefit magnitude
— Set realistic expectations (e.g., "this lowers your risk from 12% to 9% over 5 years")
— Discuss when to stop if benefit unrealized or harms emerge
— Antihypertensives in patients with frailty and orthostasis (NNH rises)
— Bisphosphonates after 5 years (drug holiday — sustained effect)
— Statins in last year of life (no time to realize benefit)
— Indication for each medication
— Expected benefit (NNT or absolute risk reduction)
— Duration of therapy
— Plan for reassessment

— Antidepressants: PHQ-9 at 2, 4, 6 weeks
— Antihypertensives: BP at 2–4 weeks
— Statins: lipid panel at 4–12 weeks
— Set targets based on MCID (e.g., ≥5 point PHQ-9 reduction)
— If patient does not meet MCID by expected timepoint, reassess diagnosis, adherence, dose, alternatives
— Practice-level effect sizes (panel HbA1c reduction, vaccination rates) for QI
— Compare local d to benchmarks; investigate gaps
— Use plain language and natural frequencies (an absolute difference, as distinct from CLES): "Out of every 100 patients like you, this medicine helps about 15 more compared to no treatment"
— Visual aids (icon arrays) communicate absolute effects better than relative
— Antidepressants for mild depression: d≈0.1–0.2 vs placebo → emphasize lifestyle, therapy
— Antidepressants for severe depression: d≈0.5+ → pharmacotherapy clearly indicated
— Pulmonary rehab in COPD: d≈0.6 on dyspnea, QoL — clinically robust
— Cardiac rehab post-MI: d≈0.4–0.5 on exercise capacity, mortality benefit ~20% RRR
— Failure to achieve MCID after adequate trial duration
— Adverse effects exceed anticipated benefit
— Patient preference shift
— Quantitative scale score (PHQ-9, BP, HbA1c)
— Comparison to baseline and to MCID
— Decision rationale

— Disclose absolute benefit (ARR, NNT), not just relative effects
— Quote specific numbers when available: "About 3 in 100 patients avoid a heart attack over 5 years"
— Failure to disclose a known magnitude of benefit is an ethical breach and may be treated as a failure of material disclosure under informed-consent standards
— "50% reduction" framing without absolute context misleads patients into overestimating benefit
— Direct-to-consumer ads regulated by FDA but loopholes persist
— Accelerated FDA approval based on surrogate effect sizes (e.g., aducanumab for Alzheimer's) — ethical debate about premature adoption when patient-important effect sizes are unproven
— Trials must be adequately powered to detect clinically meaningful effects — underpowered studies expose subjects to risk without scientific yield
— Ethically required to publish negative results to prevent publication bias and inflated meta-analytic effect sizes
— Clinical equipoise required for RCT enrollment — if prior evidence shows large d favoring one arm, randomization becomes unethical
— Stopping rules based on interim effect size estimates (DSMB review)
— Patients discharged on medications with small effect sizes may be confused about purpose
— Reconciliation should include indication and expected benefit; de-prescribe when effect is uncertain
— Adverse events meaningfully larger than trial-reported harms (i.e., real-world d for harm exceeds expected) should be reported to FDA MedWatch (voluntary for clinicians, mandatory for manufacturers)
— Use plain-language absolute numbers
— Document patient understanding
— Offer alternatives including no treatment


— Stem: "30,000 patients, SBP reduction 1.2 mmHg, p<0.001, Cohen's d=0.08"
— Question: "Most appropriate interpretation?"
— Answer: "Statistically significant but not clinically meaningful; do not adopt"
— Stem: "40 patients, d=1.2, 95% CI [0.2, 2.2], p=0.04"
— Question: "Best next step?"
— Answer: "Await replication in larger trials; results imprecise"
— Stem: Common outcome (~30%), OR=2.5 reported
— Question: "What is the true relative risk?"
— Answer: RR < OR because outcome is common; OR overestimates
— Stem: "Drug reduces risk of stroke by 40%" but baseline is 0.5%
— Question: "NNT?"
— Answer: ARR = 0.2%, NNT = 500 — likely not cost-effective for low-risk patients
— Stem: Pooled d=0.4, I²=75%
— Question: "Best interpretation?"
— Answer: Pooled estimate masks variability; investigate moderators
— Stem: d=0.6, MCID=0.5 SD, narrow CI
— Answer: Clinically meaningful; appropriate to consider
— Stem: "Expecting d=0.5, want power 0.80, α=0.05"
— Answer: ~64 per group (n ≈ 16/d²)
— Stem: Drug lowers LDL by huge d, mortality unchanged
— Answer: Do not assume CV benefit from LDL effect
— Stem: Overall d=0.2, in women d=0.7
— Answer: Hypothesis-generating unless pre-specified
— Stem: n=50, observed d=0.3, p=0.12
— Answer: Not statistically significant, but absence of significance is not evidence of no effect; the study is underpowered for d=0.3 (n ≈ 16/d² ≈ 180 per group), so a larger study is needed
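
For the stem with a common outcome and OR = 2.5, the size of the overestimate can be worked out with the conversion RR = OR / (1 − P₀ + P₀ × OR), where P₀ is the risk in the unexposed or control group. This sketch assumes the ~30% in the stem is that control-group risk, which the stem does not state explicitly; the function name is illustrative.

```python
def or_to_rr(odds_ratio, control_risk):
    """Convert an odds ratio to a risk ratio given the control-group risk."""
    return odds_ratio / (1.0 - control_risk + control_risk * odds_ratio)

# Stem values: OR = 2.5, outcome risk assumed to be 30% in the control group.
rr_common = or_to_rr(2.5, 0.30)

# Same OR with a rare outcome (1% control risk) for comparison.
rr_rare = or_to_rr(2.5, 0.01)

print(f"common outcome: RR ~ {rr_common:.2f}")
print(f"rare outcome:   RR ~ {rr_rare:.2f}")
```

With a common outcome the RR is well below the reported OR, while with a rare outcome the two nearly coincide.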

Cohen's d standardizes the magnitude of a between-group mean difference in SD units (small 0.2 / medium 0.5 / large 0.8), and must be paired with confidence intervals, the minimal clinically important difference, and patient-centered outcomes to translate statistical results into sound clinical decisions.

