Biostatistics & Population Health
Subgroup analysis pitfalls and interpretation
— Overall trial result is null but a "positive" subgroup is highlighted (post-hoc rescue).
— Overall trial is positive but a single subgroup appears not to benefit (post-hoc exclusion).
— Many subgroups tested without adjustment (multiplicity).
— Subgroup defined by a post-randomization variable (e.g., adherence, on-treatment LDL) — this breaks randomization.
— Subgroup not pre-specified in the statistical analysis plan.
— No formal test of interaction reported; only within-subgroup p-values shown.
— Was the subgroup pre-specified?
— Was there a significant interaction test (p-interaction)?
— Is the subgroup biologically plausible?
— Is the finding consistent across related trials/meta-analyses?
— How many subgroups were tested (multiplicity)?

— "Trial showed no overall benefit, but in patients >75 the hazard ratio was 0.7 (p=0.04)." → Tempting but suspect.
— "Drug reduced mortality overall; however, women showed HR 1.1 (p=0.6)." → Likely insufficient power, not true absence of effect.
— "Post-hoc analysis suggested benefit only in patients with elevated CRP." → Hypothesis-generating.
— "Authors performed 18 subgroup analyses; one was significant at p<0.05." → Multiplicity (expect ~1 by chance).
— Was the analysis pre-specified in the protocol vs. data-driven?
— Was it stratified at randomization or analyzed only afterward?
— Number of subgroups examined (denominator for multiplicity).
— Reporting of p-for-interaction vs. only within-group p-values.
— Direction and magnitude consistent with overall effect?
— "Post-hoc," "exploratory," "data-driven," "unplanned," "subset that emerged."
— "On-treatment analysis" or "per-protocol subgroup" — non-randomized comparison hidden as a subgroup.
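The "18 subgroup analyses, one significant" vignette above can be checked with simple arithmetic. This is a minimal sketch that assumes the tests are independent and that every null hypothesis is true; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of one or more false positives across independent tests."""
    return 1 - (1 - alpha) ** n_tests


n = 18
print(f"{n} tests: expect {expected_false_positives(n):.1f} false positives, "
      f"P(at least one) = {prob_at_least_one(n):.2f}")
# prints: 18 tests: expect 0.9 false positives, P(at least one) = 0.60
```

Real subgroups are usually correlated (the same patients appear in several of them), so these figures are a rough guide rather than an exact rate.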

— Point estimates (squares) for each subgroup with 95% CIs (whiskers).
— Overall effect (diamond) at the bottom.
— Vertical line of no effect (HR/RR = 1.0).
— p-for-interaction typically on the right side.
— Do CIs cross 1.0 in subgroups? Usually yes — reflects underpowering, not absence of effect.
— Are point estimates on the same side of the line? Consistent direction supports overall effect generalizing.
— Is there a qualitative interaction (effect reverses direction — e.g., benefit in men, harm in women)? Rare and demands scrutiny.
— Is there a quantitative interaction (same direction, different magnitude)? More common and more believable.
— No p-for-interaction reported → suspect cherry-picking.
— No mention of pre-specification → likely post-hoc.
— No correction for multiple comparisons when ≥5 subgroups tested.
— "Significant" subgroup whose CI overlaps that of a non-significant subgroup → no good evidence that the effects differ; only an interaction test can address that question.
— Each subgroup has roughly half (or less) the n of the overall trial → wide CIs and unstable estimates are expected.
— A subgroup with n=200 cannot reliably detect a 15% relative risk reduction even if real.
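The n=200 claim above can be checked with a rough power calculation. The sketch below uses a normal approximation for comparing two proportions. The 20% control event risk, the equal allocation between arms, and the function name `two_arm_power` are illustrative assumptions, not values from any trial.

```python
from math import sqrt
from statistics import NormalDist


def two_arm_power(control_risk, rrr, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing event proportions.

    control_risk: event risk in the control arm (assumed value)
    rrr: true relative risk reduction in the treated arm
    n_per_arm: patients per arm (equal allocation assumed)
    """
    treated_risk = control_risk * (1 - rrr)
    se = sqrt(control_risk * (1 - control_risk) / n_per_arm
              + treated_risk * (1 - treated_risk) / n_per_arm)
    z_effect = (control_risk - treated_risk) / se
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection region.
    return normal.cdf(z_effect - z_crit) + normal.cdf(-z_effect - z_crit)


# Subgroup of 200 (100 per arm) versus a trial with 5,000 per arm.
print(f"n=200 subgroup: power = {two_arm_power(0.20, 0.15, 100):.2f}")
print(f"n=10,000 trial: power = {two_arm_power(0.20, 0.15, 5000):.2f}")
```

Under these assumptions the 200-patient subgroup has power well under 10%, while 5,000 patients per arm gives power above 90%. A non-significant subgroup p-value is therefore the expected result even when the 15% reduction is real.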

— Check the published protocol or statistical analysis plan (SAP) dated before unblinding.
— Pre-specified subgroups are listed by name and rationale.
— Post-hoc subgroups are identified only after looking at the data — major credibility downgrade.
— Must be a pre-randomization characteristic (age, sex, comorbidity, baseline biomarker).
— Variables measured after randomization (adherence, on-treatment LDL, side effects) are not valid subgroup variables — they break the randomized comparison and introduce confounding.
— With α=0.05, expect ~1 false-positive per 20 independent tests.
— Trials often report 10–20 subgroups; with every null hypothesis true, the chance of at least one "significant" finding at α=0.05 is roughly 40–64% for independent tests.
— Adjustment methods: Bonferroni, Holm, or pre-specifying a small number of primary subgroups.
— p-for-interaction tests whether effect size differs across strata.
— Significant p-interaction (<0.05, sometimes <0.10 given low power) = effect modification plausible.
— Non-significant p-interaction even with a "positive" subgroup = no evidence that the effect differs across strata (not proof of homogeneity, given low power); apply overall result.
— Does mechanism predict the subgroup difference (e.g., EGFR-mutated NSCLC responding to EGFR inhibitors)?
— Or is the subgroup demographic with no biological rationale (zodiac sign, day of week)?
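The Bonferroni and Holm adjustments named above can be sketched as adjusted p-values, which are then compared against the usual α. The five subgroup p-values below are hypothetical, and the function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must never decrease as raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical within-trial subgroup p-values.
raw = [0.04, 0.30, 0.62, 0.008, 0.75]
print("Bonferroni:", bonferroni(raw))
print("Holm:      ", holm(raw))
```

With these inputs the raw p=0.04 finding no longer clears 0.05 under either method, while the p=0.008 finding survives both (adjusted to about 0.04).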

— Formal statistical test asking: "Does the treatment effect differ across subgroup levels?"
— In regression models: include a treatment × subgroup interaction term; its p-value is the p-interaction.
— Low power — often interpreted at α=0.10 rather than 0.05.
— Bonferroni: divide α by number of tests (simple, conservative).
— Holm-Bonferroni: sequential, less conservative.
— False discovery rate (Benjamini-Hochberg): controls the expected proportion of false positives among the findings declared significant; useful when many subgroups are tested.
— Effect modification (interaction): real biological/clinical phenomenon — the treatment truly works differently in different groups. Report stratified estimates.
— Confounding: distortion of an association by a third variable. Adjust or stratify it away. Different concept entirely.
— In meta-analysis, I² quantifies the share of variability in effect estimates that reflects heterogeneity across trials rather than chance.
— In a single trial, the analogous concept is interaction across subgroups.
— Subgroup-specific estimates can be "shrunk" toward the overall trial estimate, recognizing that extreme subgroup results are usually regression-to-the-mean artifacts.
— Produces more conservative, reliable subgroup estimates.
— A subgroup finding gains credibility when replicated in independent trials or pooled meta-analyses.
— Single-trial subgroup signals should rarely change practice.
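When only a forest plot is available, the interaction test described above can be approximated from two subgroup hazard ratios and their 95% CIs. This sketch assumes the two subgroups contain different patients (so the estimates are independent) and that log hazard ratios are approximately normal. The numbers and function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def log_hr_and_se(hr, ci_low, ci_high, level=0.95):
    """Recover the log hazard ratio and its standard error from a reported CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return log(hr), (log(ci_high) - log(ci_low)) / (2 * z)


def interaction_p_value(subgroup_a, subgroup_b):
    """Two-sided p-value for a difference between two independent log HRs.

    Each argument is a tuple (hr, ci_low, ci_high).
    """
    est_a, se_a = log_hr_and_se(*subgroup_a)
    est_b, se_b = log_hr_and_se(*subgroup_b)
    z = (est_a - est_b) / sqrt(se_a ** 2 + se_b ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical forest-plot rows: one "significant" and one "non-significant".
men = (0.70, 0.55, 0.89)
women = (0.85, 0.60, 1.20)
print(f"p-interaction = {interaction_p_value(men, women):.2f}")
```

With these made-up inputs the p-interaction is about 0.37: one CI excludes 1.0 and the other does not, yet there is no evidence that the two effects differ. A regression model with a treatment × subgroup term on patient-level data remains the preferred analysis when the data are available.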

— Pre-specified in the protocol.
— Significant p-for-interaction.
— Strong biological rationale (e.g., genetic marker predicting drug response).
— Consistent across multiple trials or meta-analyses.
— Few subgroups tested overall (low multiplicity burden).
— Quantitative interaction (same direction, different magnitude).
— Pre-specified but no replication yet.
— Borderline p-for-interaction (0.05–0.10).
— Plausible mechanism but limited prior data.
— Post-hoc / data-driven.
— Many subgroups tested without adjustment.
— No interaction test or interaction p > 0.10.
— Qualitative interaction (effect reverses direction) without biological explanation.
— Subgroup defined by post-randomization variable.
— Inconsistent across trials.

— Translate average treatment effect (ATE) from the trial into an estimate of individual treatment effect for your patient.
— The overall ATE is usually the best available estimate for any individual unless strong effect modification is established.
— Patient has a biomarker with validated predictive value (e.g., EGFR mutation, HER2 status, BRCA, CFTR genotype).
— Patient is in a population explicitly excluded from the trial (extrapolation caution, not subgroup analysis).
— Validated risk-based heterogeneity of treatment effect (HTE): patients at higher baseline risk often have larger absolute benefit even when relative risk reduction is constant.
— Relative risk reduction (RRR) is often similar across subgroups.
— Absolute risk reduction (ARR) varies with baseline risk → high-risk subgroups have larger ARR and smaller NNT, even without a true interaction.
— This is risk-based HTE, not statistical effect modification — and it's a legitimate basis for personalization.
— RRR ~25% across risk strata.
— ARR much larger in patients with 10-year ASCVD risk >20% than <5%.
— Guidelines (ACC/AHA) use this principle to set treatment thresholds — not subgroup analyses per se.
— CHA₂DS₂-VASc stratifies absolute stroke risk; benefit of anticoagulation is larger in absolute terms at higher scores, even though RRR is roughly constant.
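The risk-based HTE point above (constant RRR, varying ARR) reduces to one line of arithmetic: ARR = baseline risk × RRR, and NNT = 1 / ARR. The sketch uses the 25% RRR and the 5% and 20% baseline risks mentioned above as round illustrative inputs; the function name is an assumption.

```python
from math import ceil


def absolute_benefit(baseline_risk, rrr):
    """Return (ARR, NNT) for a given baseline risk and relative risk reduction."""
    arr = baseline_risk * rrr
    # Round first so floating-point noise cannot bump an exact NNT up by one.
    nnt = ceil(round(1 / arr, 6))
    return arr, nnt


for risk in (0.05, 0.20):
    arr, nnt = absolute_benefit(risk, 0.25)
    print(f"baseline risk {risk:.0%}: ARR = {arr:.2%}, NNT = {nnt}")
```

The same relative effect yields an NNT of 80 at 5% baseline risk and 20 at 20% baseline risk, with no interaction on the relative scale.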

— List subgroups before unblinding, with hypothesized direction.
— Limit to a small number (typically ≤5) of clinically and biologically justified subgroups.
— Pre-specify the interaction test as the primary subgroup analysis, not within-group p-values.
— Randomization stratified by a key variable (e.g., diabetes status, center, baseline severity) ensures the treatment arms are balanced within each stratum.
— Improves power for the planned subgroup interaction analysis.
— Does NOT by itself make subgroup findings causal — but supports their validity.
— Biomarker-enrichment trials enroll only patients predicted to benefit (e.g., HER2+ in trastuzumab trials).
— Adaptive enrichment: trial begins broad, narrows enrollment to responsive subgroup based on interim analysis. Requires rigorous statistical control.
— Hierarchical testing (test primary outcome before subgroups; only test subgroups if primary positive).
— Gatekeeping procedures.
— Pre-specified α allocation across subgroups.
— CONSORT guidelines require reporting all pre-specified subgroups, with interaction tests and acknowledgment of post-hoc status.
— Forest plots with p-for-interaction are now standard.
— Do NOT define subgroups using post-baseline variables.
— Do NOT report only "significant" subgroups (selective reporting).
— Do NOT interpret non-significant within-subgroup p-values as evidence of no effect.
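The hierarchical testing idea above (test subgroups only if the primary outcome is positive) can be sketched as a simple gatekeeper. This is one possible arrangement, not a prescribed procedure: the primary outcome is tested at the full α, and only if it passes are the pre-specified interaction tests examined, here with a Holm step-down. The subgroup names and p-values are hypothetical.

```python
def gatekept_subgroup_tests(primary_p, interaction_p, alpha=0.05):
    """Serial gatekeeping: subgroup interactions are tested only after a positive primary.

    interaction_p: dict mapping subgroup name to its interaction p-value.
    """
    if primary_p >= alpha:
        # Gate closed: subgroup results are exploratory only.
        return {"primary_significant": False, "subgroups_tested": False, "flagged": []}

    m = len(interaction_p)
    ordered = sorted(interaction_p.items(), key=lambda item: item[1])
    flagged = []
    for rank, (name, p) in enumerate(ordered):
        # Holm step-down: thresholds are alpha/m, alpha/(m-1), ..., alpha.
        if p <= alpha / (m - rank):
            flagged.append(name)
        else:
            break  # stop at the first failure; later tests are not rejected
    return {"primary_significant": True, "subgroups_tested": True, "flagged": flagged}


print(gatekept_subgroup_tests(0.20, {"age": 0.01}))
print(gatekept_subgroup_tests(0.001, {"diabetes": 0.004, "age": 0.03, "sex": 0.40}))
```

In the first call the null primary result closes the gate, so the age interaction is not tested despite p=0.01. In the second, only the diabetes interaction clears its Holm threshold.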

— Older patients are systematically underenrolled in pivotal trials (often <20% of participants are >75).
— Subgroup analyses by age are common but typically underpowered.
— A non-significant effect in the elderly subgroup almost never means the drug doesn't work — it usually means n was too small.
— Frequently analyzed because of pharmacokinetic concerns.
— Often defined by baseline eGFR or Child-Pugh class — these are valid pre-randomization variables.
— Interaction tests rarely significant; differences in efficacy usually reflect competing risks (older/sicker patients die of other causes) rather than true effect modification.
— Even when RRR is preserved, absolute benefit may be smaller if life expectancy is short — relevant to primary prevention decisions (statins, aspirin, cancer screening).
— Conversely, absolute benefit may be larger in elderly with high event rates (secondary prevention).
— USPSTF often deviates from trial subgroups by using modeled lifetime benefit, e.g., colorectal cancer screening recommended routinely through age 75, individualized from 76 to 85, and not recommended after 85.
— Statin primary prevention in adults >75 is "individualized" because trial subgroup data are sparse and competing risks rise.
— Elderly patients in clinical practice often differ from trial participants (more comorbidities, more drugs).
— This is external validity / generalizability, not subgroup analysis — different concept.

— Historically, women underrepresented in CV trials → subgroup analyses often appear "weaker" in women, usually reflecting power, not biology.
— Notable exception: aspirin for primary prevention — meta-analyses suggested differential effects (MI reduction in men, stroke reduction in women) that influenced earlier guidelines, though modern recommendations have shifted toward bleeding-risk based decisions.
— NIH now requires sex as a biological variable in trial design.
— Highly fraught — race is a social construct that correlates imperfectly with biology.
— Some valid findings: BiDil (isosorbide dinitrate/hydralazine) approved specifically for self-identified Black patients with HF based on A-HeFT, but this remains controversial.
— ACE inhibitors and thiazides — older subgroup data suggested differential efficacy by race; current guidelines (JNC 8/ACC-AHA) acknowledge this but emphasize individualized care.
— Pregnant patients are excluded from most trials → not a subgroup issue but an external validity / extrapolation issue.
— Treatment decisions rely on observational data, registries (e.g., MotherToBaby), and pharmacokinetic studies.
— FDA pediatric extrapolation framework: when disease and drug response are similar to adults, adult efficacy data can be extrapolated with PK/safety bridging studies.
— This is not subgroup analysis — it's a regulatory pathway recognizing limits of pediatric trials.
— EGFR, ALK, BRAF, HER2, BRCA, KRAS — biomarker-defined subgroups with strong biology and replicated effects.
— These are the textbook examples of valid effect modification changing practice.

— Most common harm: a clinician sees an underpowered "no benefit in subgroup X" finding and denies a patient proven treatment.
— Example: not prescribing statins to women or elderly because of misread subgroup data — both groups benefit per overall trial and meta-analytic evidence.
— Acting on a spurious "positive" subgroup leads to treating patients who won't benefit and may be harmed.
— Example: vitamin E for cardiovascular protection — observational subgroups suggested benefit; RCTs (HOPE, GISSI) showed none.
— "Winner's curse" — selected significant subgroups overestimate true effects.
— Subsequent confirmatory trials in the subgroup often show smaller or null effects.
— Healthcare systems may target therapy or screening based on subgroups, missing patients who would benefit or wasting resources on non-responders.
— Repeated subgroup-driven reversals (e.g., HRT in postmenopausal women — observational subgroups suggested CV benefit; WHI RCT showed harm) damage clinician and patient confidence in EBM.
— FDA may restrict indications based on subgroup findings; payers may deny coverage outside narrow groups.
— Conversely, accelerated approvals based on subgroup signals sometimes don't replicate (oncology drugs withdrawn after confirmatory trial failures).
— Documentation should reflect that treatment decisions are based on the best overall evidence, not cherry-picked subgroups.
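The winner's curse noted above can be illustrated with a small simulation. All settings are hypothetical: every subgroup shares the same modest true effect (log HR of -0.10, HR about 0.90) and is estimated with a standard error of 0.15. Only subgroups that reach nominal significance in the direction of benefit are "selected".

```python
import random
from math import exp
from statistics import mean


def simulate_winners_curse(true_log_hr=-0.10, se=0.15, n_subgroups=10_000, seed=0):
    """Compare the true HR with the average HR among 'significant' subgroups."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    z_crit = 1.96
    estimates = [rng.gauss(true_log_hr, se) for _ in range(n_subgroups)]
    # Keep only estimates that look significantly beneficial.
    winners = [e for e in estimates if e / se < -z_crit]
    return {
        "true_hr": exp(true_log_hr),
        "mean_hr_all": exp(mean(estimates)),
        "mean_hr_winners": exp(mean(winners)),
        "fraction_selected": len(winners) / n_subgroups,
    }


result = simulate_winners_curse()
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

Every selected estimate must lie beyond the significance threshold (HR below about 0.75 here), so the selected subgroups overstate the true HR of about 0.90 by construction. A confirmatory trial would be expected to land nearer the true value.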

— Designing a trial with subgroup hypotheses → SAP review essential before data lock.
— Interpreting a complex subgroup analysis with multiple interaction terms.
— Performing or reading a meta-analysis with subgroup or meta-regression analyses.
— Bayesian or hierarchical modeling for subgroup shrinkage.
— Guideline committee using subgroup data to recommend (or restrict) therapy in a specific population — request to see pre-specification, p-interaction, replication.
— Pharmaceutical marketing emphasizing subgroup benefits not reflected in primary outcome — flag to P&T committee.
— Institutional protocol changes driven by single-trial subgroup findings — request systematic review.
— Read the abstract → identify primary outcome and overall result first.
— Look for subgroup results only after understanding the overall effect.
— Apply the credibility checklist (pre-specification, p-interaction, plausibility, replication, multiplicity).
— Decide: act on overall result, treat subgroup as hypothesis-generating, or await replication.
— Just as you'd consult cardiology for unclear chest pain, consult a methodologist for unclear subgroup interpretation rather than acting unilaterally.
— Hospital librarians and EBM services can help locate replicating studies.
— "ICU-level" evidence: pre-specified subgroup, replicated, biologically grounded, large effect → can change practice.
— "Stepdown" evidence: pre-specified, single trial, plausible → cautious application, monitor.
— "Floor" evidence: post-hoc, exploratory → hypothesis-generating only.
— "Discharge home" evidence: zodiac-sign-equivalent → ignore.

— Effect modification: treatment effect genuinely differs across strata — a real phenomenon; report stratified estimates.
— Confounding: a third variable distorts the observed association — adjust or stratify to remove it.
— Same statistical maneuver (stratification) can reveal both; interpretation differs.
— Subgroup analysis: examines treatment effect within pre-defined strata of one variable.
— HTE: broader concept — recognizes that individual responses vary, often driven by baseline risk. Modern approach uses risk-based or model-based HTE rather than one-variable-at-a-time subgrouping.
— Per-protocol analysis excludes non-adherent patients → breaks randomization, introduces selection bias.
— Not technically a subgroup, but often confused with one. Intention-to-treat (ITT) is the primary efficacy analysis in superiority trials.
— Sensitivity analysis: repeats the primary analysis under different assumptions (e.g., different missing data approach, different outcome definition) to test robustness.
— Not the same as testing effect in different patient strata.
— Mediation: examines mechanism (does the effect go through variable Z?).
— Subgroup: examines who benefits.
— Adjustment: controls for baseline covariates in regression → improves precision; it does not test for effect modification.
— Subgroup: separately estimates effects in strata.

— Trials with multiple primary or secondary outcomes face the same multiplicity problem.
— Solution: hierarchical testing, α adjustment, pre-specified primary outcome.
— Combine outcomes (death, MI, stroke, hospitalization) to improve power.
— Pitfall: if "wins" are driven by a soft component (hospitalization) while hard components (death) show no effect, the overall positive result is misleading.
— Always inspect components individually.
— Subgroup analyses on surrogate markers (HbA1c, LDL, viral load) may not translate to clinical outcomes.
— Classic case: CAST trial — antiarrhythmics suppressed PVCs (surrogate) but increased mortality.
— Extreme baseline values tend to be less extreme on repeat measurement.
— Misinterpreting this as treatment effect in a "high-baseline" subgroup is a classic trap.
— In observational subgroup analyses, follow-up time during which the outcome could not have occurred is misattributed to a treatment group (immortal time bias).
— Common in pharmacoepidemiology comparing "adherent" vs. "non-adherent" subgroups.
— Subgroups defined by surviving long enough to receive a treatment or develop a marker are inherently selected for better prognosis.
— Group-level associations (e.g., country-level data) misapplied to individuals (ecological fallacy).
— "Significant" subgroups get published; null subgroups remain in supplements or unpublished.
— Inflates apparent evidence for subgroup-specific effects.
— Even without explicit p-hacking, analytic flexibility (which subgroups, which covariates, which model) inflates false-positive rates.
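The regression-to-the-mean trap described above can be shown with a simulation in which no treatment is given at all. The values are hypothetical: a stable underlying biomarker (mean 130, SD 15) is measured twice with independent measurement noise (SD 10), and the "high-baseline" subgroup is the top 10% on the first measurement.

```python
import random
from statistics import mean


def simulate_regression_to_mean(n_patients=20_000, seed=0):
    """Return (baseline mean, follow-up mean) for the high-baseline subgroup."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    true_values = [rng.gauss(130, 15) for _ in range(n_patients)]
    # Two noisy measurements of the same unchanged underlying value.
    baseline = [t + rng.gauss(0, 10) for t in true_values]
    follow_up = [t + rng.gauss(0, 10) for t in true_values]
    cutoff = sorted(baseline)[int(0.9 * n_patients)]
    high = [i for i, b in enumerate(baseline) if b >= cutoff]
    return mean(baseline[i] for i in high), mean(follow_up[i] for i in high)


base_mean, follow_mean = simulate_regression_to_mean()
print(f"high-baseline subgroup: baseline {base_mean:.1f}, follow-up {follow_mean:.1f}")
```

The selected subgroup "improves" on repeat measurement purely because part of its extreme baseline was noise. A randomized control arm selected the same way shows the same drop, which is why the comparison between arms, not the change from baseline, carries the evidence.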

— Always read the primary outcome first; let it anchor your interpretation.
— Locate the pre-specified analysis plan when subgroup claims are made.
— Check the forest plot for p-for-interaction and CI overlap, not within-group p-values.
— Look for replication in independent trials before changing practice.
— Trial registration (ClinicalTrials.gov) and SAP publication requirements.
— CONSORT and SPIRIT reporting standards.
— Pre-registration of analyses (Open Science Framework, AsPredicted).
— Mandatory disclosure of post-hoc status in journals.
— Regular journal club participation with explicit attention to subgroup methodology.
— UpToDate and guideline appendices often note the strength of subgroup evidence — read them.
— When guidelines cite subgroup data, check the level of evidence (Class I/IIa/IIb, Level A/B/C).
— Subgroup-driven recommendations are typically Class IIa or IIb with Level B or C evidence.
— Use absolute risk and NNT rather than relative effects when discussing benefit.
— Risk calculators (ASCVD, CHA₂DS₂-VASc, FRAX) operationalize risk-based HTE for individualized care.
— Don't let one new subgroup-driven publication overturn well-established overall effects.
— Be wary of pharmaceutical detailing that emphasizes a favorable subgroup.

— Treatment effects evolve as new trials and meta-analyses appear.
— A subgroup signal from one trial should be tracked for replication or refutation in subsequent studies.
— Living systematic reviews and Cochrane updates provide ongoing synthesis.
— When new pivotal trials are published in your specialty (typically annual major conferences — AHA, ACC, ASCO, ADA, ASH).
— When guideline updates are released (every 3–5 years for most societies).
— When FDA approvals or label changes occur — often based on subgroup or biomarker data.
— Initial trial publishes intriguing subgroup → does a confirmatory trial follow?
— Examples of replicated signals → adopted into guidelines (sacubitril/valsartan in HFrEF, then expanded to HFpEF subgroup with EF <60%).
— Examples of refuted signals → withdrawn or downgraded (vitamin E, hormone replacement therapy for CV protection).
— Health systems track prescribing patterns to ensure proven therapies are not being withheld based on misread subgroup data.
— Example: statins after MI — institutional audits identify under-prescribing in women and elderly, where overall trial data clearly support use.
— Decisions based on subgroup data should be revisited as evidence accumulates.
— Document the rationale ("treated per overall trial effect; awaiting replication of subgroup signal") to support continuity.
— Periodic self-audit: am I applying overall trial results consistently, or selectively withholding based on subgroup beliefs?
— Engage with EBM resources (Cochrane, USPSTF, society guidelines) rather than single-trial subgroup interpretations.
— De-implementation is harder than implementation; deliberately review and update practices when evidence evolves.

— Patients deserve honest framing of evidence — including the uncertainty around subgroup-specific claims.
— Avoid overstating benefit ("studies show this works especially well in people like you") when the subgroup data are weak.
— Conversely, avoid withholding ("studies say this doesn't work in your group") based on underpowered subgroups.
— Historical underrepresentation of women, minorities, elderly, and pregnant patients in trials produces subgroup analyses that are systematically underpowered — yet are sometimes used to deny these populations effective therapy.
— Ethical obligation: apply best available evidence (usually the overall effect) and advocate for inclusive trial enrollment.
— Industry-funded trials may emphasize favorable subgroups in marketing.
— Disclosure requirements (ICMJE, Sunshine Act / Open Payments database) help — but vigilance is required.
— Modern IRBs increasingly review SAPs and pre-registration to limit post-hoc subgroup fishing.
— Selective reporting of subgroups is a recognized questionable research practice and, when it misrepresents the results, can amount to research misconduct.
— A common Step 3 scenario: a patient is discharged on a drug based on overall trial benefit; the outpatient clinician sees a subgroup claim ("doesn't work in diabetics") and discontinues the medication. This break in continuity at medication reconciliation can cause real harm (e.g., stopping a beta-blocker post-MI). The safer practice: do not discontinue evidence-based therapy on the basis of a single subgroup analysis without consulting current guidelines.
— CONSORT and journal policies require labeling post-hoc analyses.
— Failure to disclose can constitute scientific misconduct.
— Standard of care is defined by guidelines reflecting overall trial evidence.
— Deviating from guidelines based on personal interpretation of subgroup data — without documentation and patient consent — carries liability risk.

— Quantitative: same direction, different magnitude (common, often plausible).
— Qualitative: direction reverses (rare, demands strong evidence).
— ISIS-2: zodiac sign subgroup (parody).
— CAST: surrogate endpoint trap.
— HRT/WHI: observational subgroups vs. RCT.
— Vitamin E: observational benefit, RCT null.

— Stem: "Trial of drug X showed no overall mortality benefit (HR 0.95, p=0.4), but in patients with elevated CRP, HR 0.7 (p=0.03). The investigators recommend treating high-CRP patients with drug X. Which is the most appropriate response?"
— Answer: Recognize as post-hoc, hypothesis-generating; recommend confirmatory trial; do not change practice.
— Stem: "Trial showed overall benefit (HR 0.7, p<0.001); women showed HR 0.85 (p=0.2). Should drug be withheld from women?"
— Answer: No. The subgroup estimate points in the same direction as the overall effect, the subgroup is underpowered, and no significant interaction is reported. Apply overall effect.
— Stem: "Authors tested 20 subgroups; one was significant at p=0.04. How should this be interpreted?"
— Answer: Likely false positive by chance; ~1 significant result expected.
— Stem: "Among patients who took ≥80% of doses, mortality was reduced. Authors conclude drug works in adherent patients."
— Answer: Recognize as per-protocol analysis — not a valid subgroup; breaks randomization.
— Stem: "p-for-interaction = 0.65. What does this mean?"
— Answer: No evidence that treatment effect differs across subgroups; apply overall effect.
— Stem: "Drug Y showed benefit only in patients with HER2 amplification (pre-specified, p-interaction <0.001, replicated)."
— Answer: Legitimate effect modification; use biomarker to guide therapy.
— Stem: "RRR similar across risk strata, but ARR larger in high-risk patients."
— Answer: Treat high-risk patients preferentially based on absolute benefit and NNT — not subgroup effect modification.
— Stem shows forest plot with all CIs crossing 1.0 except one. Answer: check p-for-interaction; likely chance finding.
— Stem: "Among patients who achieved LDL <70, mortality was lower." Answer: invalid subgroup; post-randomization variable.

Subgroup analyses are hypothesis-generating, not hypothesis-confirming — apply the overall trial effect unless a pre-specified, biologically plausible, replicated subgroup finding shows a statistically significant interaction.
— Pre-specified in the protocol/SAP before unblinding.
— Significant p-for-interaction (not just within-subgroup p-value).
— Strong biological/mechanistic rationale.
— Replicated across independent trials or meta-analyses.
— Few subgroups tested (multiplicity controlled).
— Defined by pre-randomization variable (never post-baseline).
— Post-hoc rescue of a null trial via a "positive" subgroup.
— Withholding effective therapy based on an underpowered "negative" subgroup (overlapping CIs, no significant interaction).
— Confusing per-protocol or on-treatment analyses with valid subgroup analyses.
— Misreading risk-based HTE (varying ARR with constant RRR) as statistical effect modification.
— Interpreting one significant result among 20 tests as meaningful without multiplicity correction.
— Treat patients according to the overall trial result and apply absolute risk / NNT thinking for personalization, reserving subgroup-driven deviation for high-credibility, biomarker-grounded, replicated findings (e.g., HER2, EGFR, BRCA).

