Biostatistics & Population Health
Sensitivity analysis in meta-analysis
— Not the same as subgroup analysis (which asks "does effect differ across groups?")
— Not the same as meta-regression (which models continuous moderators)
— Few trials (k < 10), small total sample, wide CIs that only narrowly exclude 1.0
— One mega-trial contributing >50% of the weight
— High statistical heterogeneity (I² > 50–75%)
— Funnel plot asymmetry suggesting publication bias
— Mix of high- and low-risk-of-bias studies
— Industry-funded trials clustered on one side of the null
— Does the result hold after leave-one-out analysis?
— Does it hold in fixed-effect vs random-effects models?
— Does it hold if only low-risk-of-bias trials are pooled?
— Does it hold after trim-and-fill correction for missing studies?
— Does it hold across alternative effect measures (RR vs OR vs HR)?

— Chief complaint: What clinical question (PICO) does it address?
— Source: Cochrane, journal-published, AHRQ, industry-sponsored?
— Date and search strategy: Recent and reproducible, or outdated?
— Registration: Pre-registered on PROSPERO with a published protocol?
— Conference abstracts or gray literature included on only one side of the comparison
— Language restriction (English-only) — risks Tower of Babel bias
— Single database searched (PubMed alone misses ~30% of eligible trials)
— No assessment of risk of bias using Cochrane RoB 2 or ROBINS-I
— Pooling of RCTs and observational studies together
— Use of post-hoc subgroups to rescue a null primary result
— "Significant overall, but one trial drives it" → leave-one-out reveals fragility
— "Significant but high heterogeneity" → random-effects vs fixed-effect divergence
— "Significant in published trials only" → publication bias suspected; trim-and-fill or Egger's test indicated
— "Significant with abstracts included" → unpublished, non-peer-reviewed data inflating effect

— Point estimates (squares): size proportional to study weight
— Confidence intervals (horizontal lines): width inversely related to precision
— Diamond at bottom: pooled estimate; width = pooled 95% CI
— Line of no effect at RR/OR = 1.0 (or RD = 0)
— Squares clustered on one side of the null with overlapping CIs → consistent effect
— Squares scattered widely with non-overlapping CIs → heterogeneity (visual confirmation of high I²)
— One dominant large square → leverage; leave-one-out is mandatory
— Diamond crossing the null → no statistically significant pooled effect
— X-axis: effect size; Y-axis: standard error (inverted, so large studies at top)
— Symmetric inverted funnel → no publication bias suggested
— Asymmetric (missing studies in bottom corner opposite the effect) → small negative trials likely unpublished
— Egger's regression test (p < 0.10 is the conventional threshold for asymmetry); best suited to continuous outcomes
— Begg's rank correlation — less powerful
— Trim-and-fill — imputes missing studies and recomputes pooled estimate
— 0–25%: low
— 25–50%: moderate
— 50–75%: substantial
— >75%: considerable → pooling may be inappropriate
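As a rough illustration of where these I² bands come from, the sketch below computes Cochran's Q and I² = max(0, (Q − df)/Q) from study effects on an additive scale (for example, log risk ratios). The function names, the example numbers, and the choice to assign boundary values to the lower band are assumptions made for illustration; they are not taken from the source.

```python
def q_and_i2(effects, std_errors):
    """Return (Q, df, I2 in percent) using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2 is truncated at zero when Q is smaller than its degrees of freedom.
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


def i2_band(i2):
    """Map I2 (percent) to the descriptive bands listed above.

    Assumption: a value exactly on a boundary falls in the lower band.
    """
    if i2 <= 25:
        return "low"
    if i2 <= 50:
        return "moderate"
    if i2 <= 75:
        return "substantial"
    return "considerable"


# Hypothetical log risk ratios and standard errors for four trials.
log_rr = [-0.5, -0.1, -0.6, 0.1]
se = [0.1, 0.1, 0.2, 0.2]
q, df, i2 = q_and_i2(log_rr, se)
print(f"Q = {q:.3f} on {df} df, I2 = {i2:.1f}% ({i2_band(i2)})")
```

Because I² is a ratio, it says nothing about the absolute size of between-study variation; that is what τ² and the prediction interval convey.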

— Sequentially remove each study; recompute pooled estimate
— Identifies trials whose exclusion changes significance or magnitude meaningfully
— Especially critical when one study contributes >25–30% of weight
— Fixed-effect assumes one true underlying effect; weights by inverse variance
— Random-effects (DerSimonian-Laird, REML) assumes distribution of true effects; gives smaller studies relatively more weight
— Large divergence between the two models → heterogeneity is driving the result
— Step 3 default: random-effects when clinical/methodological diversity exists
— Re-pool using only low-risk-of-bias trials (Cochrane RoB 2)
— If effect attenuates substantially, the original estimate was inflated by methodological flaws (lack of allocation concealment, unblinded outcome assessment)
— Re-analyze using intention-to-treat vs per-protocol populations
— Re-analyze with alternative effect measures (RR vs OR vs HR)
— For composite endpoints, decompose to individual components
— Best-case/worst-case scenarios for missing outcome data
— Multiple imputation under MAR vs pattern-mixture under MNAR
— Pre-specified subgroups → confirmatory
— Post-hoc subgroups → hypothesis-generating only
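The leave-one-out and fixed-versus-random comparisons above can be sketched in a few lines. This is a minimal illustration, assuming effects are supplied as log odds ratios with standard errors, a normal-approximation 95% CI, and the DerSimonian-Laird estimator for τ² (REML and the HKSJ adjustment are not implemented). The function names and trial data are hypothetical.

```python
import math


def pool(effects, std_errors, model="random"):
    """Inverse-variance pooling on the log scale.

    Returns (estimate, ci_low, ci_high, tau2). With model="random", tau2 is
    the DerSimonian-Laird between-study variance; with model="fixed" it is 0.
    """
    variances = [se ** 2 for se in std_errors]
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    tau2 = 0.0
    if model == "random" and len(effects) > 1:
        q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
        c = sum_w - sum(wi ** 2 for wi in w) / sum_w
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    # Adding tau2 to every variance flattens the weights toward equality.
    w_star = [1.0 / (v + tau2) for v in variances]
    est = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return est, est - 1.96 * se, est + 1.96 * se, tau2


def leave_one_out(labels, effects, std_errors, model="random"):
    """Re-pool with each study removed; returns (label, OR, low, high) rows."""
    rows = []
    for i, label in enumerate(labels):
        y = effects[:i] + effects[i + 1:]
        s = std_errors[:i] + std_errors[i + 1:]
        est, lo, hi, _ = pool(y, s, model)
        rows.append((label, math.exp(est), math.exp(lo), math.exp(hi)))
    return rows


# Hypothetical log odds ratios: trial A is large and strongly favorable.
labels = ["A", "B", "C", "D"]
log_or = [-0.6, -0.05, 0.0, -0.1]
se = [0.1, 0.25, 0.3, 0.25]

for name in ("fixed", "random"):
    est, lo, hi, tau2 = pool(log_or, se, name)
    print(name, [round(math.exp(x), 2) for x in (est, lo, hi)])
for row in leave_one_out(labels, log_or, se, model="fixed"):
    print("without", row[0], [round(x, 2) for x in row[1:]])
```

With these made-up numbers the fixed-effect interval excludes 1 while the random-effects interval does not, and under the fixed-effect model only the removal of trial A pushes the interval across the null. That is the pattern of a single dominant trial driving the result.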

— Imputes "missing" studies to symmetrize the funnel plot
— Provides an adjusted pooled estimate; if it crosses the null while the original did not, publication bias is likely material
— Limitation: assumes asymmetry = publication bias (may actually be true heterogeneity)
— Quantitative funnel asymmetry test
— p < 0.10 (note: more lenient threshold) suggests small-study effects
— Add studies sequentially in chronological order
— Reveals when the evidence "crossed" significance and whether it has stabilized
— Helps detect if early trials drove a now-overturned result (Proteus phenomenon)
— Vary prior distributions (skeptical, enthusiastic, neutral)
— Robust results survive a skeptical prior
— Test consistency between direct and indirect comparisons (node-splitting)
— Inconsistency suggests violations of the transitivity assumption
— Downgrade for risk of bias, inconsistency, indirectness, imprecision, publication bias
— Upgrade for large effect, dose-response, plausible confounders biasing toward null
— Final certainty: high / moderate / low / very low
— Distribution of significant p-values reveals evidential value vs p-hacking
— Gold standard; allows uniform outcome definitions and time-to-event modeling
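For the quantitative funnel asymmetry test mentioned in this section, one common formulation of Egger's regression fits an ordinary least-squares line of the standardized effect (effect / SE) on precision (1 / SE); an intercept far from zero suggests small-study effects. The sketch below computes the intercept, its standard error, and the t statistic by hand. It stops short of a p-value, which would come from a t distribution with k − 2 degrees of freedom and is not available in the Python standard library. The function name and data are hypothetical.

```python
import math


def egger_regression(effects, std_errors):
    """OLS of (effect / SE) on (1 / SE).

    Returns (intercept, slope, se_intercept, t_statistic).
    The intercept is the asymmetry estimate; the slope estimates the
    underlying effect.
    """
    k = len(effects)
    if k < 3:
        raise ValueError("need at least 3 studies")
    x = [1.0 / se for se in std_errors]
    z = [y / se for y, se in zip(effects, std_errors)]
    x_bar = sum(x) / k
    z_bar = sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    rss = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    residual_var = rss / (k - 2)
    se_intercept = math.sqrt(residual_var * (1.0 / k + x_bar ** 2 / sxx))
    # A perfect fit gives a zero standard error; report NaN rather than divide.
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, se_intercept, t_stat


# Hypothetical log odds ratios in which smaller trials show larger benefits.
log_or = [-0.12, -0.30, -0.40, -0.95, -0.70]
se = [0.05, 0.10, 0.20, 0.40, 0.30]
b0, b1, se_b0, t = egger_regression(log_or, se)
print(f"intercept = {b0:.2f} (SE {se_b0:.2f}), t = {t:.2f} on {len(se) - 2} df")
```

With benefit coded as a negative log odds ratio, a negative intercept means the less precise trials report larger benefits. As noted above, that pattern may reflect true heterogeneity as well as publication bias.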

— Tier 1 (Robust): Effect direction, magnitude, and significance preserved across leave-one-out, fixed vs random, low-RoB subset, and trim-and-fill → act on it
— Tier 2 (Moderately robust): Direction and significance preserved in primary analyses but attenuated in low-RoB or trim-and-fill subsets → act with caveats, individualize
— Tier 3 (Fragile): Significance lost in ≥1 key sensitivity analysis, or one trial drives the result → do not change practice on this alone; await further trials
— Tier 4 (Uninterpretable): I² > 75%, k < 5, severe publication bias → pooled estimate is misleading; reason from individual high-quality trials
— A robust but tiny effect (NNT > 500) may not justify intervention costs or harms
— A fragile but large effect deserves a confirmatory adequately-powered RCT before adoption
— USPSTF, AHA/ACC, IDSA use GRADE-like systems; strong recommendations typically require high-certainty evidence (robust meta-analysis + consistent RCTs)
— Weak/conditional recommendations acknowledge fragility and emphasize shared decision-making
— Assess certainty of evidence (GRADE level)
— Assess magnitude (absolute risk reduction, NNT)
— Assess applicability (does the patient resemble the trial population?)
— Assess harms and patient values
— Document shared decision-making
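The magnitude step can be made concrete with a small calculation: the same relative effect translates into very different absolute benefits depending on baseline risk. The sketch below assumes the pooled relative risk applies uniformly across baseline risks (a simplification) and rounds the NNT up, which is a common convention. The function name and inputs are hypothetical.

```python
import math


def arr_and_nnt(baseline_risk, relative_risk):
    """Return (absolute risk reduction, NNT rounded up to a whole patient)."""
    if not 0 < baseline_risk <= 1:
        raise ValueError("baseline_risk must be in (0, 1]")
    arr = baseline_risk * (1.0 - relative_risk)
    if arr <= 0:
        raise ValueError("no risk reduction: NNT is undefined")
    # Round first so floating-point noise (e.g. 50.00000000000001)
    # does not push the ceiling up by one.
    nnt = math.ceil(round(1.0 / arr, 6))
    return arr, nnt


# Hypothetical scenarios: moderate-risk and very-low-risk patients.
for risk, rr in [(0.10, 0.80), (0.002, 0.90)]:
    arr, nnt = arr_and_nnt(risk, rr)
    print(f"baseline {risk:.3f}, RR {rr:.2f}: ARR = {arr:.4f}, NNT = {nnt}")
```

The second scenario shows how a statistically robust relative effect can still produce an NNT in the thousands, which is the situation flagged above as possibly not justifying intervention costs or harms.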

— Assumes a single true effect size across all studies
— All variability is sampling error
— Weights largely favor large studies
— Narrower CIs; reaches statistical significance more readily
— Appropriate when studies are clinically and methodologically homogeneous
— Assumes a distribution of true effects
— Incorporates between-study variance (τ²)
— Smaller studies relatively up-weighted vs fixed-effect
— Wider CIs; more conservative
— Default for most clinical meta-analyses with any heterogeneity
— Q statistic (Cochran's): underpowered with few studies
— I²: percentage of total variability across studies attributable to heterogeneity rather than chance
— τ²: absolute between-study variance (interpretable on effect scale)
— Prediction interval: range expected for a new trial — often much wider than CI
— Investigate sources of heterogeneity (meta-regression on year, dose, population)
— Consider whether pooling is even appropriate
— Peto OR: for rare events (<1%) and balanced arms
— Hartung-Knapp-Sidik-Jonkman: better CI coverage with few studies
— Continuity corrections for zero-event cells (add 0.5 or use exact methods)
— RR preferred for cohort studies and RCTs; clinically intuitive
— OR for case-control studies and logistic regression; when outcomes are common, the OR exaggerates the effect relative to the RR
— HR for time-to-event data
— Mean difference for continuous outcomes on same scale
— Standardized mean difference (SMD, Hedges' g) for different scales
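The point about the OR exaggerating the effect when outcomes are common is easy to verify numerically. The sketch below computes both measures from a 2×2 table; the function name and the event counts are hypothetical, chosen so that the RR is identical in the two scenarios.

```python
def risk_ratio_and_odds_ratio(events_trt, n_trt, events_ctl, n_ctl):
    """Return (RR, OR) for treatment vs control from a 2x2 table."""
    risk_t = events_trt / n_trt
    risk_c = events_ctl / n_ctl
    rr = risk_t / risk_c
    odds_ratio = (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c))
    return rr, odds_ratio


# Same relative risk (about 0.67) with a common and a rare outcome.
common = risk_ratio_and_odds_ratio(40, 100, 60, 100)
rare = risk_ratio_and_odds_ratio(4, 1000, 6, 1000)
print("common outcome: RR = %.3f, OR = %.3f" % common)
print("rare outcome:   RR = %.3f, OR = %.3f" % rare)
```

With the common outcome the OR sits noticeably further from 1 than the RR, while with the rare outcome the two nearly coincide. This is why switching effect measures is a meaningful sensitivity analysis mainly when event rates are high.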

— Regress effect size on study-level covariates (mean age, baseline risk, dose, year)
— Helps explain heterogeneity quantitatively
— Requires k ≥ 10 studies per covariate (rule of thumb)
— Ecological fallacy risk: study-level associations ≠ individual-level associations
— Pre-specified in protocol, biologically plausible, limited in number → confirmatory
— Post-hoc, numerous, or driven by data inspection → hypothesis-generating, high false-positive rate
— Test for subgroup interaction (p-interaction) rather than separate within-subgroup p-values
— Compares ≥3 interventions using direct and indirect evidence
— Requires transitivity assumption: trials are comparable in effect modifiers
— Test consistency by comparing direct vs indirect estimates (node-splitting, design-by-treatment interaction)
— Produces SUCRA rankings — interpret cautiously, especially with sparse networks
— Drop trials with high risk of bias
— Restrict to head-to-head trials only
— Test alternative network geometries
— Standardize outcome definitions, adjust for individual covariates, perform time-varying analyses
— Gold standard but resource-intensive
— Continuously updated as new trials emerge
— Use trial sequential analysis (TSA) to control type I error from repeated looks — analogous to interim analyses in single RCTs
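The direct-versus-indirect consistency check described above for network meta-analysis can be illustrated on a single three-treatment loop. The sketch below uses a Bucher-style adjusted indirect comparison through a common comparator C, followed by a z-test on the difference between direct and indirect estimates. It is a simplified stand-in for node-splitting in a full network model, not an implementation of it. All names and numbers are hypothetical, and effects are assumed to be on the log scale.

```python
import math


def indirect_estimate(d_ac, se_ac, d_bc, se_bc):
    """Indirect A-vs-B effect through common comparator C (log scale)."""
    return d_ac - d_bc, math.sqrt(se_ac ** 2 + se_bc ** 2)


def inconsistency_test(d_direct, se_direct, d_indirect, se_indirect):
    """Return (difference, z, two-sided p) for direct minus indirect."""
    diff = d_direct - d_indirect
    se = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    z = diff / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, z, p


# Hypothetical log odds ratios: A vs C, B vs C, and a direct A vs B trial.
d_ab_ind, se_ab_ind = indirect_estimate(-0.5, 0.15, -0.1, 0.15)
diff, z, p = inconsistency_test(0.2, 0.20, d_ab_ind, se_ab_ind)
print(f"indirect A vs B = {d_ab_ind:.2f} (SE {se_ab_ind:.3f})")
print(f"direct minus indirect = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```

Here the indirect evidence favors A while the direct trial favors B, and the difference is larger than chance alone would readily explain. As noted above, that pattern points to a violated transitivity assumption rather than to a trustworthy ranking.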

— DerSimonian-Laird random-effects underestimates τ² with small k
— Use HKSJ adjustment or Bayesian methods with weakly informative priors
— Heterogeneity statistics (I², Q) are underpowered — absence of detected heterogeneity ≠ homogeneity
— Standard inverse-variance methods unstable
— Use Peto OR, Mantel-Haenszel without continuity correction, or exact or model-based methods (e.g., beta-binomial)
— Zero-event trials: avoid arbitrary 0.5 continuity corrections when possible
— Wide CIs, fragile to single-study addition
— Trial sequential analysis can indicate whether the required information size has been reached
— Often must include observational studies; use ROBINS-I for risk-of-bias assessment
— GRADE typically starts at "low" certainty for observational evidence
— Trials with high risk of bias or incomplete reporting are the meta-analytic equivalent of "impaired" inputs
— Sensitivity: restrict to fully reported, low-bias trials
— Contact authors for missing data (a Cochrane standard)
— Older trials may use outdated comparators, doses, or diagnostic criteria
— Sensitivity: restrict to trials within the last 10–15 years or using current standard-of-care comparators
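For the rare-event setting above, the Peto one-step method pools observed-minus-expected event counts, so trials with a zero cell in one arm contribute without any continuity correction. The sketch below is a minimal illustration using the standard hypergeometric variance. The function name and trial counts are hypothetical, and the method is only appropriate under the conditions stated earlier (rare events, roughly balanced arms).

```python
import math


def peto_odds_ratio(tables):
    """Peto one-step pooled OR with a 95% CI.

    tables: list of (events_trt, n_trt, events_ctl, n_ctl).
    Trials with no events (or all events) in both arms carry no
    information and are skipped.
    """
    sum_o_minus_e = 0.0
    sum_v = 0.0
    for a, n1, c, n2 in tables:
        n = n1 + n2
        m = a + c  # total events in the trial
        if m == 0 or m == n:
            continue
        expected = n1 * m / n
        variance = n1 * n2 * m * (n - m) / (n ** 2 * (n - 1))
        sum_o_minus_e += a - expected
        sum_v += variance
    if sum_v == 0:
        raise ValueError("no informative trials")
    log_or = sum_o_minus_e / sum_v
    se = 1.0 / math.sqrt(sum_v)
    return (math.exp(log_or),
            math.exp(log_or - 1.96 * se),
            math.exp(log_or + 1.96 * se))


# Hypothetical rare-event trials, including single-zero and double-zero trials.
trials = [(0, 100, 2, 100), (1, 200, 3, 200), (0, 150, 0, 150)]
or_, lo, hi = peto_odds_ratio(trials)
print(f"Peto OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The wide interval in this example reflects how little information a handful of events carries, which is why such results are fragile to the addition of a single study.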

— Tend to report effects favoring sponsor's product (sponsorship bias)
— Sensitivity analysis: stratify by funding source; if effect is restricted to industry-funded trials, interpret cautiously
— Look for selective outcome reporting — pre-registered protocols on ClinicalTrials.gov vs published primary outcomes
— Often few RCTs; extrapolation from adult data common but flawed
— Outcome measures may differ (weight-for-age, developmental milestones)
— Sensitivity: restrict to age-appropriate dosing studies; assess indirectness for adult-derived data
— RCTs uncommon for ethical reasons; meta-analyses rely on cohort and registry data
— Confounding by indication is the dominant threat
— Sensitivity: restrict to studies with propensity-score matching or active comparator designs rather than untreated-pregnant controls
— Example: SSRI and persistent pulmonary hypertension of the newborn — initial meta-analyses showed elevated risk; sensitivity analyses restricting to studies controlling for maternal depression substantially attenuated the effect
— Many meta-analyses underrepresent Black, Hispanic, and Asian patients
— Generalizability (external validity) is limited; GRADE downgrades for indirectness
— Sensitivity: where possible, restrict to trials with adequate demographic representation
— Acknowledge low certainty from observational sources
— Use shared decision-making, document risks/benefits, involve maternal-fetal medicine
— Reference MotherToBaby or LactMed for curated, sensitivity-aware summaries

— First few small trials show large effects that shrink as larger trials accumulate
— Cumulative meta-analyses and sensitivity analyses reveal instability
— Historical example: early meta-analyses of magnesium for acute MI suggested benefit; later large RCTs (ISIS-4) showed no effect
— Different analytical choices (effect measure, model, inclusion criteria) yield materially different pooled estimates
— A meta-analysis showing wide "vibration of effects" across these choices is analytically fragile
— Trim-and-fill or Egger's test reveals asymmetric funnels
— Adjusted estimate often crosses the null
— Practice change based on biased pool causes patient harm (unnecessary treatment, adverse effects)
— Reporting a "significant" subgroup from a null overall meta-analysis
— Inflates type I error; misleads guideline writers
— Pooling clinically dissimilar trials produces an estimate that applies to no real patient
— Apples-and-oranges critique
— Meta-analyses including pre-2000 trials may reflect superseded standard-of-care
— Indirect comparisons in NMA driven by intransitive evidence produce misleading rankings
— Overtreatment (e.g., hormone replacement therapy for CV prevention — observational pooling suggested benefit, RCTs showed harm)
— Underuse (e.g., beta-blockers in HFrEF initially questioned by some meta-analyses, later confirmed beneficial)
— Wasted resources on low-value care

— Pooled estimate fragile under leave-one-out
— Substantial heterogeneity unexplained by meta-regression
— Funnel asymmetry with trim-and-fill crossing the null
— Clinically important effect with low GRADE certainty
— Practice-changing implications (cost, harm, large eligible population)
— New large trial published since last meta-analysis
— Trial sequential analysis indicates required information size not yet reached
— Cochrane reviews are typically updated every 2–4 years
— Meta-analysis result conflicts with current guideline recommendation
— New safety signal emerges from sensitivity analyses
— Comparative effectiveness data (NMA) suggests preferred agent has changed
— Complex IPD analyses
— Network meta-analysis with inconsistency
— Bayesian sensitivity with non-trivial priors
— Rare-event data requiring exact methods
— Robust, high-certainty meta-analytic evidence aligned with major guidelines
— Cost-effectiveness analysis supports adoption
— Implementation feasibility (staffing, equipment) confirmed
— Change individual practice based on a single fragile meta-analysis
— Discard a robust meta-analysis because of one contradictory small trial
— Conflate statistical significance with clinical importance

— Sensitivity: tests robustness of overall estimate to assumptions/methods
— Subgroup: examines whether effect differs between patient groups
— Both can be pre-specified; subgroup is hypothesis-driven about effect modification
— Sensitivity: discrete re-analyses under alternative scenarios
— Meta-regression: continuous modeling of study-level covariates explaining heterogeneity
— Influence diagnostics: Cook's distance, DFBETAS, externally standardized residuals at the study level
— Identifies which studies disproportionately drive the pooled estimate — a type of sensitivity analysis
— RoB: study-level quality scoring (Cochrane RoB 2, ROBINS-I, Newcastle-Ottawa)
— Sensitivity: re-pools after including or excluding studies by RoB rating; this is RoB applied in re-analysis
— GRADE: overall certainty rating across an evidence body
— Sensitivity: one input to GRADE's inconsistency and risk-of-bias domains
— TSA: adjusts for repeated testing as evidence accumulates; defines required information size
— Both address whether current pooled evidence is conclusive
— Bayesian: integrates prior beliefs with data; sensitivity to prior choice is itself a sensitivity analysis
— PIs: range expected for a future trial under the random-effects model
— Complementary; both convey uncertainty beyond the pooled CI

— Statistically significant or favorable studies more likely to be published
— Detect: funnel plot, Egger's test, trim-and-fill
— Mitigate: comprehensive search including gray literature, trial registries, contacting authors
— Within-trial cherry-picking of which outcomes to publish
— Detect: compare published outcomes to pre-registered protocol on ClinicalTrials.gov or PROSPERO
— Mitigate: outcome-level risk-of-bias assessment
— Negative trials take longer to publish
— Recent meta-analyses without time-lag sensitivity may overstate effects
— English-only searches miss non-English negative trials more than positive ones
— Tower of Babel bias
— Positive trials cited more; identified more easily in snowball searches
— Different databases (PubMed, Embase, CENTRAL, Scopus) overlap imperfectly
— Single-database searches miss 20–40% of eligible studies
— Confounding by indication is the dominant threat
— Mitigate: restrict to active-comparator new-user designs; instrumental variable analyses
— Sensitivity/specificity vary with disease prevalence and severity mix
— Bivariate or HSROC models handle this; sensitivity analyses by spectrum
— Test result influences whether gold standard is applied
— Inflates apparent sensitivity

— Sensitivity analyses listed a priori
— Distinguishes confirmatory from exploratory
— Explicitly require reporting of sensitivity analyses and their results
— Item 13f: "any sensitivity analyses conducted to assess robustness"
— Report pooled estimate with each sensitivity result side-by-side
— Show forest plots or tables comparing alternatives
— State which sensitivity analyses were pre-specified vs post-hoc
— Use sensitivity results to inform inconsistency and risk-of-bias domains
— Downgrade certainty when results are fragile
— Develop a checklist: search adequacy → RoB → heterogeneity → publication bias → sensitivity analyses → GRADE certainty → applicability
— Use validated tools: AMSTAR-2 for systematic review quality, ROBIS for risk of bias in the review itself
— Recommendations should reference robust meta-analytic estimates
— Conditional recommendations explicitly acknowledge fragility
— Communicate uncertainty alongside point estimates
— Use absolute risks and NNTs, not relative effects alone
— Document discussion of evidence quality in the chart
— Journal club discussions should routinely include "what sensitivity analyses were performed, and did the result hold?"
— Trainees should learn to identify the dominant trial in any forest plot

— Continuously updated; identify when new trials change the pooled estimate
— Cochrane and several guideline bodies (e.g., WHO COVID-19 guidelines) use this model
— Publication of a trial larger than the previously largest included
— Publication of any trial doubling the total sample size
— Emergence of a new safety signal
— Routine 2–4 year cycle
— Required information size (RIS) calculation based on type I/II error and assumed effect
— Crossing the monitoring boundary indicates conclusive evidence
— Not crossing indicates further trials needed
— Plot pooled estimate over time
— Reveals when evidence stabilized and whether a recent reversal occurred
— Quality metrics tracking adherence to evidence-based recommendations
— Adjust when meta-analytic evidence base changes
— Patients on long-term therapy based on prior evidence (e.g., aspirin for primary prevention) need periodic reassessment when guidelines change
— Annual visits are a natural checkpoint
— Frame uncertainty honestly: "Current best evidence suggests... but this may change as new trials are published"
— Avoid evidence-based whiplash by waiting for guideline-level consensus before reversing patient-level decisions
— Subscribe to journal alerts, guideline updates (USPSTF, AHA, ACP)
— Attend journal club; participate in institutional EBM rounds
— Teach trainees to interrogate sensitivity analyses, not just point estimates
— Build EBM literacy into resident curriculum
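The cumulative approach described in this section (pooling in chronological order and plotting the estimate over time) can be sketched as follows. This minimal version assumes fixed-effect inverse-variance pooling of log effects with a normal-approximation 95% CI, and it applies no adjustment for repeated looks, which is what trial sequential analysis adds. The function name and trial data are hypothetical.

```python
import math


def cumulative_meta_analysis(studies):
    """studies: iterable of (year, log_effect, std_error).

    Returns one (year, ratio, ci_low, ci_high) row per study added,
    pooling all studies published up to and including that one.
    """
    rows = []
    sum_w = 0.0
    sum_wy = 0.0
    for year, y, se in sorted(studies, key=lambda s: s[0]):
        w = 1.0 / se ** 2
        sum_w += w
        sum_wy += w * y
        est = sum_wy / sum_w
        half_width = 1.96 / math.sqrt(sum_w)
        rows.append((year,
                     math.exp(est),
                     math.exp(est - half_width),
                     math.exp(est + half_width)))
    return rows


# Hypothetical trials: small early studies with large effects, then a large null trial.
trials = [(1990, -0.9, 0.40), (1992, -0.7, 0.35),
          (1998, -0.2, 0.15), (2005, 0.02, 0.08)]
for year, ratio, lo, hi in cumulative_meta_analysis(trials):
    print(f"through {year}: {ratio:.2f} ({lo:.2f} to {hi:.2f})")
```

In this made-up series the earliest pooled estimate excludes 1 and the final one does not. That is the trajectory of early enthusiasm followed by a reversal once larger trials accumulate.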

— Patients consenting to a treatment based on a meta-analysis deserve disclosure of uncertainty — magnitude of benefit, absolute risks, GRADE certainty
— A "robust" estimate is not the same as "no risk of being wrong"
— Document evidence-based discussion explicitly
— Authors of meta-analyses have an ethical obligation to perform and report sensitivity analyses transparently
— Selective reporting of favorable sensitivity results is research misconduct
— PROSPERO pre-registration and PRISMA reporting standards are professional norms
— Meta-analyses funded by drug manufacturers must be scrutinized for sponsorship bias
— ICMJE requires disclosure; readers should weigh independent vs industry evidence
— Restricting sensitivity analyses to non-industry trials is an ethical safeguard
— Newly identified safety signals from meta-analyses (e.g., suicidality with SSRIs in adolescents, CV risk with rosiglitazone) trigger FDA black-box warnings and clinician duty to inform patients
— Failure to update practice after such advisories can constitute substandard care
— A medication started when an earlier meta-analytic estimate prevailed may rest on an outdated rationale by the patient's next admission
— Medication reconciliation should include reassessing indication, not just dose
— Concrete Step 3 example: a patient on long-term aspirin for primary prevention started in 2015 should have the indication revisited at the next outpatient or hospital transition; the 2022 USPSTF recommendation advises against initiating aspirin for primary prevention in adults ≥60, so continuation should be reassessed individually
— Meta-analyses underrepresenting minority populations risk perpetuating disparities
— Ethically, recommendations should acknowledge limited generalizability
— Rapidly adopting fragile meta-analytic findings can cause harm (e.g., flecainide post-MI based on early surrogate-endpoint evidence)
— Institutional EBM committees provide safety oversight
— Practice clearly counter to high-certainty meta-analytic and guideline evidence increases malpractice risk
— Documentation of shared decision-making mitigates this


— "After excluding the largest trial, the pooled OR shifts from 0.65 (95% CI 0.50–0.85) to 0.92 (95% CI 0.75–1.13). What is the best interpretation?"
— Answer: The original pooled estimate is driven by a single trial; the meta-analytic evidence is fragile and does not robustly support intervention.
— "Funnel plot is asymmetric; Egger's test p=0.04; trim-and-fill adjusted OR crosses 1.0."
— Answer: Publication bias likely inflated the original estimate; recommendation should not be based on this meta-analysis alone.
— "I²=82%, prediction interval crosses the null."
— Answer: Substantial heterogeneity limits applicability of the pooled estimate; explore sources via meta-regression or restrict to clinically homogeneous subset.
— Fixed-effect OR 0.70 (significant); random-effects OR 0.90 (non-significant).
— Answer: Heterogeneity is driving the difference; random-effects is more appropriate when between-study variability exists.
— Observational meta-analysis suggests benefit; subsequent large RCT shows no effect or harm.
— Answer: Confounding (e.g., by indication or healthy-user effects) likely explained the apparent benefit; trust the RCT.
— Overall meta-analysis null, but post-hoc subgroup in patients aged 60–70 shows benefit.
— Answer: Hypothesis-generating only; do not change practice based on post-hoc subgroup.
— Direct comparison favors Drug A; indirect comparison favors Drug B; node-splitting shows inconsistency.
— Answer: Transitivity assumption violated; rankings unreliable.
— A 2010 meta-analysis supported intervention; 2023 update including 3 large RCTs shows no effect.
— Answer: Practice should follow updated evidence; reassess patients on long-term therapy.
— Effect is present in industry-sponsored trials only.
— Answer: Sponsorship bias suspected; interpret cautiously.
— k=4, total n=380, pooled OR significant.
— Answer: Insufficient evidence; recommend further RCTs before practice change.

Sensitivity analysis is the methodological stress test that distinguishes a meta-analysis whose conclusion you can act on from one you cannot — robust effects survive leave-one-out, alternative models, low-RoB restriction, and publication-bias adjustment; fragile effects do not, and fragile evidence should not change practice.
— Leave-one-out is the single most commonly tested sensitivity analysis on Step 3 — identifies trials driving the pooled estimate.
— Random-effects with HKSJ is the safest default model when heterogeneity exists or k is small; report I² and prediction intervals alongside the pooled estimate.
— Funnel plot + Egger's test + trim-and-fill assess publication bias; if the adjusted estimate crosses the null, the original recommendation is unreliable.
— GRADE certainty integrates sensitivity findings into a clinically usable rating (high/moderate/low/very low); strong recommendations require robust, high-certainty evidence.
— Post-hoc subgroup analyses are hypothesis-generating only; pre-specified, biologically plausible subgroups with significant interaction tests are confirmatory.
— Observational meta-analyses showing modest effects (RR 1.2–1.5) are highly vulnerable to residual confounding; trust RCTs when they contradict.
— Step 3 clinical action: When a meta-analysis is fragile, the right answer is usually "continue standard of care," "recommend further trials," or "individualize through shared decision-making" — not immediate practice change.
— Ethical duty: Disclose evidence uncertainty during informed consent; reassess long-standing therapies (e.g., aspirin for primary prevention) when the underlying evidence base evolves.
— Board pearl: A robust meta-analysis is one whose direction and clinical significance persist across multiple sensitivity analyses — exact point estimates will drift; what matters is whether the bottom line for the patient changes.

