Biostatistics & Population Health
Logistic regression and odds ratio interpretation
— Outcome is dichotomous and you want the probability of the event as a function of predictors.
— You need to adjust for confounders (age, sex, comorbidity) in an observational cohort or case-control study.
— A linear regression would be inappropriate because predicted probabilities must lie between 0 and 1.
— log(odds of outcome) = β₀ + β₁X₁ + β₂X₂ + …
— Each β coefficient, when exponentiated (e^β), yields an adjusted odds ratio (aOR) for that predictor.
— Vignettes describe a published study ("after adjustment for age and smoking, the OR for MI was 2.4 [95% CI 1.6–3.5]") and ask what the number means, whether it is significant, or whether causation can be inferred.
— Common in case-control studies, where logistic regression is the natural analytic tool because incidence cannot be directly calculated.
— "Adjusted odds ratio," "multivariable model," "controlling for…," or a binary endpoint with multiple covariates.
Board pearl: If the outcome is binary and the question gives an odds ratio with 95% CI, the underlying analysis is almost always logistic regression. If the outcome is time-to-event (with censoring), think Cox proportional hazards and hazard ratios instead — a frequent distractor pair on Step 3.
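The model form above (log-odds as a linear function of predictors, with e^β giving the adjusted odds ratio) can be sketched numerically. This is a minimal illustration, not output from any real study: the coefficient values `beta0`, `beta_smoking`, and `beta_age` are made-up assumptions chosen only to show the arithmetic.

```python
import math

# Hypothetical fitted coefficients on the log-odds scale (illustrative only).
beta0 = -3.0          # intercept: baseline log-odds when all predictors are 0
beta_smoking = 0.875  # coefficient for smoking (1 = smoker, 0 = non-smoker)
beta_age = 0.04       # coefficient per additional year of age


def adjusted_odds_ratio(beta: float) -> float:
    """Exponentiate a log-odds coefficient to obtain an odds ratio."""
    return math.exp(beta)


def predicted_probability(smoker: int, age: float) -> float:
    """Inverse-logit of the linear predictor; the result always lies in (0, 1)."""
    log_odds = beta0 + beta_smoking * smoker + beta_age * age
    return 1.0 / (1.0 + math.exp(-log_odds))


aor_smoking = adjusted_odds_ratio(beta_smoking)
p_smoker_60 = predicted_probability(smoker=1, age=60)
print(f"aOR for smoking: {aor_smoking:.2f}")
print(f"Predicted probability for a 60-year-old smoker: {p_smoker_60:.2f}")
```

Because the probability comes from the inverse-logit transform, it cannot fall outside 0 to 1, which is the reason linear regression is avoided for a binary outcome.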

— A case-control study of patients with pancreatic cancer vs. matched controls reports an aOR for heavy alcohol use of 1.8 (95% CI 1.2–2.7) after adjusting for smoking, BMI, and diabetes.
— A retrospective cohort of postoperative patients models 30-day readmission against age, ASA class, and discharge disposition.
— A cross-sectional survey examines factors associated with vaccine uptake.
— Outcome is described in binary terms ("developed delirium," "experienced readmission," "tested positive").
— Effect estimates reported as OR or adjusted OR, not RR, HR, or mean difference.
— Mention of multiple covariates being "controlled for" or "entered into the model."
— Case-control design → must use OR (cannot compute incidence) → logistic regression is the standard.
— Cohort design with binary outcome and no time-to-event focus → logistic regression is appropriate, though RR is often preferable when the outcome is common.
— Rare outcome (<10%) → OR ≈ RR, so logistic regression OR can be interpreted similarly to RR.
— Common outcome (>10%) → OR overstates RR; this is a classic Step 3 trap.
Key distinction: Odds ratio ≠ relative risk. The OR approximates RR only when the outcome is uncommon. When a stem reports a 40% event rate and an OR of 3.0, the true RR is substantially smaller — recognizing this is a high-yield Step 3 testing point about interpretation literacy, not calculation.
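To see how far the OR drifts from the RR as the outcome becomes common, the sketch below uses the standard algebraic conversion RR = OR / (1 − P₀ + P₀ × OR), where P₀ is the risk in the unexposed group. Treating the stem's "40% event rate" as the unexposed-group risk is an assumption made here for illustration. The conversion is exact for a crude 2×2 comparison and only approximate when applied to an adjusted OR.

```python
def odds_ratio_to_relative_risk(odds_ratio: float, baseline_risk: float) -> float:
    """Convert an OR to an RR given the risk in the unexposed group (P0)."""
    if not 0.0 <= baseline_risk < 1.0:
        raise ValueError("baseline_risk must be in [0, 1)")
    return odds_ratio / (1.0 - baseline_risk + baseline_risk * odds_ratio)


# Same OR, two different baseline risks (both values are illustrative).
rr_common = odds_ratio_to_relative_risk(3.0, 0.40)  # common outcome
rr_rare = odds_ratio_to_relative_risk(3.0, 0.01)    # rare outcome
print(f"OR 3.0 with 40% baseline risk -> RR about {rr_common:.2f}")
print(f"OR 3.0 with 1% baseline risk  -> RR about {rr_rare:.2f}")
```

With a rare outcome the RR stays close to the OR; with a 40% baseline risk it falls well below 3.0, which is the interpretive point the stem is testing.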

— Sample size adequacy: rule of thumb is ~10 outcome events per predictor variable. A model with 5 covariates needs ≥50 events.
— Confidence interval width: very wide CIs (e.g., OR 2.5, 95% CI 0.4–18) signal sparse data or overfitting.
— Goodness-of-fit: Hosmer–Lemeshow test (non-significant p-value = adequate fit) and C-statistic (AUC) for discrimination (>0.7 acceptable, >0.8 strong).
— Calibration: predicted vs. observed event rates across risk deciles.
— Multicollinearity: highly correlated predictors (e.g., BMI and weight) inflate standard errors and produce unstable ORs.
— Extremely large OR (e.g., 25) with CI crossing or near 1 in a small subgroup.
— Missing data handling not described.
— No mention of how confounders were selected (cherry-picked covariates → residual confounding).
— Confounder: adjusting changes the OR meaningfully (>10%).
— Effect modifier (interaction): the OR differs across strata of a third variable; reported as separate ORs or an interaction term.
Board pearl: A logistic regression OR with a 95% CI that includes 1.0 is not statistically significant at α=0.05, regardless of how large the point estimate looks. This is the single most testable interpretive fact about ORs on Step 3 — examinees consistently miss it when distracted by a dramatic point estimate.
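The point about a large estimate with a CI that includes 1 can be illustrated with a crude 2×2 table. The sketch below uses the standard log-OR (Woolf) interval; the cell counts are invented for illustration, and a multivariable logistic model would instead derive its CI from the fitted coefficient's standard error.

```python
import math


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Crude OR and Wald 95% CI from a 2x2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20/100 exposed and 10/100 unexposed had the outcome.
or_hat, ci_low, ci_high = odds_ratio_with_ci(20, 80, 10, 90)
significant = ci_low > 1.0 or ci_high < 1.0
print(f"OR {or_hat:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}); significant: {significant}")
```

Here the odds appear to more than double, yet the interval still reaches just below 1, so the result is not significant at α=0.05.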

— OR = 1: no association between exposure and outcome.
— OR > 1: exposure associated with increased odds of outcome.
— OR < 1: exposure associated with decreased odds (protective).
— Same interpretation, but controlling for the other variables in the model.
— "aOR 1.7 for smoking" means smokers have 1.7× the odds of the outcome compared to non-smokers, holding other covariates constant.
— OR is per one-unit increase in the predictor.
— Example: aOR 1.04 per year of age means each additional year raises odds by 4%. Over 10 years, odds multiply by 1.04¹⁰ ≈ 1.48.
— Step 3 stems sometimes report per-SD or per-10-unit increases — read carefully.
— OR is for that category vs. the reference group (often the lowest or "none" category).
— Determined by 95% CI not crossing 1, or equivalently p < 0.05.
— A "borderline" CI (e.g., 1.01–2.50) is technically significant but clinically fragile.
— A statistically significant OR of 1.05 in a huge dataset may be clinically trivial.
— Conversely, an OR of 3.0 with CI 0.9–10 in a small study is clinically suggestive but underpowered.
Step 3 management: When asked to interpret an aOR, always state (1) direction (risk vs. protective), (2) magnitude, (3) whether the CI excludes 1, and (4) what was adjusted for — examiners reward this structured reading over numerical manipulation.
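The per-unit arithmetic for continuous predictors (an OR per year rescaled to an OR per decade) is simple exponentiation, sketched below. The helper name `rescale_odds_ratio` is hypothetical, and the confidence limits in the example are invented for illustration; the same rescaling applies to each CI bound.

```python
def rescale_odds_ratio(or_per_unit: float, units: float) -> float:
    """OR for a change of `units` in the predictor, given the OR per one unit."""
    return or_per_unit ** units


# aOR 1.04 per year of age, rescaled to per 10 years.
or_per_decade = rescale_odds_ratio(1.04, 10)

# Hypothetical per-year CI (1.02 to 1.06), rescaled the same way.
ci_low_decade = rescale_odds_ratio(1.02, 10)
ci_high_decade = rescale_odds_ratio(1.06, 10)

print(f"Per-decade OR: {or_per_decade:.2f} "
      f"(CI {ci_low_decade:.2f} to {ci_high_decade:.2f})")
```

Note that the effect compounds multiplicatively: ten years is 1.04 raised to the tenth power, not 1 + 10 × 0.04.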

— For race, BMI category, or insurance status, the OR depends entirely on what the reference group is. A stem reporting "OR 2.1 for obese" implicitly references normal-weight individuals.
— Tested by adding a product term (X₁ × X₂) to the model.
— If the interaction term is significant, the effect of one variable depends on the level of another — report stratified ORs, not a single overall OR.
— Example: aspirin's effect on stroke differs by sex → stratify by sex.
— When two predictors are highly correlated (e.g., systolic and diastolic BP), individual ORs become unstable. Drop one or combine them.
— Too many predictors for too few events produces ORs that won't replicate. Validate in an independent dataset.
— Used for matched case-control studies (e.g., 1:1 matched on age and sex). Accounts for the matched design.
— Outcome has >2 unordered categories (e.g., disease A vs. B vs. none).
— Outcome is ordered (mild/moderate/severe).
— Used when predictors outnumber events, to shrink coefficients and improve generalizability.
Key distinction: Logistic regression gives odds ratios; Poisson regression gives rate ratios (events per person-time); Cox regression gives hazard ratios (instantaneous risk over time). Step 3 distractor sets routinely swap these — match the analytic method to the outcome structure, not to which sounds most impressive.
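The product-term approach to interaction described above can be written out explicitly. As a sketch, assume a binary exposure X₁ and a binary modifier X₂ (both symbols follow the model form used earlier; β₃ is the added interaction coefficient):

```latex
\log(\text{odds}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2

\text{OR for } X_1 \text{ when } X_2 = 0: \quad e^{\beta_1}

\text{OR for } X_1 \text{ when } X_2 = 1: \quad e^{\beta_1 + \beta_3}
```

If β₃ = 0, the two stratum-specific ORs are identical and a single overall OR is adequate; if β₃ differs from 0, the OR for X₁ depends on X₂, which is why stratified ORs are reported.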

— Binary outcome, no time component → logistic regression → OR.
— Binary outcome, time-to-event with censoring → Cox proportional hazards → HR.
— Count outcome (number of falls, admissions) → Poisson or negative binomial → IRR.
— Continuous outcome → linear regression → β coefficient (mean difference per unit).
— Ordinal outcome → ordinal logistic → proportional odds.
— Case-control studies (mandatory — incidence unknowable).
— Cross-sectional studies with binary outcomes.
— Cohort studies when the outcome is rare and OR ≈ RR.
— Cohort studies with common outcomes, where readers may misinterpret OR as RR. Modified Poisson regression or log-binomial regression yields RR directly.
— A priori based on causal/DAG reasoning (preferred).
— Change-in-estimate (>10% change in OR when variable is added).
— Avoid stepwise selection — it inflates type I error and is poorly regarded.
— Used in observational studies to balance covariates between exposed and unexposed groups.
— Often paired with logistic regression to estimate the propensity itself.
Board pearl: If a Step 3 stem describes a case-control study, the only correct effect measure is the odds ratio — you literally cannot compute RR or incidence because participants were sampled on outcome status. Selecting "relative risk" as the answer is a guaranteed wrong choice in that setting.
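The outcome-to-method matching listed above can be written as a small lookup. This is only a study aid restating those pairings; the function name and the category labels (`"binary"`, `"count"`, and so on) are hypothetical conventions for this sketch.

```python
def suggest_model(outcome_type: str, censored_time_to_event: bool = False):
    """Return (analysis, effect measure) for a given outcome structure."""
    if outcome_type == "binary" and censored_time_to_event:
        return ("Cox proportional hazards", "hazard ratio")
    lookup = {
        "binary": ("logistic regression", "odds ratio"),
        "count": ("Poisson or negative binomial regression", "incidence rate ratio"),
        "continuous": ("linear regression", "beta coefficient (mean difference per unit)"),
        "ordinal": ("ordinal logistic regression", "proportional odds ratio"),
    }
    if outcome_type not in lookup:
        raise ValueError(f"unrecognized outcome type: {outcome_type}")
    return lookup[outcome_type]


print(suggest_model("binary"))
print(suggest_model("binary", censored_time_to_event=True))
print(suggest_model("count"))
```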

— Interpretation: Smokers have 2.3× the odds of MI compared to non-smokers, holding other listed factors constant; statistically significant (CI excludes 1).
— This does not mean smokers are 2.3× more likely (that would be RR), nor does it prove causation.
— Statins associated with 40% lower odds of dementia after adjustment.
— Protective association; significant. But observational → residual confounding (healthy-user bias) possible.
— Each 1 mg/dL rise increases odds by 4%; per 40 mg/dL (roughly one SD), odds multiply by 1.04⁴⁰ ≈ 4.8.
— Tiny per-unit OR can be clinically large over a realistic range.
— Not significant (CI crosses 1); cannot conclude an association exists.
— Effect modification by age; report stratified, not pooled, ORs.
Step 3 management: When the vignette gives you an aOR, translate it into plain English ("X has 2.3 times the odds of Y, adjusted for…") before evaluating answer choices. Most distractors are reworded errors (confusing OR with RR, missing CI inclusion of 1, ignoring adjustment).
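The plain-English translation habit can be sketched as a small helper that states direction, significance, and adjustment in one sentence. The function names are hypothetical, and the confidence limits and covariates in the first example are invented for illustration.

```python
def ci_excludes_one(ci_low: float, ci_high: float) -> bool:
    """True when the whole interval lies above or below 1."""
    return ci_low > 1.0 or ci_high < 1.0


def interpret_aor(exposure, outcome, aor, ci_low, ci_high, adjusted_for):
    """Build a one-sentence reading of an adjusted odds ratio."""
    if aor > 1.0:
        direction = "higher odds"
    elif aor < 1.0:
        direction = "lower odds"
    else:
        direction = "no difference in odds"
    if ci_excludes_one(ci_low, ci_high):
        verdict = "statistically significant (CI excludes 1)"
    else:
        verdict = "not statistically significant (CI includes 1)"
    covariates = ", ".join(adjusted_for)
    return (f"{exposure} is associated with {direction} of {outcome} "
            f"(aOR {aor}, 95% CI {ci_low} to {ci_high}), adjusted for {covariates}; "
            f"{verdict}; this is an association, not proof of causation.")


print(interpret_aor("Smoking", "MI", 2.3, 1.5, 3.5, ["age", "sex"]))
print(interpret_aor("Exposure X", "outcome Y", 1.8, 0.9, 3.6, ["age"]))
```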

— When the outcome is common (say a 30% risk in the unexposed group), an OR of 3.0 corresponds to an RR of about 1.9. Reporters and readers often quote the OR as if it were an RR, exaggerating the effect size.
— In cross-sectional logistic regression, you cannot tell whether the exposure preceded the outcome.
— Hospital-based case-control studies select controls who may differ systematically (Berkson's bias), distorting ORs.
— Cases remember exposures (e.g., medications during pregnancy) more thoroughly than controls, inflating ORs.
— Including a variable on the causal pathway (a mediator) between exposure and outcome attenuates the estimated total effect of the exposure (overadjustment). Example: adjusting for LDL when studying saturated fat and CAD.
— Very few events in a covariate cell produce wildly inflated ORs with huge CIs. Look for OR > 10 with CI lower bound near 1.
— Testing 20 predictors at α=0.05 yields about 1 false positive on average when no true associations exist. Bonferroni or FDR correction tightens the threshold.
— β₀ gives baseline log-odds when all predictors = 0; rarely clinically meaningful unless predictors are centered.
Board pearl: A statistically significant OR in an observational study does not establish causation. Step 3 will offer "X causes Y" as a tempting answer; the correct response is typically "X is associated with Y, after adjustment for measured confounders" — preserving epistemic humility about unmeasured confounding.
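The multiple-comparisons arithmetic mentioned above can be checked directly. The sketch assumes independent tests with every null hypothesis true, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


n = 20
print(f"Expected false positives: {expected_false_positives(n):.1f}")
print(f"Chance of at least one:   {familywise_error_rate(n):.2f}")
print(f"Bonferroni threshold:     {bonferroni_threshold(n):.4f}")
```

With 20 predictors, the chance of at least one spurious "significant" OR is well over half, which is why a corrected per-test threshold is used.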

— Total sample size is small (<100).
— Number of events is small relative to predictors (<10 events per variable).
— Cells are empty (e.g., no exposed cases) — produces infinite or undefined ORs.
— Exact logistic regression: computes exact p-values for sparse data.
— Firth's penalized likelihood: reduces bias from small samples and separation.
— Bayesian logistic regression: incorporates prior information when data are scarce.
— Collapsing categories or dropping rare predictors.
— Subgroups inevitably have fewer events → wider CIs → less power. A "non-significant" subgroup finding may simply reflect insufficient power, not absence of effect.
— Pre-specified subgroup analyses are more credible than post-hoc.
— Often have small Ns (e.g., dialysis patients, transplant recipients), so reported aORs are imprecise. Look for very wide CIs.
— STROBE for observational studies, TRIPOD for prediction models — both require disclosure of sample size, events, and model performance.
Key distinction: A non-significant OR in an underpowered study does not mean "no effect" — it means the data cannot distinguish the true effect from no effect. The correct interpretation is "inconclusive," not "no association." Step 3 frequently tests this nuance with subgroup-analysis vignettes.
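The events-per-variable rule of thumb (about 10 outcome events per predictor) can be turned into a quick adequacy check. This is a rough screening heuristic rather than a formal power calculation, and the helper names are hypothetical.

```python
def events_per_variable(n_events: int, n_predictors: int) -> float:
    """Outcome events available per predictor in the model."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


def max_supported_predictors(n_events: int, epv_threshold: int = 10) -> int:
    """Largest number of predictors that keeps EPV at or above the threshold."""
    return n_events // epv_threshold


# A model with 5 covariates and 50 events just meets the rule of thumb.
print(events_per_variable(50, 5))     # 10.0
# The same 5 covariates in a subgroup with only 30 events fall short.
print(events_per_variable(30, 5))     # 6.0
print(max_supported_predictors(30))   # 3
```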

— Outcomes (e.g., preterm birth, congenital anomaly, NICU admission) are dichotomous.
— RCTs are limited for ethical reasons, so observational designs dominate.
— Matched case-control designs are common (e.g., matching on maternal age, parity) → use conditional logistic regression.
— Cluster effects: siblings or twins share exposures; ignoring this yields falsely narrow CIs. Use generalized estimating equations (GEE) or mixed-effects logistic regression.
— Time-varying exposures during pregnancy: simple logistic regression cannot handle this; consider trimester-specific models.
— ORs for race/ethnicity must be interpreted as associations, not biological causation — they reflect structural factors confounded with race.
— Step 3 increasingly tests recognition that race is a social construct in epidemiologic models.
— Competing risks (death from another cause) may bias logistic regression of non-fatal outcomes. Cox models with competing-risk adjustment are preferred when applicable.
Board pearl: When a study reports an aOR for a maternal exposure (e.g., SSRI use) and a fetal outcome (e.g., cardiac defect), look for adjustment for the underlying indication (depression itself) — confounding by indication is the classic source of inflated drug-harm signals on Step 3 obstetric biostatistics stems.

— A press release says "drinking coffee doubles risk of cancer" based on OR 2.0 in a case-control study. Two errors: (1) OR ≠ risk; (2) association ≠ causation.
— Researchers may select adjustments that move the OR toward significance. Pre-registration mitigates this.
— Scanning hundreds of predictors yields false-positive ORs by chance. Requires correction (Bonferroni, FDR).
— Reporting only point estimates without CIs hides uncertainty.
— An aOR for age (per year) derived in a 40–70 cohort should not be applied to 20-year-olds.
— Logistic regression assumes linearity of continuous predictors on the log-odds scale. Violations bias coefficients; remedies include splines or categorization.
— Misclassifying a period during which the outcome could not have occurred as exposed follow-up time (immortal time bias) inflates protective ORs (common in drug-effectiveness studies).
— Significant ORs are more likely to be published; meta-analyses may overstate true effects.
Step 3 management: When evaluating a published OR, ask: (1) Was the design appropriate? (2) Were confounders measured and adjusted? (3) Is the CI narrow enough to be informative? (4) Does the conclusion match the design (association vs. causation)? Apply this checklist on every biostats vignette.

— Repeated measures per subject (longitudinal data) → mixed-effects logistic regression or GEE.
— Clustered data (patients within hospitals) → multilevel/hierarchical models.
— Time-to-event outcome with meaningful censoring → switch to Cox regression.
— Strong confounding by indication → propensity score matching or instrumental variables.
— High-dimensional data (genomics, claims data) → penalized regression (lasso, elastic net) or machine-learning alternatives.
— Causal inference goals → targeted maximum likelihood estimation, g-methods, marginal structural models.
— Before data collection (study design, sample size).
— When the planned analysis involves matched, clustered, or longitudinal data.
— When interactions or non-linearities are suspected.
— Before publishing — peer reviewers will catch errors that authors miss.
— STROBE — observational studies.
— TRIPOD — prediction models.
— CONSORT — RCTs (logistic regression often used for binary outcomes within RCTs).
— PRISMA — systematic reviews and meta-analyses.
CCS pearl: On a CCS-style biostatistics-in-clinical-practice item, if your team is interpreting a quality-improvement dataset on readmissions, the appropriate "order" is a risk-adjusted logistic regression controlling for case mix — raw rates can unfairly penalize hospitals caring for sicker patients. Recognize risk adjustment as a core safety/quality concept.
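One common way to express risk adjustment is an observed-to-expected (O/E) ratio, where the expected count is the sum of each patient's predicted probability from the case-mix model. The sketch below assumes those predicted probabilities already exist; the two hospitals and all numbers are invented for illustration.

```python
def observed_to_expected_ratio(observed_events: int, predicted_probabilities) -> float:
    """O/E ratio: observed events divided by the sum of model-predicted risks."""
    expected = sum(predicted_probabilities)
    if expected <= 0:
        raise ValueError("expected event count must be positive")
    return observed_events / expected


# Both hypothetical hospitals readmit 4 of 20 patients (raw rate 20%).
low_risk_case_mix = [0.10] * 20    # healthier patients: 2 readmissions expected
high_risk_case_mix = [0.25] * 20   # sicker patients: 5 readmissions expected

oe_low_risk = observed_to_expected_ratio(4, low_risk_case_mix)
oe_high_risk = observed_to_expected_ratio(4, high_risk_case_mix)
print(f"Hospital with healthier patients: O/E = {oe_low_risk:.2f}")
print(f"Hospital with sicker patients:    O/E = {oe_high_risk:.2f}")
```

Identical raw rates produce opposite conclusions once case mix is considered: the hospital with sicker patients performs better than expected (O/E below 1).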

— Chi-square: tests association between two categorical variables, no adjustment.
— Logistic regression: provides effect size (OR) and allows multivariable adjustment.
— Fisher's: 2×2 tables with small expected counts; no covariate adjustment.
— Logistic: scalable to multiple predictors.
— Both handle binary outcomes; log-binomial yields RR directly, preferred when outcome is common in a cohort. Convergence issues sometimes force fallback to modified Poisson with robust SEs.
— Both model binary outcomes; probit uses normal CDF instead of logistic. Coefficients aren't directly interpretable as ORs. Rarely tested.
— Conditional: matched designs.
— Unconditional: unmatched.
— Multinomial: unordered ≥3 categories.
— Ordinal: ordered ≥3 categories; assumes proportional odds.
— Treats binary outcome with OLS; coefficients = risk differences. Simpler interpretation but can predict probabilities outside [0,1].
Key distinction: A 2×2 contingency table with no covariates → chi-square or Fisher's. The same data with adjustment for confounders → logistic regression. Step 3 stems differentiate by whether "after adjusting for…" appears in the methods description.

— Binary outcome, case-control friendly, ≠ RR when outcome is common.
— Direct probability ratio; intuitive; requires cohort data.
— Clinically most useful for treatment decisions; NNT = 1/ARR.
— Time-to-event with censoring; assumes proportional hazards.
— Counts over person-time.
— Continuous outcomes.
— Effect size for continuous outcomes across different scales.
— Diagnostic test performance; distinct from regression-derived measures.
— Derived from ARR; communicates clinical impact.
When a question asks "which measure best communicates clinical impact to a patient," the answer is usually ARR or NNT, not OR or RR. When the design is case-control, the answer is OR. When the outcome is time-to-event, the answer is HR.
Board pearl: An OR of 5.0 sounds impressive but may correspond to an absolute risk difference of only 2% (a number needed to treat, or to harm, of 50), so clinical communication should emphasize absolute measures. Step 3 ethics/communication items often pair this with informed consent scenarios where patients deserve absolute risk numbers, not relative ones.
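The gap between a large OR and a small absolute difference can be worked through numerically. The sketch assumes a baseline risk of 0.5%, an illustrative value chosen so the result lands near the 2% figure above; the function name is hypothetical.

```python
def exposed_risk_from_or(baseline_risk: float, odds_ratio: float) -> float:
    """Risk in the exposed group implied by a baseline risk and an odds ratio."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    exposed_odds = baseline_odds * odds_ratio
    return exposed_odds / (1.0 + exposed_odds)


baseline_risk = 0.005  # assumed 0.5% risk without the exposure
exposed_risk = exposed_risk_from_or(baseline_risk, 5.0)
risk_difference = exposed_risk - baseline_risk
number_needed = 1.0 / risk_difference

print(f"Risk without exposure: {baseline_risk:.1%}")
print(f"Risk with exposure:    {exposed_risk:.1%}")
print(f"Absolute difference:   {risk_difference:.1%}")
print(f"Number needed:         {number_needed:.0f}")
```

A fivefold increase in odds here moves the absolute risk by roughly two percentage points, so about 50 people would need to be exposed for one additional event.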

— Pre-specify the model: outcome, predictors, interactions, sensitivity analyses.
— Register the protocol (ClinicalTrials.gov, OSF) before analysis.
— Report all covariates considered, not only those retained.
— Provide full CI and exact p-values, not just "p<0.05."
— Include a sensitivity analysis (e.g., complete-case vs. multiple imputation for missing data).
— Validate prediction models in an independent cohort.
— Build a habit of reading methods before results.
— Check whether the design supports the conclusion (cross-sectional cannot establish temporality).
— Look for competing risks in elderly cohorts.
— Note whether effect modifiers were tested.
— Confirm that the reference category is clinically meaningful.
— An aOR from observational data should rarely change practice alone; integrate with RCT evidence and biological plausibility (Bradford Hill criteria).
— Meta-analyses pool ORs across studies; check for heterogeneity (I² statistic).
Step 3 management: When a clinic adopts a regression-based risk-prediction tool (e.g., ASCVD risk, Wells score, MELD), confirm it was externally validated in a population similar to yours. Unvalidated, locally derived models are often miscalibrated, leading to over- or under-treatment.

— Discrimination (C-statistic / AUC): ability to rank patients; 0.5 = chance, 1.0 = perfect. 0.7–0.8 acceptable.
— Calibration: agreement between predicted and observed event rates across deciles; visualized as calibration plot.
— Brier score: mean squared error of predicted probabilities; lower is better.
— Net reclassification improvement (NRI) and integrated discrimination improvement (IDI): gauge whether a new predictor adds incremental value.
— Patient population shifts (demographics, comorbidities).
— Treatment changes (improved care lowers event rates).
— Coding changes (ICD-9 → ICD-10).
— Care-setting changes.
— Use risk scores as decision aids, not substitutes for clinical judgment.
— Communicate absolute predicted risk to patients, with uncertainty.
— Document shared decision-making, especially when crossing treatment thresholds (e.g., ASCVD 7.5% for statin initiation).
— Audit model outputs vs. observed outcomes annually.
— Retrain or recalibrate when calibration degrades.
Board pearl: A model with an excellent C-statistic but poor calibration overstates risk in some patients and understates it in others — and is therefore dangerous for individual decision-making. Discrimination and calibration are separate properties; both must be acceptable. Step 3 distractor sets often praise a "high AUC" while ignoring calibration failure.
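That discrimination and calibration are separate properties can be shown with a toy example: halving every predicted probability keeps the patient ranking (and so the C-statistic) unchanged while making the probabilities less accurate. The four patients and their predictions are invented, and the pairwise C-statistic below is a plain-Python sketch rather than a library routine.

```python
def brier_score(outcomes, probabilities) -> float:
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, probabilities)) / len(outcomes)


def c_statistic(outcomes, probabilities) -> float:
    """Share of (event, non-event) pairs ranked correctly; ties count as half."""
    events = [p for y, p in zip(outcomes, probabilities) if y == 1]
    non_events = [p for y, p in zip(outcomes, probabilities) if y == 0]
    score = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                score += 1.0
            elif p_event == p_non_event:
                score += 0.5
    return score / (len(events) * len(non_events))


outcomes = [1, 0, 1, 0]
original = [0.9, 0.2, 0.6, 0.4]
halved = [p / 2 for p in original]  # same ranking, systematically too low

print(c_statistic(outcomes, original), brier_score(outcomes, original))
print(c_statistic(outcomes, halved), brier_score(outcomes, halved))
```

Both sets of predictions rank every event above every non-event, but the halved predictions have a worse (higher) Brier score, so a high AUC alone says nothing about whether the predicted risks are trustworthy.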

— Logistic regression models trained on biased data perpetuate disparities. A readmission model that includes ZIP code or insurance can systematically under-allocate resources to disadvantaged populations.
— Step 3 increasingly tests recognition of fairness audits and the need to disaggregate model performance by race, sex, and SES.
— Many traditional models (eGFR, ASCVD) historically included race coefficients. Current guidance (e.g., 2021 NKF-ASN eGFR) removes race from eGFR; clinicians must recognize this transition and update order sets.
— When using a risk calculator (e.g., breast cancer Gail model, ASCVD) to guide decisions about chemoprevention or statins, present absolute risk and CI, not just relative measures.
— Document the discussion in the medical record.
— Logistic regression models built on EHR data require IRB approval and HIPAA-compliant data handling.
— Patients should be informed when their data fuel predictive analytics.
— Reporting of clinical research follows ICMJE/FDA mandates; misrepresenting an OR or selectively reporting predictors constitutes research misconduct.
— Discharge prediction tools (e.g., LACE for readmission) must be communicated to outpatient providers; failure to transmit risk scores is a documented care-transition safety gap.
Step 3 management: Before deploying any risk-prediction algorithm in your practice, ask (1) was it validated in patients like mine, (2) does it perform equitably across demographic subgroups, and (3) is the patient informed that an algorithm is contributing to their care plan? These three questions cover the bias, validation, and consent triad central to modern Step 3 ethics items.

Board pearl: Memorize the rule "case-control → OR; cohort/RCT → RR or HR; cross-sectional → prevalence OR." Matching study design to effect measure is among the most testable biostatistics patterns on Step 3 and underlies most regression-related stems.

— "After adjustment for age, BMI, and smoking, aOR for outcome was 2.4 (95% CI 1.6–3.6). Which is the best interpretation?"
— Correct: associated with higher odds, statistically significant, after adjustment; does not prove causation; OR ≠ RR.
— "aOR 1.8 (95% CI 0.9–3.6). Conclusion?"
— Correct: not statistically significant; cannot conclude an association.
— "Investigators want to study factors associated with 30-day readmission after MI in 5000 patients. Which analysis is most appropriate?"
— Correct: multivariable logistic regression (binary outcome). Distractors: linear regression, chi-square, Cox (would be correct only if time-to-readmission with censoring matters).
— High-prevalence outcome with reported OR; asked about true RR.
— Correct: OR overstates RR; true RR is smaller.
— Asked for the appropriate measure of association.
— Correct: odds ratio (RR uncomputable).
— Sicker patients more likely to receive drug; drug appears harmful in unadjusted analysis. Asked for the source of bias.
— Correct: confounding by indication; remedied by adjustment or propensity scores.
— Non-significant in a small subgroup despite overall significance. Asked for interpretation.
— Correct: inadequately powered, not "no effect."
— High C-statistic, poor calibration.
— Correct: discrimination good, calibration poor; unsuitable for individual risk prediction without recalibration.
Step 3 management: Before selecting an answer, restate the stem in your own words ("binary outcome, case-control, adjusted for X, CI excludes 1") — this disciplined translation eliminates 3 of 5 distractors on most biostatistics items.

Logistic regression models a binary outcome and yields odds ratios, interpreted as the multiplicative change in odds per unit (or category) of a predictor, adjusted for the other covariates in the model. An OR is statistically significant at α=0.05 only when its 95% CI excludes 1, approximates relative risk only when the outcome is rare, and in observational data demonstrates association rather than causation.
Board pearl: When the Step 3 stem describes a binary outcome, adjusted analyses, and reports an OR with 95% CI — translate it as "X-fold odds, adjusted for these covariates, significant only if CI excludes 1, associated not causal" — that single sentence resolves the majority of logistic regression items you will encounter on test day.

