Biostatistics & Population Health
Linear regression: slope, intercept, and R-squared interpretation
— Model: Y = β₀ + β₁X + ε
— β₀ = intercept (predicted Y when X = 0)
— β₁ = slope (change in Y per 1-unit change in X)
— ε = residual error (assumed normally distributed with mean 0)
— Outcome is continuous (blood pressure, HbA1c, LDL, BMI, length of stay, FEV1)
— Predictor(s) can be continuous, ordinal, or categorical (dummy-coded)
— Goal is to quantify magnitude of association or predict a numeric value, not just test "is there a difference"
— Binary outcome (yes/no, dead/alive) → logistic regression (odds ratios)
— Time-to-event outcome → Cox proportional hazards (hazard ratios)
— Count outcome (ED visits/year) → Poisson regression
— Simple = one predictor
— Multiple = ≥2 predictors; allows adjustment for confounders (age, sex, comorbidity index)
Board pearl: If the outcome is continuous and the question asks "how much does Y change per unit X," the answer is slope (β₁) from linear regression — not correlation coefficient (r), which only describes direction/strength on a unitless −1 to +1 scale.

— "Investigators examined the association between daily sodium intake (mg) and 24-hour ambulatory systolic BP (mmHg)…"
— "A regression equation was derived: SBP = 105 + 0.012 × sodium (mg/day)"
— "The coefficient of determination was 0.28"
— Sample size (n) → drives precision of β estimates and width of CIs
— Units of X and Y → essential for interpreting slope magnitude
— Whether the model is unadjusted (crude) or adjusted for covariates
— Reported p-value vs 95% CI for β (CI crossing 0 = not statistically significant)
— "Best interpretation of the slope of 0.012?" → For every 1 mg increase in sodium, SBP rises by 0.012 mmHg on average
— "Best interpretation of intercept of 105?" → Predicted SBP when sodium intake = 0 mg/day (often biologically implausible — flag it)
— "What does R² = 0.28 mean?" → 28% of the variability in SBP is explained by sodium intake; the remaining 72% is due to other factors / unmeasured variance
— Confusing R² with r (correlation): in simple regression, r = ±√R², taking the sign of the slope
— Confusing slope with relative risk or odds ratio (wrong model family)
— Assuming causation from a regression coefficient — regression quantifies association, not causation, unless the design supports it (RCT, instrumental variables)
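Key distinction: A statistically significant slope (p < 0.05, CI excludes 0) tells you the association is unlikely due to chance, but says nothing about clinical importance. Judge importance over a plausible exposure range, not per single unit: 0.012 mmHg per mg sodium looks negligible, yet it corresponds to about 12 mmHg per 1,000 mg/day, whereas a significant slope that stays below the minimal clinically important difference across the whole plausible range is clinically trivial. Always evaluate effect size × plausible exposure range before calling a finding meaningful.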
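As a rough illustration of the arithmetic in this vignette, the sketch below plugs intake values into the stem's equation SBP = 105 + 0.012 × sodium and rescales the slope to a 1,000-mg increment. The function name `predicted_sbp` and the three intake values are hypothetical choices for illustration; only the two coefficients come from the stem.

```python
# Coefficients taken from the stem's equation: SBP = 105 + 0.012 * sodium (mg/day)
INTERCEPT_MMHG = 105.0
SLOPE_MMHG_PER_MG = 0.012


def predicted_sbp(sodium_mg_per_day):
    """Point prediction of mean SBP (mmHg) from the stem's regression equation."""
    return INTERCEPT_MMHG + SLOPE_MMHG_PER_MG * sodium_mg_per_day


# Rescale the slope to a clinically meaningful increment (per 1,000 mg/day).
slope_per_1000_mg = SLOPE_MMHG_PER_MG * 1000

# Hypothetical intakes, chosen only to illustrate the calculation.
for intake in (1500, 2300, 3500):
    print(f"{intake} mg/day -> predicted SBP {predicted_sbp(intake):.1f} mmHg")
print(f"Slope per 1,000 mg/day: {slope_per_1000_mg:.1f} mmHg")
```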

— Scatterplot of Y vs X: should show roughly linear cloud, no obvious curve
— Residual vs fitted plot: points should scatter randomly around 0; a funnel shape = heteroscedasticity; a U-shape = nonlinearity
— Q-Q plot of residuals: points on the diagonal = normal residuals; S-curve = skewed errors
— Estimate (β): the slope or intercept
— Standard error (SE): precision of β
— t-statistic: β / SE
— p-value: tests H₀: β = 0
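— 95% CI: β ± 1.96 × SE (large-sample approximation; small samples use the t critical value with n − k − 1 df); if the CI excludes 0, the slope is significant
— R² ranges from 0 to 1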
— Proportion of total variance in Y explained by the model
— R² = 0.0 → predictors explain nothing; R² = 1.0 → perfect fit
— In medicine, R² of 0.2–0.4 is common and not "bad" — biological variation is large
— A single extreme observation can dramatically pull the slope — assess via Cook's distance or leverage plots
— Stems may show a scatterplot with one obvious outlier and ask how removal affects the slope
Board pearl: R² ≠ accuracy of prediction for an individual. A model can have R² = 0.30 (explains 30% of population-level variance) yet predict any single patient's value poorly. For individual-level prediction precision, you need the prediction interval, which is always wider than the confidence interval around the mean response. Step 3 loves this distinction in research-vignette items.
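To make the output-table columns concrete, here is a minimal sketch that computes the estimate, SE, t-statistic, and 95% CI for a simple regression by hand. The five data points are hypothetical toy values, the helper name `simple_ols` is an illustrative choice, and the critical value 3.182 is the tabulated two-sided 95% t value for 3 degrees of freedom.

```python
import math


def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns intercept, slope, and slope SE."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = mean_y - b1 * mean_x           # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_var = ss_res / (n - 2)        # residual variance, df = n - k - 1 with k = 1
    se_b1 = math.sqrt(resid_var / sxx)  # standard error of the slope
    return b0, b1, se_b1


# Hypothetical toy data (not from any study).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1, se_b1 = simple_ols(x, y)
t_stat = b1 / se_b1

# Two-sided 95% critical value for df = n - 2 = 3, taken from a t-table.
T_CRIT_DF3 = 3.182
ci_low = b1 - T_CRIT_DF3 * se_b1
ci_high = b1 + T_CRIT_DF3 * se_b1

print(f"intercept={b0:.2f} slope={b1:.2f} SE={se_b1:.3f} t={t_stat:.2f}")
print(f"95% CI for slope: {ci_low:.2f} to {ci_high:.2f}")
```

With only five points the interval spans 0, so this toy slope would not be called significant; as n grows, the t critical value approaches 1.96.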

— If X is in mg/day and Y in mmHg, slope is in mmHg per mg/day
— Re-expressing X in 100-mg increments (i.e., dividing X by 100) multiplies the reported slope by 100: same biology, different presentation
— Positive β₁ → Y increases as X increases (direct association)
— Negative β₁ → Y decreases as X increases (inverse association)
— β₁ = 0 → no linear association (but could still have a nonlinear one)
— H₀: β₁ = 0 (no linear relationship)
— Reject if p < α (usually 0.05) OR if 95% CI for β₁ excludes 0
— Equivalent to t-test of β₁/SE against t-distribution with n−k−1 df
— Step 1: State units ("per 1-unit X, Y changes by β units")
— Step 2: Scale to a clinically meaningful X increment (e.g., per 10 mmHg, per 1 SD)
— Step 3: Compare to minimal clinically important difference (MCID)
— For binary X (0/1, e.g., smoker vs nonsmoker), slope = mean difference in Y between the two groups
— Equivalent to an independent-samples (pooled-variance) t-test when there are no covariates
— βᵢ represents change in Y per unit Xᵢ holding all other predictors fixed
— Allows separating effects of correlated exposures (e.g., BMI and waist circumference)
Step 3 management: When asked "best interpretation of the coefficient for smoking (β = 8.4) in a model predicting systolic BP adjusted for age and BMI," the answer is: smokers have an average SBP 8.4 mmHg higher than nonsmokers of the same age and BMI — not "smoking causes" and not "8.4% higher." Precision in wording wins points.
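The unadjusted version of the binary-predictor rule can be checked numerically. This sketch assumes hypothetical SBP values for two small groups and an illustrative helper `ols_slope`; it shows only the no-covariate case, not the age- and BMI-adjusted model described in the vignette.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical SBP values (mmHg), for illustration only.
nonsmokers = [118, 122, 125, 130]
smokers = [128, 131, 135]

# Dummy-code the predictor: 0 = nonsmoker, 1 = smoker.
x = [0] * len(nonsmokers) + [1] * len(smokers)
y = nonsmokers + smokers

slope = ols_slope(x, y)
mean_diff = sum(smokers) / len(smokers) - sum(nonsmokers) / len(nonsmokers)

print(f"slope for smoking dummy: {slope:.2f} mmHg")
print(f"difference in group means: {mean_diff:.2f} mmHg")
```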

— In SBP = 105 + 0.012 × sodium, β₀ = 105 mmHg = predicted SBP at sodium intake of 0 mg/day
— Intercepts often correspond to impossible or extrapolated scenarios (sodium = 0, age = 0, weight = 0)
— Their numeric value is necessary for the equation to fit but should not be over-interpreted clinically
— If X is replaced by (X − mean X), the intercept becomes the predicted Y at the average value of X, which is interpretable
— Common in pediatric growth models, pharmacokinetics, and adjusted analyses
— Plug a specific X into Y = β₀ + β₁X to obtain a point estimate of Y
— Report uncertainty using a confidence interval (for the mean Y at that X) or a prediction interval (for a single new individual)
— Prediction interval >> CI because it adds residual variance
— Predicting Y for X values outside the observed range of the training data is unreliable — the linear relationship may not hold
— Step 3 may show a model built on adults age 30–70 then ask about predicted BP at age 12; the correct answer flags extrapolation
— Express slope in SD units of Y per SD of X
— Allow comparing relative importance of predictors measured in different units (e.g., comparing effect of age in years vs LDL in mg/dL)
Key distinction: Confidence interval answers "where is the true mean Y at this X?" Prediction interval answers "where will the next individual patient's Y land at this X?" The prediction interval is the right tool when counseling one patient about expected outcomes from a regression-based nomogram or risk calculator.
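A small numeric sketch of the two intervals, using the standard simple-regression formulas. The data are hypothetical toy values, the function name is illustrative, and the t critical value (3.182 for 3 df) is supplied by hand from a t-table.

```python
import math


def mean_ci_and_prediction_interval(x, y, x0, t_crit):
    """Return (fitted value, CI half-width, PI half-width) at x0 for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))          # residual standard deviation
    h = 1 / n + (x0 - mean_x) ** 2 / sxx     # uncertainty in the fitted mean at x0
    ci_half = t_crit * s * math.sqrt(h)      # interval for the mean Y at x0
    pi_half = t_crit * s * math.sqrt(1 + h)  # adds one individual's residual variance
    return b0 + b1 * x0, ci_half, pi_half


# Hypothetical toy data (not from any study).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

fit, ci_half, pi_half = mean_ci_and_prediction_interval(x, y, x0=3, t_crit=3.182)
print(f"fitted Y at X=3: {fit:.2f}")
print(f"95% CI for the mean:       +/- {ci_half:.2f}")
print(f"95% prediction interval:   +/- {pi_half:.2f}")
```

The extra "1 +" under the square root is the only difference between the two formulas, which is why the prediction interval can never be narrower than the confidence interval.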

— SS_total = total variability in Y around its mean
— SS_residual = variability left after fitting the model
— Interpretation: proportion of variance in Y explained by the predictor(s)
— "R² = 0.45 means 45% of the variability in LDL is explained by dietary fat intake"
— Does NOT mean "45% of patients are correctly classified" (that's a classification accuracy concept — wrong model family)
— Does NOT mean "45% chance of causation"
— In simple linear regression, R² = r² (square of Pearson correlation)
— So r = 0.6 → R² = 0.36 (36% of variance explained)
— Sign of r matches sign of slope; R² is always non-negative
— Adjusts for number of predictors (k) and sample size (n)
— Adding random noise variables raises raw R² but can lower adjusted R²
— Preferred metric when comparing multivariable models
— Behavioral/lifestyle outcomes: R² 0.05–0.25 typical (many unmeasured factors)
— Physiologic/lab outcomes with strong mechanistic predictors: R² 0.4–0.8
— Prediction models for individual risk should also report calibration and discrimination (C-statistic), not just R²
— A predictor can have a highly significant slope and small R² if the sample is large
— Conversely, small samples can show large R² by chance (overfitting)
Board pearl: When a vignette pairs p < 0.001 with R² = 0.04, the correct reading is: the association is real (unlikely chance) but the predictor explains only 4% of outcome variability — clinically modest. Step 3 distractors will conflate significance with magnitude; pick the option that separates them.
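The quantities in this section can be tied together with the usual definitions R² = 1 − SS_residual / SS_total and adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). The sketch below assumes hypothetical toy outcomes and the fitted values from their least-squares line; the function names are illustrative.

```python
import math


def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_residual / ss_total


def adjusted_r_squared(r2, n, k):
    """Penalize R^2 for the number of predictors k given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical outcomes and the fitted values from the least-squares
# line y = 2.2 + 0.6*x evaluated at x = 1..5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
slope = 0.6

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y), k=1)
r = math.copysign(math.sqrt(r2), slope)  # in simple regression, r takes the slope's sign

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, r = {r:.3f}")
```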

— Which predictors to include — guided by prior knowledge, DAGs, confounders, not just p-values
— Whether to include interaction terms (X₁ × X₂) — tests if the effect of one predictor depends on another (effect modification)
— Whether to model nonlinear effects via polynomial terms (X²) or splines
— Include known confounders even if their own p-value is non-significant — purpose is to adjust the exposure-outcome estimate
— "Table 2 fallacy": don't interpret adjusted coefficients of covariates as causal effects of those covariates on Y
— When predictors are highly correlated (e.g., weight and BMI), coefficients become unstable, with inflated SEs and wide CIs
— Diagnosed with Variance Inflation Factor (VIF); VIF > 5–10 = problematic
— Fix by dropping one predictor, combining them, or using regularization (ridge/LASSO)
— Model: Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁×X₂)
— Effect of X₁ on Y depends on level of X₂; report stratified slopes
— Right-skewed Y (e.g., triglycerides, hospital costs) often log-transformed to meet normality/homoscedasticity assumptions
— After log-transforming Y, 100 × β approximates the % change in Y per unit X when β is small; the exact value is 100 × (e^β − 1)
Step 3 management: If a vignette describes a model where adding "waist circumference" to a model already containing "BMI" causes both coefficients to lose significance and SEs to balloon, the diagnosis is multicollinearity, and the recommended action is to remove one of the redundant predictors or use a composite — not to keep adding more variables.
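For the two-predictor case in this vignette, the VIF reduces to 1 / (1 − r²), where r is the correlation between the two predictors. The sketch below assumes hypothetical BMI and waist values and illustrative function names; with more than two predictors, each VIF would instead need the R² from regressing that predictor on all the others.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """With exactly two predictors, VIF = 1 / (1 - r^2) for both of them."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical, strongly correlated predictors.
bmi = [22, 25, 27, 30, 33, 36]
waist_cm = [78, 88, 92, 101, 108, 118]
print(f"VIF (BMI vs waist): {vif_two_predictors(bmi, waist_cm):.1f}")

# Hypothetical uncorrelated predictors give the minimum possible VIF.
x1 = [1, 2, 3, 4]
x2 = [1, -1, -1, 1]
print(f"VIF (uncorrelated): {vif_two_predictors(x1, x2):.1f}")
```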

— Each β has a t-test: t = β / SE(β), df = n − k − 1
— Reported automatically in regression output
— Significant slope = predictor contributes to explaining Y given the other predictors in the model
— F-test (overall model): H₀: all slopes = 0
— F = (SS_regression / k) / (SS_residual / (n − k − 1)), where SS_regression = SS_total − SS_residual
— Significant F → at least one predictor is associated with Y
— Partial F-test: does adding a block of predictors significantly improve fit?
— Useful when testing whether interaction terms or a set of dummy variables (e.g., 4 racial/ethnic categories) add value
— 95% CI: β ± t_{0.975, df} × SE(β)
— Excludes 0 ↔ p < 0.05
— Width reflects precision; narrow CI from larger n
— Step 3 vignettes increasingly model real journal output; expect to see β, 95% CI, p-value, R², and adjusted R²
— Multiple testing across many predictors inflates type I error — consider Bonferroni or pre-specification
— Subgroup interaction tests are underpowered; significant subgroup p-values without interaction p-values are misleading
— Reporting only standardized β without raw β obscures clinical units
CCS pearl: When a research-vignette CCS-style item gives you an output table with β = 2.3, 95% CI 0.4 to 4.2, p = 0.02, the correct interpretation is: "Each 1-unit increase in X is associated with a 2.3-unit increase in Y; we are 95% confident the true increase is between 0.4 and 4.2, and this is unlikely due to chance." Do not say "X causes Y" without supporting study design.
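The three numbers in this output table can be checked against one another. The sketch below back-calculates the SE from the reported CI and recovers an approximate p-value; it assumes a large-sample normal approximation, because the vignette does not give the degrees of freedom.

```python
from statistics import NormalDist

# Values reported in the vignette's output table.
beta = 2.3
ci_low, ci_high = 0.4, 4.2

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)  # about 1.96 (normal approximation, an assumption)

# A 95% CI spans 2 * z_crit standard errors, so the SE can be back-calculated.
se = (ci_high - ci_low) / (2 * z_crit)
z = beta / se
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

print(f"SE ~ {se:.2f}, z ~ {z:.2f}, two-sided p ~ {p_two_sided:.3f}")
```

The recovered p-value rounds to the reported 0.02, so the estimate, interval, and p-value in the table are mutually consistent.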

— Slope estimates become unstable; CIs widen
— Normality of residuals matters more (central limit theorem can't rescue inference)
— Consider bootstrap CIs rather than t-based CIs
— Length of stay, costs, biomarker concentrations are right-skewed
— Untransformed linear regression violates normality of residuals and often homoscedasticity
— Solutions:
— Log transformation of Y (interpret β as approx % change per unit X)
— Generalized linear models (gamma, negative binomial)
— Quantile regression for median rather than mean
— Residual spread changes across X (e.g., variance of cholesterol grows with age)
— Coefficients remain unbiased, but SEs and CIs are wrong
— Fix with robust (sandwich) standard errors or weighted least squares
— Standard linear regression assumes independence; violation underestimates SEs
— Use mixed-effects models or GEE for repeated measures, multi-site, or family data
— Subgroup analyses should pre-specify interaction terms rather than post-hoc stratified slopes
— Slopes derived from subgroups have less power and wider CIs
Board pearl: When a Step 3 stem highlights that hospital cost data are "right-skewed with a long tail," and a researcher uses ordinary least-squares linear regression on raw costs, the predictable finding is non-normal residuals and biased inference — the right move is log transformation (and back-transformation for reporting) or a gamma GLM.
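When Y is log-transformed with the natural log, back-transformation turns a coefficient β into a percent change of 100 × (e^β − 1), which 100 × β only approximates. A minimal sketch, using hypothetical coefficient values and an illustrative function name:

```python
import math


def percent_change_per_unit(beta_log_y):
    """Exact % change in Y per 1-unit X when the model is ln(Y) = b0 + beta*X."""
    return 100 * (math.exp(beta_log_y) - 1)


# Hypothetical coefficients, chosen to show where the shortcut breaks down.
for beta in (0.02, 0.10, 0.50):
    approx = 100 * beta
    exact = percent_change_per_unit(beta)
    print(f"beta={beta:.2f}: shortcut {approx:.1f}%  exact {exact:.1f}%")
```

The shortcut is close for small coefficients and drifts as β grows, so larger coefficients are better reported with the exact back-transformation.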

— Outcomes (height, weight, head circumference) are inherently nonlinear with age
— Use splines, polynomial regression, or LMS method rather than simple linear regression
— Age-centering or z-score transformation aids interpretability
— Outcomes (BP, hemoglobin, fundal height) change predictably by gestational age
— Repeated measures within the same woman require mixed-effects (random-intercept) models — a hierarchical extension of linear regression
— Including race as a dummy variable estimates average differences after adjustment — does not establish biological cause
— Modern epidemiology favors modeling upstream determinants (income, access, exposure) when feasible
— If biology differs (e.g., HDL by sex), include an interaction term: sex × predictor, or fit sex-stratified models
— Pre-specify in the analysis plan to avoid data dredging
— Linear assumptions may fail at extremes (e.g., BP-mortality is U-shaped in elderly) — model with splines or polynomial terms
— Truncated ranges reduce external validity of slope estimates
— Often log-transform dose; slope on log-dose = change in response per fold-increase in dose
Key distinction: A stratified analysis fits separate models in subgroups and visually compares slopes; an interaction term in a single combined model formally tests whether subgroup slopes differ and provides a p-value for effect modification. Step 3 prefers interaction testing for claims about "different effects in men vs women."

— Regression coefficients describe association conditional on covariates included; unmeasured confounding can still bias estimates
— Causal inference requires design (RCT) or methods (instrumental variables, propensity scores, target trial emulation)
— Too many predictors relative to n → model fits noise; R² inflates, but performance on new data collapses
— Detect with cross-validation or hold-out test sets; use adjusted R² for model comparison
— A regression at the population level (e.g., per-country sodium vs BP) doesn't imply individual-level effects of the same size
— Linear regression doesn't enforce direction; Y could be causing X
— Studying only severe cases restricts the range of X or Y: a restricted X range lowers R² and widens the slope's CI, and selection on Y attenuates the slope
— A single high-leverage point can flip the sign of a slope
— Always inspect residuals, leverage, and Cook's distance
— Complete-case analysis is biased if missingness depends on X or Y
— Use multiple imputation when missing at random
— Testing 20 predictors at α = 0.05 yields about 1 false positive on average when none is truly associated with Y
— Trying multiple model specifications until p < 0.05 inflates false-positive rate
— Pre-registration and analysis plans mitigate this
Board pearl: When a study reports an enormous R² (>0.95) on a small clinical dataset with many predictors, suspect overfitting, not a brilliant model. Ask for out-of-sample validation or adjusted R²; without those, the finding likely won't replicate.
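The multiple-testing arithmetic above can be sketched as follows, under the simplifying assumptions that the 20 tests are independent and that every null hypothesis is true. The variable names are illustrative.

```python
alpha = 0.05
n_tests = 20

# Expected number of false positives if every null hypothesis is true.
expected_false_positives = alpha * n_tests

# Chance of at least one false positive, assuming independent tests.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni: test each predictor at alpha / n_tests instead.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {prob_at_least_one:.2f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```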

— Outcome is binary → logistic regression (odds ratio per unit X)
— Outcome is time-to-event with censoring → Cox proportional hazards (HR per unit X)
— Outcome is a count (events/time) → Poisson or negative binomial regression
— Outcome is ordinal with many categories → ordinal logistic regression
— Outcome is bounded (proportion, 0–1) → beta regression or logistic transform
— Curved residual pattern → add polynomial terms (X², X³) or use restricted cubic splines
— Threshold/saturation effects (e.g., dose-response plateau) → piecewise or nonlinear models
— Same patient measured over time → linear mixed-effects (random intercepts/slopes) or GEE
— Patients nested in hospitals → multilevel models with hospital random effect
— Use robust regression (M-estimators) or rank-based methods
— Add internal validation (bootstrap, k-fold CV)
— Consider regularized regression (LASSO, ridge, elastic net) for high-dimensional predictors
— Evaluate with RMSE, MAE, calibration plots — not just R²
— Bayesian regression with informative priors stabilizes estimates
Step 3 management: If a stem describes researchers using linear regression to predict 30-day mortality (binary) from APACHE II score, the correct critique and recommendation is: outcome is binary, so logistic regression should be used, reporting odds ratios and a C-statistic — not slope and R².

— Measures strength and direction of linear association; unitless, −1 to +1
— In simple linear regression, r² = R²
— Tells you how tightly points cluster around the line, not the slope's magnitude
— Nonparametric; uses ranks; robust to outliers and nonlinearity (monotonic)
— Use when distributions are skewed or relationship is monotonic but not linear
— Simple: one X, slope = unadjusted association
— Multiple: multiple Xs, each slope = adjusted association holding others fixed
— Special case of linear regression where all predictors are categorical
— F-test in ANOVA = F-test of the regression model
— One-way ANOVA = linear regression with one categorical predictor (k−1 dummies)
— Linear regression with one categorical predictor (group) and continuous covariates
— Reports adjusted group means and tests group difference controlling for covariates
— Equivalent to linear regression with a single binary predictor
— Slope = mean difference between groups
— Linear in coefficients but allows nonlinear curves in X via X², X³ terms
Key distinction: Correlation quantifies how points hug a line (tightness, direction) but never gives a clinically usable equation. Regression gives you both an equation (slope, intercept) for prediction and a measure of fit (R²). On Step 3, if the question asks "how much does Y change per unit X" the answer is regression slope; if it asks "how strong is the relationship" with no units, the answer is correlation r.
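One way to see the slope versus r distinction is to change the units of X: the slope rescales with the units, while r does not move. The sodium and SBP values below are hypothetical, and `slope_and_r` is an illustrative helper.

```python
import math


def slope_and_r(x, y):
    """Return (least-squares slope, Pearson r) for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
sodium_mg = [1500, 2000, 2500, 3000, 3500]
sbp_mmhg = [122, 131, 134, 142, 146]

slope_per_mg, r_mg = slope_and_r(sodium_mg, sbp_mmhg)

# Same exposure expressed in grams instead of milligrams.
sodium_g = [v / 1000 for v in sodium_mg]
slope_per_g, r_g = slope_and_r(sodium_g, sbp_mmhg)

print(f"slope: {slope_per_mg:.4f} mmHg/mg vs {slope_per_g:.1f} mmHg/g")
print(f"r:     {r_mg:.3f} vs {r_g:.3f}")
```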

— Binary outcomes (event/no event)
— Coefficients = log-odds ratios; exponentiate to get OR
— Equivalent "fit" measures: pseudo-R² (Nagelkerke, McFadden), C-statistic, Hosmer-Lemeshow
— Time-to-event with censoring (survival)
— Coefficients = log hazard ratios; exponentiate to get HR
— Assumes proportional hazards over follow-up
— Count outcomes (admissions/year, falls/month)
— Slope = log rate ratio; exponentiate to get rate ratio
— Negative binomial preferred when count data are overdispersed
— Umbrella that includes linear, logistic, Poisson via different link functions
— Linear regression = GLM with identity link and Gaussian distribution
— Models intrinsically nonlinear functions (e.g., Michaelis-Menten, exponential decay in PK)
— Different from polynomial regression, which is linear in parameters
— Capture nonlinearity and interactions automatically
— Trade interpretability for predictive performance
— Report variable importance, partial dependence, and out-of-sample RMSE — not β or R² in the classic sense
— For nested or repeated-measures data
— Includes random intercepts and/or random slopes
Board pearl: A vignette describing patients followed over time for hospital readmission within 90 days (yes/no, with some censored due to death) should trigger survival analysis (Cox) — not linear regression on "days to readmission," because censoring is ignored by ordinary linear regression and biases the slope.
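For the logistic, Cox, and Poisson models in this section, the reported ratio comes from exponentiating the coefficient and its CI limits. The sketch below uses a hypothetical coefficient and SE (not taken from any study), an illustrative function name, and a 1.96 large-sample critical value.

```python
import math


def ratio_with_ci(coef, se, z_crit=1.96):
    """Exponentiate a log-scale coefficient and its CI limits (OR, HR, or rate ratio)."""
    point = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return point, lower, upper


# Hypothetical logistic coefficient per 1-point increase in a severity score.
odds_ratio, or_low, or_high = ratio_with_ci(coef=0.18, se=0.04)
print(f"OR = {odds_ratio:.2f} (95% CI {or_low:.2f} to {or_high:.2f})")
```

On the ratio scale the null value is 1 rather than 0, so an interval that excludes 1 corresponds to a log-scale interval that excludes 0.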

— Sample size, number and choice of predictors with rationale
— Coefficients (β) with 95% CIs and p-values for each predictor
— Intercept β₀ (with centering noted if applied)
— R² and adjusted R²
— Model diagnostics: residual plots, tests of assumptions, VIF
— Handling of missing data
— Internal validation: bootstrap, cross-validation on derivation sample
— External validation: apply model to a new dataset; report calibration slope (ideally ~1) and discrimination
— Recalibration (adjust intercept), revision (re-estimate slopes), or refitting if population drifts
— Convert to scaled increments patients understand: "per 10-pound weight loss, your SBP drops by ~5 mmHg on average"
— Pair point estimate with uncertainty range (CI)
— Nomograms and EHR calculators are commonly underpinned by linear (or logistic) regression
— Tools should display prediction intervals, not just point predictions, when used for individual counseling
Step 3 management: When deploying a regression-based risk calculator in clinic, confirm that the derivation cohort matches your patient population (age, sex, comorbidity), that the tool has been externally validated, and that you communicate the uncertainty interval rather than a single number — analogous to discharge medication counseling on expected effect size and variability.

— Population characteristics change (aging, new therapies, changing risk-factor prevalence)
— Measurement methods evolve (new lab assays, imaging modalities)
— Care patterns shift (treatment effects alter outcome distributions)
— Recompute calibration (predicted vs observed means) in current data
— Compare current R² and residual SD to derivation values
— Recheck assumptions: linearity, homoscedasticity, normality of residuals
— Recalibration in the large: shift intercept to match new mean outcome
— Recalibration of slope: rescale all coefficients by a single factor
— Full refitting: re-estimate all coefficients on new data
— Track model version, derivation date, validation studies
— Required for FDA-regulated clinical decision software (Software as a Medical Device)
— Emphasize that a regression equation provides a population-based estimate; an individual may fall above or below
— Reinforce that an R² well below 1 means substantial unexplained variability, so individual outcomes remain uncertain
— Use regression coefficients to identify modifiable predictors to target
— Track outcome trends with regression on time (slope = rate of change over months/years)
Board pearl: When a clinic's diabetes management dashboard uses a regression model predicting HbA1c trajectory, periodic calibration audits (predicted vs observed HbA1c at 6 months) are the right monitoring step — analogous to following BP after starting an antihypertensive. If observed values systematically exceed predicted, the model needs recalibration, not abandonment.
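Recalibration in the large can be sketched as a single intercept shift equal to the mean of observed minus predicted values. The HbA1c values, the original intercept of 5.0, and the function name below are hypothetical placeholders for illustration.

```python
def recalibrate_in_the_large(intercept, predicted, observed):
    """Shift the intercept so the mean prediction matches the mean observed outcome."""
    shift = sum(observed) / len(observed) - sum(predicted) / len(predicted)
    return intercept + shift, shift


# Hypothetical 6-month HbA1c values (%): model predictions vs what was observed.
predicted = [7.0, 7.4, 8.1, 6.8]
observed = [7.3, 7.9, 8.4, 7.2]

new_intercept, shift = recalibrate_in_the_large(5.0, predicted, observed)
recalibrated = [p + shift for p in predicted]

print(f"intercept shift: {shift:+.3f}")
print(f"new intercept:   {new_intercept:.3f}")
```

The slopes are left untouched here; if observed and predicted values diverge more at one end of the range than the other, slope recalibration or refitting would be the next step.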

— Regression models trained on non-representative data can systematically misestimate outcomes in underrepresented groups (race, sex, age, insurance status)
— Including race as a predictor raises ethical concerns when used to allocate care (e.g., historic eGFR equations); current guidance favors race-neutral equations
— Participants must understand how their data will feed predictive models
— Secondary use of clinical data for model derivation may require IRB review and a waiver of authorization under HIPAA
— Clinicians retain responsibility for decisions even when guided by a regression-based calculator
— Documentation should reflect clinical judgment, not just algorithm output
— Patients have a right to know when a prediction influencing their care comes from an algorithm
— "Black-box" models with hidden coefficients are increasingly disfavored in high-stakes clinical use
— Researchers/sponsors selecting predictors or transformations to favor a desired finding (p-hacking) is a research integrity violation
— Pre-specified statistical analysis plans mitigate this
— A risk score generated in inpatient settings (e.g., readmission prediction) must be communicated clearly to outpatient teams; misinterpretation of probabilities can drive over- or under-treatment
— Significant predictor relationships that imply public health risks (e.g., environmental exposure–disease associations) may trigger reporting obligations to public health authorities
Step 3 management: A primary care clinician using an EHR-embedded regression model to predict cardiovascular risk should: (1) verify the model is validated in the patient's demographic, (2) share the predicted risk and its uncertainty with the patient as part of shared decision-making, (3) document that the recommendation reflects integrated clinical judgment — not solely the algorithm output — to satisfy both ethical and medico-legal standards.

Key distinction: Memorize the slope vs R² split: slope tells you the magnitude of effect, R² tells you how much variability is explained. Step 3 distractors deliberately swap these — pick wording that respects each metric's specific meaning.

— Stem: "In a study of 500 adults, the regression equation was: LDL = 80 + 1.2 × (saturated fat g/day). Best interpretation of 1.2?"
— Answer: "For every additional 1 gram of daily saturated fat intake, LDL increases by 1.2 mg/dL on average."
— Stem: "Best interpretation of the intercept of 80?"
— Answer: "Predicted LDL when saturated fat intake is 0 g/day" (note the biological implausibility)
— Stem: "The model had R² = 0.18."
— Answer: "18% of the variability in LDL is explained by saturated fat intake" — not "18% of patients" and not "r = 0.18"
— Stem: "Slope = 0.0004 mmHg per mg sodium (0.4 mmHg per 1,000 mg), p < 0.001, n = 50,000"
— Answer: Real but small effect at typical intake; significance driven by large n
— Stem: Researchers use linear regression to predict 30-day mortality (yes/no)
— Answer: Use logistic regression for binary outcomes
— Stem: Crude slope changes substantially after adjusting for age
— Answer: Age was a confounder; adjusted coefficient is preferred
— Stem: Adding waist circumference to a BMI model balloons SEs
— Answer: Multicollinearity; drop one redundant predictor
— Stem: Model built in adults applied to adolescents
— Answer: Extrapolation invalid; refit in the target population
— Stem: Counseling one patient using model output
— Answer: Use prediction interval, not confidence interval
Board pearl: Whenever a stem provides a regression equation, immediately label β₀ (intercept), β₁ (slope), and identify units. Whenever it provides R², restate it as "% of variance in Y explained." These two reflexes solve the majority of linear regression Step 3 items.

Linear regression fits Y = β₀ + β₁X + ε to a continuous outcome: the slope (β₁) gives the average change in Y per unit X, the intercept (β₀) gives predicted Y when X = 0, and R² gives the proportion of variance in Y explained — none of which alone imply causation, clinical importance, or accurate individual prediction.
Board pearl: On Step 3, the highest-yield reflex is to separate three concepts that distractors deliberately blur: slope = magnitude, p-value/CI = chance, and R² = variance explained — answer every linear regression item by naming which one the stem is actually asking about, then translate the number into plain clinical English with correct units and an explicit reminder that association ≠ causation without supporting study design.

