Biostatistics & Population Health
ANOVA: one-way and repeated measures interpretation
— One-way ANOVA: 1 categorical independent variable (factor) with ≥3 levels, 1 continuous dependent variable
— Example: Mean HbA1c across patients on metformin vs sulfonylurea vs GLP-1 agonist vs placebo
— Repeated-measures ANOVA: same subjects measured ≥3 times or under ≥3 conditions (within-subject design)
— Example: Mean systolic BP at baseline, 6 weeks, 12 weeks in the same hypertensive cohort
— Each t-test at α=0.05 carries 5% type I error; doing 6 pairwise tests inflates family-wise error to ~26%
— ANOVA preserves overall α at 0.05 with a single omnibus F-test
— F = (between-group variance) / (within-group variance)
— Large F → groups differ more than random noise within groups → reject H₀ that all means are equal
Board pearl: If a stem says "researchers compared mean values across four treatment arms," the answer is one-way ANOVA, not multiple t-tests, not chi-square. Chi-square is for categorical outcomes; t-test is for 2 groups only. Recognizing the structure of the data (number of groups, type of outcome, independent vs paired) is the entire game for Step 3 biostats questions.
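The ~26% figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the pairwise tests are independent (an approximation, since pairwise comparisons share groups); the function name is illustrative.

```python
from math import comb

def familywise_error(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_groups = 4
n_pairs = comb(n_groups, 2)              # number of pairwise comparisons among 4 groups
fwer = familywise_error(0.05, n_pairs)
print(f"{n_pairs} pairwise tests -> family-wise error of about {fwer:.1%}")
```

With four arms there are six possible pairs, which is where the "6 pairwise tests" in the notes comes from.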

— "Three (or more) groups" + "mean," "average," or a continuous variable → ANOVA territory
— "Same patients followed over time" or "measured before, during, and after" → repeated-measures ANOVA
— "Independent groups" or "randomized to one of four arms" → one-way ANOVA
— RCT comparing mean LDL reduction across atorvastatin, rosuvastatin, simvastatin, and placebo (one-way)
— Pain scores recorded at 0, 30, 60, 90 minutes after analgesia in the same ED cohort (repeated-measures)
— Mean weight loss across 5 bariatric procedure types in different patient cohorts (one-way)
— Same COPD patients' FEV1 measured pre-bronchodilator, 1 hr post, and 24 hr post (repeated-measures)
— Number of groups stated explicitly ("four arms," "three diets")
— Outcome described with words like mean, average, change in, level of
— Phrase "same subjects" or "each participant served as their own control" = repeated measures
— Phrase "randomly assigned" to distinct cohorts = independent groups → one-way
— "Proportion who responded" → categorical → chi-square, not ANOVA
— "Survival time" → log-rank/Cox, not ANOVA
— "Two groups only" → t-test, not ANOVA
Key distinction: Independence of observations is what separates one-way from repeated-measures ANOVA. If a single patient contributes one data point → one-way. If a single patient contributes multiple data points across time/conditions → repeated-measures. Mis-applying one-way ANOVA to paired data violates independence assumptions and inflates type I error — a classic Step 3 "what's wrong with this analysis" stem.

— Independence of observations across and within groups (no patient in two arms; no clustering)
— Normality of the outcome within each group (or large enough n per group via CLT, typically ≥30)
— Homogeneity of variance (homoscedasticity) — variances roughly equal across groups; tested via Levene's test or Bartlett's test
— Continuous (interval/ratio) dependent variable
— Categorical independent variable with ≥3 levels
— Sphericity — variances of the differences between all pairs of within-subject conditions are equal
— Tested via Mauchly's test; if violated, apply Greenhouse-Geisser or Huynh-Feldt correction to degrees of freedom
— Look at group sample sizes — balanced designs are robust; very unbalanced designs amplify variance violations
— Examine standard deviations — if largest SD is >2× smallest SD, homogeneity is suspect
— Plot residuals; check histograms for skew
— Severe non-normality with small n → Kruskal-Wallis (non-parametric one-way analog)
— Repeated measures with non-normal data → Friedman test (non-parametric repeated-measures analog)
— Unequal variances → Welch's ANOVA
Board pearl: A Step 3 stem describing pain scores (ordinal 0–10) across 3 groups with skewed distributions and n=12 per arm should NOT use one-way ANOVA — the correct answer is Kruskal-Wallis. Recognize when ordinal or non-normal data invalidate parametric ANOVA.
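The "largest SD more than 2× the smallest SD" screen from the assumption checks above can be sketched as follows. The data are made-up values for three hypothetical arms, and the helper name is an assumption; this is a quick screen, not a substitute for Levene's test.

```python
from statistics import stdev

def sd_ratio(groups):
    """Ratio of the largest to the smallest group standard deviation."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)

# Hypothetical outcome values for three treatment arms (made-up numbers)
arms = [
    [5.0, 6.0, 7.0, 8.0, 9.0],
    [4.0, 6.0, 8.0, 10.0, 12.0],
    [1.0, 6.0, 11.0, 16.0, 21.0],
]
ratio = sd_ratio(arms)
suspect = ratio > 2          # rule of thumb from the notes above
print(f"SD ratio = {ratio:.2f}; homogeneity of variance suspect: {suspect}")
```

A ratio this large would point toward Welch's ANOVA or a non-parametric alternative rather than the classical F-test.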

— F = MS_between / MS_within
— MS_between (mean square between groups) = variance attributable to group membership
— MS_within (mean square within groups, aka error) = random variability among subjects in the same group
— Under H₀ (all means equal), F ≈ 1; under H₁, F >> 1
— df_between = k − 1 (k = number of groups)
— df_within = N − k (N = total sample size)
— Example: 4 groups, 20 patients each → df = (3, 76)
— p < 0.05 → reject H₀ → at least one group mean differs from at least one other
— p ≥ 0.05 → fail to reject; conclude insufficient evidence of any difference
— A significant F does not tell you which groups differ — that requires post-hoc testing
— Total SS = Between-group SS + Within-group SS (one-way)
— Repeated-measures: Total SS = Between-subjects SS + Within-subjects SS (treatment + error) — subject-level variance is removed from error, increasing power
— e.g., F(2, 87) = 6.42, p = 0.003
Step 3 management: When a vignette shows F(3, 96) = 4.81, p = 0.004, the correct next step is post-hoc pairwise comparisons (Tukey HSD most common) — not concluding which specific drug is best, and not running individual t-tests without correction. Recognize that the omnibus test is a screen; post-hoc is the confirmatory localization.
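The degrees-of-freedom rules above can be run in both directions: forward from the design, and backward from a reported F(df1, df2) to recover the number of groups and total sample size. A small sketch with illustrative function names, assuming a one-way design:

```python
def anova_df(k: int, n_total: int) -> tuple[int, int]:
    """Degrees of freedom for a one-way ANOVA: (k - 1, N - k)."""
    return k - 1, n_total - k

def design_from_df(df_between: int, df_within: int) -> tuple[int, int]:
    """Invert a reported df pair to recover (number of groups, total N)."""
    k = df_between + 1
    return k, df_within + k

print(anova_df(k=4, n_total=80))     # 4 groups of 20 patients each
print(design_from_df(3, 96))         # the F(3, 96) vignette: groups and total N
```

Reading F(3, 96) this way implies 4 groups and 100 subjects, which is a quick sanity check against the design described in a stem.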

— Tukey HSD: most commonly tested on Step 3
— Used when all pairwise comparisons are of interest
— Balances type I error and power well in balanced designs
— Bonferroni: divides α by number of comparisons (α/m)
— Conservative; reduces power; simple to compute
— Best when only a few pre-planned comparisons
— Scheffé: most conservative; allows complex contrasts (e.g., average of groups A+B vs C)
— Lower power, but flexible
— Dunnett: specifically for comparing multiple treatments against a single control
— Ideal in placebo-controlled multi-arm RCTs
— Holm and related step-down procedures: more powerful than straight Bonferroni
— η² (eta-squared) = SS_between / SS_total; proportion of variance explained
— Partial η² for repeated measures
— Cohen's f: small 0.10, medium 0.25, large 0.40
Key distinction: Bonferroni vs Tukey — Bonferroni is a generic multiple-comparison correction applicable to any set of tests; Tukey is specifically designed for all pairwise comparisons of group means after ANOVA and is typically more powerful for that exact purpose. On Step 3, if the stem says "researchers wanted to compare every group with every other group," Tukey is the cleanest answer.
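The two generic corrections described above, Bonferroni and the Holm step-down procedure, are simple enough to sketch directly. The raw p-values are made-up, and the function names are illustrative. Tukey HSD is not shown because it needs the studentized range distribution, which is not in the Python standard library.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each p by m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values from three pairwise comparisons (made-up)
raw = [0.010, 0.040, 0.030]
bonf = bonferroni(raw)
holm_adj = holm(raw)
print("Bonferroni:", [round(p, 3) for p in bonf])
print("Holm:      ", [round(p, 3) for p in holm_adj])
```

Holm never produces a larger adjusted p-value than Bonferroni for the same comparison, which is the sense in which it is more powerful.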

— Outcome continuous and approximately normal → proceed with ANOVA family
— Outcome ordinal/skewed with small n → Kruskal-Wallis (independent) or Friedman (repeated)
— One factor with ≥3 levels → one-way ANOVA
— Two factors (e.g., drug × sex) → two-way ANOVA, which also tests interaction effects
— Three+ factors → factorial ANOVA
— Independent → one-way or two-way ANOVA
— Same subjects across conditions/time → repeated-measures ANOVA
— Mix of between- and within-subject factors → mixed-design ANOVA (e.g., drug A vs B, measured at 3 timepoints)
— Add continuous covariates → ANCOVA (analysis of covariance); useful for adjusting for baseline values, age, BMI
— MANOVA (multivariate ANOVA) — e.g., simultaneous comparison of systolic BP, diastolic BP, and heart rate across groups
— Sample size per group should typically be ≥10–20 for ANOVA to be robust
— Repeated measures designs gain power by removing between-subject variance — fewer subjects needed than equivalent one-way design
Board pearl: A two-way ANOVA's most powerful feature is the interaction term. If a stem asks "does the effect of the drug differ by sex?", the answer requires two-way ANOVA with interaction, not two separate one-way ANOVAs. The interaction p-value answers the differential-effect question directly.
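The decision path above can be written as a small lookup function. This is only a sketch of the notes' logic; the function name, argument names, and the three design labels are assumptions made for illustration.

```python
def choose_test(parametric_ok: bool, n_factors: int, design: str) -> str:
    """Pick a test from three stem features.

    parametric_ok: outcome is continuous and approximately normal
    n_factors:     number of categorical factors (each with its own levels)
    design:        "independent", "repeated", or "mixed"
    """
    if design not in {"independent", "repeated", "mixed"}:
        raise ValueError(f"unknown design: {design!r}")
    if n_factors < 1:
        raise ValueError("need at least one factor")
    if not parametric_ok:
        if design == "mixed":
            raise ValueError("the notes give no non-parametric option for mixed designs")
        return "Friedman" if design == "repeated" else "Kruskal-Wallis"
    if design == "repeated":
        return "repeated-measures ANOVA"
    if design == "mixed":
        return "mixed-design ANOVA"
    if n_factors == 1:
        return "one-way ANOVA"
    if n_factors == 2:
        return "two-way ANOVA"
    return "factorial ANOVA"

print(choose_test(True, 1, "independent"))
print(choose_test(True, 2, "independent"))
print(choose_test(False, 1, "repeated"))
```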

— SS_B (between-group sum of squares): captures how far each group mean deviates from the grand mean, weighted by group size
— SS_W (within-group sum of squares): captures residual variability within each group
— MS_B = SS_B / (k − 1)
— MS_W = SS_W / (N − k)
— 3 drugs, 10 patients each, comparing mean LDL reduction
— Group means: 30, 45, 50 mg/dL; grand mean 41.67
— From the group means: SS_B = 10 × [(30 − 41.67)² + (45 − 41.67)² + (50 − 41.67)²] ≈ 2167; suppose SS_W = 1620
— MS_B = 2167/2 ≈ 1083.3; MS_W = 1620/27 = 60
— F = 1083.3/60 ≈ 18.1; df = (2, 27); p < 0.001 → reject H₀
Step 3 management: When given ANOVA output in a table, focus on three numbers: the F-statistic, the p-value, and the df pair. Confirm df_between = k−1 matches the number of groups described. A mismatch (e.g., stem says 4 drugs but df_between = 2) signals the analysis was misapplied or you're misreading the design.
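To see the sums-of-squares mechanics end to end, here is a minimal sketch that computes a one-way ANOVA table from raw values. The dataset is a tiny made-up example (not the LDL example above), and the function name is illustrative. The p-value is left out because the standard library has no F distribution; a statistics package would be needed for that step.

```python
from statistics import mean

def one_way_anova(groups):
    """Return sums of squares, df, mean squares and F for lists of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Between-group SS: squared distance of each group mean from the grand mean, weighted by n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group SS: residual variability around each group's own mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df": (df_between, df_within),
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

# Made-up data: three groups of three observations each
table = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(table)
```

For this toy dataset the group means are 2, 3, and 6, giving SS_B = 26, SS_W = 6, df = (2, 6), and F = 13.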

— Total SS = Between-subjects SS + Within-subjects SS
— Within-subjects SS is further split into treatment/time SS and error SS
— Removing between-subject variability from the error term increases statistical power — a major advantage when subjects vary widely at baseline
— Variances of the differences between every pair of repeated conditions must be approximately equal
— Violations are common when measurements are closer in time (autocorrelation)
— Mauchly's test screens for sphericity; p < 0.05 indicates violation
— Greenhouse-Geisser ε — multiplies df by ε (more conservative)
— Huynh-Feldt ε — used when GG ε > 0.75 (less conservative)
— Both shrink df, raising the critical F threshold and protecting type I error
— DV: systolic BP; within-subject factor: time (4 levels)
— Output: F(3, 72) = 14.2, p < 0.001 → mean BP differs across at least one timepoint pair
— Post-hoc paired comparisons with Bonferroni identify which timepoints differ from baseline
CCS pearl: In longitudinal clinical research (post-MI ejection fraction at 1, 3, 6, 12 months in the same cohort), repeated-measures ANOVA — not separate one-way ANOVAs at each timepoint — is the correct framework. Separate analyses ignore within-subject correlation and waste power.
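The variance partition described above, where subject-level variability is pulled out of the error term, can be illustrated on a tiny made-up dataset. This sketch assumes complete data (every subject measured under every condition) and does not apply any sphericity correction; the function name is illustrative.

```python
from statistics import mean

def repeated_measures_anova(data):
    """data[i][j] = value for subject i under condition j (complete data only)."""
    n, k = len(data), len(data[0])
    grand = mean([x for row in data for x in row])
    cond_means = [mean([row[j] for row in data]) for j in range(k)]
    subj_means = [mean(row) for row in data]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)   # treatment/time effect
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)   # between-subjects
    ss_error = ss_total - ss_cond - ss_subj                   # what remains after removing both

    df_cond, df_error = k - 1, (n - 1) * (k - 1)
    f_rm = (ss_cond / df_cond) / (ss_error / df_error)
    # Same data analysed (incorrectly) as if the columns were independent groups
    f_naive = (ss_cond / df_cond) / ((ss_total - ss_cond) / (n * k - k))
    return {"ss_subj": ss_subj, "ss_cond": ss_cond, "ss_error": ss_error,
            "df": (df_cond, df_error), "F_rm": f_rm, "F_naive": f_naive}

# Made-up data: 3 subjects (rows) measured at 3 timepoints (columns)
result = repeated_measures_anova([[1.0, 2.0, 3.0],
                                  [2.0, 3.0, 4.0],
                                  [3.0, 4.0, 8.0]])
print(result)
```

Here the repeated-measures F is 7 on (2, 4) df, while the naive independent-groups F is about 2.33 on (2, 6) df, because the between-subjects variability stays in the naive error term.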

— With small samples per group (roughly n < 30), the Central Limit Theorem cannot rescue the normality assumption
— Shapiro-Wilk test on residuals helps confirm; visual inspection of Q-Q plots
— If non-normal → use Kruskal-Wallis test (rank-based, non-parametric one-way analog)
— Kruskal-Wallis tests whether medians/distributions differ across ≥3 independent groups; post-hoc via Dunn's test
— Use Friedman test (non-parametric repeated-measures analog)
— Post-hoc: Wilcoxon signed-rank pairwise comparisons with Bonferroni
— Welch's ANOVA adjusts degrees of freedom and is robust to variance inequality
— Post-hoc: Games-Howell test
— Type III sums of squares preferred over Type I in regression-based ANOVA frameworks
— Power skewed toward larger groups
— ANOVA F-test is sensitive to extreme values
— Investigate clinically (data entry error vs true biologic outlier)
— Sensitivity analyses excluding outliers can demonstrate robustness
— Ordinal outcomes: strictly, ANOVA is inappropriate; non-parametric alternatives or ordinal regression are correct
— Many published studies use ANOVA anyway when scales have ≥7 levels and look continuous — common Step 3 critique
Key distinction: Kruskal-Wallis vs one-way ANOVA — same study question (≥3 group comparison), different data types. Kruskal-Wallis when outcome is ordinal, skewed, or n small. ANOVA when continuous, normal, and adequately powered. A stem describing "physician-rated symptom severity (mild/moderate/severe)" across 4 treatment arms should trigger Kruskal-Wallis, not ANOVA.
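For intuition about what "rank-based" means, the Kruskal-Wallis H statistic can be computed by hand. This is a minimal sketch with made-up data; it uses average ranks for ties but omits the tie-correction factor, and the function names are illustrative. For larger samples H is usually compared with a chi-square distribution on k − 1 degrees of freedom.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of positions i..j as 1-based ranks
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic (no tie correction) for lists of observations."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Made-up scores for three independent groups
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H = {h_stat:.2f}")
```

Only the ordering of the observations matters here, which is why the test suits ordinal and skewed outcomes.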

— Randomization at clinic/hospital level, not patient level (e.g., 10 clinics randomized to 1 of 3 quality-improvement strategies)
— Patients within a clinic are correlated → cannot treat each patient as independent
— Correct analysis: mixed-effects (multilevel) models with cluster as a random effect, or GEE with a within-cluster working correlation; not naive one-way ANOVA
— Intra-cluster correlation must be accounted for
— Site can be a fixed or random effect; ignoring it inflates type I error if site effects exist
— Growth curves over time → mixed models or repeated-measures ANOVA preferred over cross-sectional one-way at each visit
— Pre/post-partum measurements on same women → repeated-measures framework
— Dropouts in long follow-up studies violate complete-case assumption of classical repeated-measures ANOVA
— Mixed-effects models handle missing-at-random data better and have largely replaced repeated-measures ANOVA in modern analyses
— Each patient receives multiple treatments in sequence → within-subject design → repeated-measures ANOVA or paired analyses
— Carryover and period effects must be assessed
Board pearl: When a Step 3 stem describes a cluster-randomized trial and analyzes outcomes with one-way ANOVA treating individual patients as independent units, the methodological flaw is ignoring within-cluster correlation, which falsely inflates effective sample size and increases type I error. The correct analysis uses mixed/multilevel models.
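How much clustering "inflates effective sample size" can be sketched with the standard design effect, 1 + (m − 1) × ICC, where m is the cluster size. The formula assumes equal cluster sizes, and the trial numbers below (12 clinics, 50 patients each, ICC = 0.05) are made-up for illustration.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC (equal cluster sizes assumed)."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical trial: 12 clinics, 50 patients each, ICC = 0.05 (all made-up)
deff = design_effect(50, 0.05)
n_eff = effective_sample_size(12 * 50, 50, 0.05)
print(f"design effect = {deff:.2f}; 600 patients behave like about {n_eff:.0f} independent ones")
```

Even a small intra-cluster correlation shrinks the usable information considerably, which is why a naive patient-level ANOVA overstates significance.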

— Inflates family-wise type I error; 6 pairwise tests at α=0.05 → ~26% chance of at least one false positive
— Correct: omnibus ANOVA first, then post-hoc with correction
— Omnibus only says "at least one differs"; without post-hoc, you cannot identify which
— Skipping the Greenhouse-Geisser correction when sphericity is violated → inflated false-positive rate
— Applying one-way ANOVA to repeated measurements violates independence; the correct analysis is repeated-measures ANOVA
— Large N can detect trivial mean differences (e.g., 1 mmHg BP across drugs) — report effect sizes (η², Cohen's f) and confidence intervals
— Running ANOVA on 20 different outcomes without correction → some will be "significant" by chance
— Use MANOVA or Bonferroni-style adjustment across outcomes
— Converting continuous data to categorical (e.g., HbA1c into controlled/uncontrolled) loses information and statistical power
— Running every possible subgroup ANOVA and reporting only significant ones = data dredging (p-hacking); presenting those findings as if hypothesized in advance = HARKing; both mislead
— ANOVA across naturally occurring groups (e.g., smokers vs non-smokers vs ex-smokers) shows association, not causation
Step 3 management: When you see "researchers performed six independent t-tests to compare four treatment groups," the correct critique is family-wise type I error inflation, and the appropriate corrective analysis is one-way ANOVA followed by Tukey HSD (or equivalent). Recognize this pattern instantly.

— ANCOVA example: comparing mean weight loss across 3 diets while adjusting for baseline weight
— ANCOVA improves precision and reduces bias from baseline imbalances
— Two-way ANOVA: tests main effects of each factor and their interaction
— Example: drug (3 levels) × sex (2 levels) on BP: does drug effect differ by sex?
— MANOVA example: simultaneous comparison of SBP, DBP, HR across 4 antihypertensives
— MANOVA controls error across outcomes; follow significant MANOVA with univariate ANOVAs
— Time-to-event outcomes: not ANOVA territory; censoring requires survival methods
— Mixed-effects models: patients nested in clinics, repeated measures with missing data, longitudinal trajectories
CCS pearl: When a study question involves longitudinal data with missing observations, patient-level random effects, or time-varying covariates, escalate from repeated-measures ANOVA to linear mixed-effects models (LMM). LMMs are the modern standard for longitudinal clinical research and are increasingly the "correct answer" on advanced biostats stems.

— One-sample t-test: compares a single sample mean to a known/hypothesized value (e.g., is mean HbA1c in clinic ≠ 7.0%?)
— Independent (two-sample) t-test: compares means of exactly 2 independent groups with continuous, normally distributed outcome
— If only 2 groups → t-test, NOT ANOVA (though one-way ANOVA with 2 groups is mathematically equivalent: F = t²)
— Paired t-test: compares means of 2 measurements in the same subjects (before/after)
— Repeated-measures ANOVA with 2 timepoints is equivalent to the paired t-test
— Welch's t-test: two-group comparison when variances are unequal
— One-way ANOVA: ≥3 independent groups, continuous outcome
— Two-way (factorial) ANOVA: ≥2 factors; tests main effects and interactions
— Repeated-measures ANOVA: ≥3 measurements in same subjects
— Mixed-design ANOVA: combines between-subject and within-subject factors
— ANCOVA: ANOVA + continuous covariate adjustment
— MANOVA: multiple continuous DVs simultaneously
— 1 group vs known value → 1-sample t
— 2 independent groups → independent t-test
— 2 paired measurements → paired t-test
— ≥3 independent groups → one-way ANOVA
— ≥3 paired/repeated measurements → repeated-measures ANOVA
Key distinction: Why not just do an ANOVA on 2 groups? Mathematically it works (F = t²), but t-test is conventional and conceptually clearer. Step 3 expects t-test for 2 groups and ANOVA for ≥3. Choosing the wrong category (e.g., ANOVA when only 2 groups exist) is a flagged error pattern.
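The F = t² equivalence can be verified numerically. This minimal sketch uses made-up data for two groups and the pooled-variance (Student's) t-test; the equivalence does not hold for Welch's unequal-variance version. Function names are illustrative.

```python
from math import sqrt
from statistics import mean

def pooled_t(a, b):
    """Student's t for two independent groups, assuming equal variances."""
    na, nb = len(a), len(b)
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

def one_way_f(groups):
    """F statistic for a one-way ANOVA on lists of values."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n_total = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up data for two groups
group_a = [1.0, 2.0, 3.0]
group_b = [3.0, 4.0, 5.0]
t_stat = pooled_t(group_a, group_b)
f_stat = one_way_f([group_a, group_b])
print(f"t = {t_stat:.3f}, t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```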

— Kruskal-Wallis: non-parametric one-way ANOVA analog
— Use for: ordinal outcomes, skewed continuous data, small samples
— Tests whether distributions/medians differ across ≥3 independent groups
— Post-hoc: Dunn's test with Bonferroni
— Friedman: non-parametric repeated-measures ANOVA analog
— ≥3 paired/repeated measurements with ordinal or non-normal data
— Post-hoc: pairwise Wilcoxon signed-rank with Bonferroni
— Mann-Whitney U (Wilcoxon rank-sum): non-parametric two-group comparison (analog of independent t-test)
— Wilcoxon signed-rank: non-parametric paired comparison (analog of paired t-test)
— Chi-square: categorical outcome × categorical predictor; e.g., proportion cured across 3 treatment arms
— NOT for continuous outcomes
— Fisher's exact: chi-square alternative when expected cell counts <5
— McNemar: paired categorical data (e.g., comparing positivity rates of two diagnostic tests on the same patients)
— Cochran's Q: extension of McNemar to ≥3 paired binary outcomes
— Log-rank: compares survival curves across ≥2 groups
— Linear regression: models continuous outcome on continuous or categorical predictors; ANOVA is a special case of linear regression where predictors are categorical
Board pearl: "Proportion of patients achieving remission across 4 chemotherapy regimens" → chi-square, not ANOVA. ANOVA's outcome must be continuous. If the outcome is count of events, yes/no, cured/not cured, or survival time, you're outside ANOVA's domain. Distinguishing outcome type (continuous vs categorical vs time-to-event) is the single most testable biostats concept on Step 3.

— Define hypotheses, primary outcome, and analysis plan before data collection
— Distinguish primary from secondary/exploratory analyses
— Prevents p-hacking and HARKing
— Report group means, SDs, and Ns before inferential tests
— Include sample size per group; flag any imbalances
— Normality (Shapiro-Wilk, Q-Q plots)
— Homogeneity of variance (Levene's)
— Sphericity (Mauchly's) for repeated measures
— State which assumption tests passed/failed and any corrections applied
— F(df_between, df_within) = value, p = value, η² = value
— Confidence intervals for mean differences
— Specify which method (Tukey, Bonferroni, etc.) and why
— Report adjusted p-values
— η², partial η², or Cohen's f
— Distinguishes statistically significant from clinically meaningful
— Run analyses with/without outliers, with/without missing data imputed
— Avoid publication bias by reporting all pre-specified outcomes
— Pre-study power calculation: typical target 80% power at α=0.05 to detect a clinically meaningful effect
Step 3 management: When evaluating a published study using ANOVA, scan for: (1) clear group definitions, (2) assumption testing, (3) F-statistic with full df, (4) post-hoc method named, (5) effect size reported, (6) clinical significance discussed. Missing any of these is a quality flag worth noting in critical-appraisal stems.

— Power = 1 − β = probability of detecting a true effect; target ≥80%
— Inputs: α (typically 0.05), effect size (Cohen's f), number of groups, desired power → outputs required N per group
— Cohen's f benchmarks: 0.10 small, 0.25 medium, 0.40 large
— Rule of thumb: medium effect, 4 groups, 80% power → ~45 per group
— Repeated-measures designs need fewer subjects than an equivalent between-subject design; high within-subject correlation amplifies this advantage
— η² (eta-squared) = SS_between / SS_total
— 0.01 small, 0.06 medium, 0.14 large
— Partial η² = SS_effect / (SS_effect + SS_error); preferred in factorial and repeated-measures designs
— ω² (omega-squared) = less biased alternative to η², especially in small samples
— Confidence interval width reflects precision; narrower with larger N
— CIs that exclude 0 align with p < 0.05 for that pairwise comparison
— Statistically significant (p < 0.05) ≠ clinically meaningful
— A 0.1% HbA1c reduction across 5000 patients may reach p < 0.001 but be clinically trivial
— Minimum clinically important difference (MCID) anchors interpretation
— Underpowered studies: risk of type II error (false negative) and inflated effect estimates if significant ("winner's curse")
Board pearl: Always pair p-value interpretation with effect size and confidence intervals. A stem reporting F(2, 297) = 4.10, p = 0.018, η² = 0.027 shows statistical significance but small effect size (~2.7% of variance explained) — the test is not clinically practice-changing despite the p-value. Step 3 increasingly rewards this nuanced reading.
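When a stem reports only F and its degrees of freedom, η² for a one-way design can be recovered from them, since η² = SS_between / SS_total = (F × df_between) / (F × df_between + df_within). A short sketch using the F(2, 297) = 4.10 example above; the function names and the "negligible" label for values below 0.01 are assumptions added for illustration.

```python
from math import sqrt

def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta-squared for a one-way ANOVA, rewritten in terms of F and its df."""
    return f_stat * df_between / (f_stat * df_between + df_within)

def cohens_f(eta_sq: float) -> float:
    """Convert eta-squared to Cohen's f."""
    return sqrt(eta_sq / (1 - eta_sq))

def effect_label(eta_sq: float) -> str:
    """Benchmarks from the notes: 0.01 small, 0.06 medium, 0.14 large."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"   # label added here for values under the small benchmark

eta_sq = eta_squared_from_f(4.10, 2, 297)
print(f"eta^2 = {eta_sq:.3f} ({effect_label(eta_sq)}), Cohen's f = {cohens_f(eta_sq):.2f}")
```

The recovered value matches the η² = 0.027 quoted in the pearl, and the corresponding Cohen's f sits between the small and medium benchmarks.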

— Ethically, RCTs should pre-register hypotheses and primary outcomes (ClinicalTrials.gov)
— Post-hoc analyses cherry-picked from non-significant primary outcomes are misleading and threaten patient safety when they drive practice change
— Running multiple ANOVAs and selectively reporting significant ones inflates false-positive rates
— Constitutes research misconduct when intentional; misleads downstream clinicians
— Patients enrolling in 4-arm RCTs comparing treatments must understand they may receive placebo or older therapy
— Equipoise must exist among arms; if one is clearly superior pre-trial, randomization is unethical
— Pre-specified interim ANOVA analyses with stopping rules (e.g., O'Brien-Fleming boundaries) protect patients if early evidence shows harm or overwhelming benefit
— Unplanned peeks at data inflate type I error
— Subgroup analyses should be pre-specified; otherwise risk spurious findings that could harm specific populations if practice changes inappropriately
— Negative and non-significant ANOVA results must be reported to avoid publication bias
— Selective reporting harms patient safety by skewing evidence base
— Industry-sponsored studies must disclose funding; analytic choices (e.g., post-hoc test selection, exclusion of outliers) should be transparent
— When new evidence from an RCT (analyzed via ANOVA) changes guideline-recommended therapy, ensure outpatient follow-up systems convey changes to discharging providers; failure to update care plans post-discharge is a recognized patient safety gap
Step 3 management: If a stem describes researchers performing unplanned post-hoc subgroup ANOVAs after a non-significant primary outcome and concluding efficacy in a subgroup, the correct critique is that post-hoc findings are hypothesis-generating, not confirmatory, and should not drive clinical practice without prospective replication — a direct patient safety concern.

Board pearl: The single highest-yield trigger phrase for one-way ANOVA on Step 3 is "compared mean [continuous variable] across [3 or more] groups." For repeated-measures, it's "same patients measured at multiple timepoints." Pattern-match these and most ANOVA stems are answered.

— Stem: RCT, 4 antihypertensive arms, primary outcome = mean SBP change at 12 weeks
— Answer: One-way ANOVA
— Distractors: t-test (only 2 groups), chi-square (categorical), Kruskal-Wallis (non-parametric — only if outcome is ordinal/non-normal)
— Stem: Same 50 asthmatics' FEV1 measured at baseline, week 4, week 8, week 12 on new bronchodilator
— Answer: Repeated-measures ANOVA
— Distractor: one-way ANOVA (wrong — violates independence)
— Stem: Researchers ran 6 t-tests comparing 4 treatment arms pairwise
— Answer: Inflated family-wise type I error; should use one-way ANOVA with post-hoc correction
— Stem: F(3, 96) = 5.21, p = 0.002; researcher concludes drug A is best
— Answer: Conclusion premature; omnibus only indicates ≥1 group differs; need post-hoc (Tukey) to identify which
— Stem: Does drug efficacy differ by sex?
— Answer: Two-way ANOVA with interaction term
— Stem: Pain scores (0–10) across 3 analgesic groups, n=15 each, skewed distribution
— Answer: Kruskal-Wallis, not ANOVA
— Stem: Mauchly's test p = 0.01 in repeated-measures study
— Answer: Apply Greenhouse-Geisser correction
— Stem: p < 0.001, η² = 0.01 with N = 5000
— Answer: Statistically significant but clinically trivial; minimal variance explained
— Stem: 12 clinics randomized to 3 interventions; analyzed with one-way ANOVA on patient outcomes
— Answer: Ignores within-cluster correlation; use mixed-effects model
Step 3 management: Read every biostats stem with a checklist: (1) outcome type, (2) number of groups, (3) independent vs paired, (4) assumptions verified, (5) appropriate post-hoc. This 5-step scan resolves nearly all ANOVA stems.

ANOVA compares means of ≥3 groups on a continuous outcome via the F-ratio of between-group to within-group variance; choose one-way for independent groups, repeated-measures for the same subjects measured over time, and always follow a significant omnibus F with a post-hoc test (Tukey, Bonferroni, Dunnett) while verifying assumptions of normality, homogeneity of variance, and — for repeated measures — sphericity.
— One-way ANOVA = 1 factor, ≥3 independent groups, continuous outcome → F-test → if significant, Tukey HSD post-hoc
— Repeated-measures ANOVA = same subjects, ≥3 timepoints/conditions; check sphericity (Mauchly's), correct with Greenhouse-Geisser if violated
— Non-parametric analogs: Kruskal-Wallis (one-way), Friedman (repeated-measures) — use when data are ordinal, skewed, or small N
— Avoid multiple t-tests across ≥3 groups (inflates family-wise type I error to ~26% with 6 comparisons); ANOVA preserves overall α at 0.05
— Always pair p-value with effect size (η², Cohen's f) and confidence intervals — statistical significance ≠ clinical significance
— Escalate to two-way ANOVA for interactions, ANCOVA for covariate adjustment, MANOVA for multiple correlated outcomes, mixed-effects models for clustered or missing-data longitudinal designs
Board pearl: The Step 3 ANOVA question reduces to three structural questions: (1) Is the outcome continuous? (2) How many groups? (3) Independent or repeated? Answer those and you've answered the stem — pattern recognition trumps computation every time on test day.

