Biostatistics & Population Health

ANOVA: one-way and repeated measures interpretation

Clinical Overview and When to Suspect ANOVA

ANOVA (Analysis of Variance) compares the means of 3 or more groups simultaneously on a continuous outcome while controlling the family-wise type I error rate.

Suspect ANOVA when a study question asks: "Do these multiple groups differ on average?"

— One-way ANOVA: 1 categorical independent variable (factor) with ≥3 levels, 1 continuous dependent variable

— Example: Mean HbA1c across patients on metformin vs sulfonylurea vs GLP-1 agonist vs placebo

— Repeated-measures ANOVA: same subjects measured ≥3 times or under ≥3 conditions (within-subject design)

— Example: Mean systolic BP at baseline, 6 weeks, 12 weeks in the same hypertensive cohort

Why not run multiple t-tests?

— Each t-test at α=0.05 carries a 5% type I error risk; running all 6 pairwise tests among 4 groups inflates the family-wise error to ~26% (1 − 0.95⁶, assuming independent tests)

— ANOVA preserves overall α at 0.05 with a single omnibus F-test

Output is the F-statistic and a p-value:

— F = (between-group variance) / (within-group variance)

— Large F → groups differ more than random noise within groups → reject H₀ that all means are equal

ANOVA tells you at least one group differs, not which one; post-hoc tests (Tukey, Bonferroni, Scheffé) localize the difference

Continuous outcome examples on Step 3: blood pressure, ejection fraction, pain score, lab values, length of stay (if normally distributed)

Board pearl: If a stem says "researchers compared mean values across four treatment arms," the answer is one-way ANOVA, not multiple t-tests, not chi-square. Chi-square is for categorical outcomes; t-test is for 2 groups only. Recognizing the structure of the data (number of groups, type of outcome, independent vs paired) is the entire game for Step 3 biostats questions.
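The ~26% family-wise error figure quoted in this section can be reproduced with a few lines of arithmetic. The sketch below assumes the pairwise tests are independent, which is a simplification; the function name is illustrative only.

```python
from math import comb

def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_groups = 4
n_pairs = comb(n_groups, 2)        # all pairwise comparisons among 4 groups
fwer = familywise_error(n_pairs)   # 1 - 0.95**6
print(f"{n_pairs} pairwise tests -> family-wise error about {fwer:.3f}")
```

With 4 groups this gives 6 comparisons and a family-wise error of roughly 0.26, which is the inflation a single omnibus F-test avoids.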

Presentation Patterns and Key History

Step 3 stems present ANOVA scenarios in stereotyped formats; learn to pattern-match the study design language:

— "Three (or more) groups" + "mean," "average," or a continuous variable → ANOVA territory

— "Same patients followed over time" or "measured before, during, and after" → repeated-measures ANOVA

— "Independent groups" or "randomized to one of four arms" → one-way ANOVA

Common clinical research framings:

— RCT comparing mean LDL reduction across atorvastatin, rosuvastatin, simvastatin, and placebo (one-way)

— Pain scores recorded at 0, 30, 60, 90 minutes after analgesia in the same ED cohort (repeated-measures)

— Mean weight loss across 5 bariatric procedure types in different patient cohorts (one-way)

— Same COPD patients' FEV1 measured pre-bronchodilator, 1 hr post, and 24 hr post (repeated-measures)

Key history clues in the vignette:

— Number of groups stated explicitly ("four arms," "three diets")

— Outcome described with words like mean, average, change in, level of

— Phrase "same subjects" or "each participant served as their own control" = repeated measures

— Phrase "randomly assigned" to distinct cohorts = independent groups → one-way

Watch for distractor language:

— "Proportion who responded" → categorical → chi-square, not ANOVA

— "Survival time" → log-rank/Cox, not ANOVA

— "Two groups only" → t-test, not ANOVA

Key distinction: Independence of observations is what separates one-way from repeated-measures ANOVA. If a single patient contributes one data point → one-way. If a single patient contributes multiple data points across time/conditions → repeated-measures. Mis-applying one-way ANOVA to paired data violates the independence assumption and invalidates the F-test (typically wasting power, because between-subject variance stays in the error term). This is a classic Step 3 "what's wrong with this analysis" stem.

Physical Exam Findings (Assumptions and Data Structure Assessment)

ANOVA's "exam findings" are its statistical assumptions, the things a reviewer or test-taker must verify before trusting an F-test result:

One-way ANOVA assumptions:

— Independence of observations across and within groups (no patient in two arms; no clustering)

— Normality of the outcome within each group (or large enough n per group via CLT, typically ≥30)

— Homogeneity of variance (homoscedasticity): variances roughly equal across groups; tested via Levene's test or Bartlett's test

— Continuous (interval/ratio) dependent variable

— Categorical independent variable with ≥3 levels

Repeated-measures ANOVA additional assumption:

— Sphericity: variances of the differences between all pairs of within-subject conditions are equal

— Tested via Mauchly's test; if violated, apply Greenhouse-Geisser or Huynh-Feldt correction to degrees of freedom

Quick "exam" inspection:

— Look at group sample sizes: balanced designs are robust; very unbalanced designs amplify variance violations

— Examine standard deviations: if the largest SD is >2× the smallest SD, homogeneity is suspect

— Plot residuals; check histograms for skew

When assumptions fail:

— Severe non-normality with small n → Kruskal-Wallis (non-parametric one-way analog)

— Repeated measures with non-normal data → Friedman test (non-parametric repeated-measures analog)

— Unequal variances → Welch's ANOVA

Board pearl: A Step 3 stem describing pain scores (ordinal 0–10) across 3 groups with skewed distributions and n=12 per arm should NOT use one-way ANOVA; the correct answer is Kruskal-Wallis. Recognize when ordinal or non-normal data invalidate parametric ANOVA.
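The ">2× SD" screening rule from the quick inspection list can be written as a small check. This is only a sketch of that rule of thumb, not a replacement for Levene's test. The SD values and function name are made up for illustration.

```python
def sd_ratio_flag(group_sds, max_ratio=2.0):
    """Return (largest SD / smallest SD, True if the ratio exceeds max_ratio)."""
    largest, smallest = max(group_sds), min(group_sds)
    if smallest <= 0:
        raise ValueError("standard deviations must be positive")
    ratio = largest / smallest
    return ratio, ratio > max_ratio

# Hypothetical per-arm SDs of systolic BP (mmHg)
ratio, suspect = sd_ratio_flag([8.0, 9.5, 18.0])
print(f"SD ratio = {ratio:.2f}; homogeneity suspect: {suspect}")
```

A flagged ratio is a prompt to run a formal test or switch to Welch's ANOVA, not a verdict on its own.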

Diagnostic Workup — The F-Statistic and P-Value

The core "lab test" of ANOVA is the F-ratio:

— F = MS_between / MS_within

— MS_between (mean square between groups) = variance attributable to group membership

— MS_within (mean square within groups, aka error) = random variability among subjects in the same group

— Under H₀ (all means equal), F ≈ 1; under H₁, F tends to be well above 1

Degrees of freedom (df) determine the F-distribution shape:

— df_between = k − 1 (k = number of groups)

— df_within = N − k (N = total sample size)

— Example: 4 groups, 20 patients each → N = 80, df = (3, 76)

Interpreting the omnibus p-value:

— p < 0.05 → reject H₀ → at least one group mean differs from at least one other

— p ≥ 0.05 → fail to reject; conclude insufficient evidence of any difference

— A significant F does not tell you which groups differ; that requires post-hoc testing

Sources of variance partitioned:

— Total SS = Between-group SS + Within-group SS (one-way)

— Repeated-measures: Total SS = Between-subjects SS + Within-subjects SS (treatment + error); subject-level variance is removed from error, increasing power

Reporting format on exams: F(df_between, df_within) = value, p = value

— e.g., F(2, 87) = 6.42, p = 0.003

Step 3 management: When a vignette shows F(3, 96) = 4.81, p = 0.004, the correct next step is post-hoc pairwise comparisons (Tukey HSD most common), not concluding which specific drug is best, and not running individual t-tests without correction. Recognize that the omnibus test is a screen; post-hoc is the confirmatory localization.
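The degrees-of-freedom rules and the reporting format in this section are mechanical enough to sketch in code. The helper names below are illustrative, and the p-value is passed in rather than computed, since the standard library has no F-distribution.

```python
def one_way_df(k, n_total):
    """df pair for a one-way ANOVA: (k - 1, N - k)."""
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    return k - 1, n_total - k

def format_f_report(f_value, df, p_value):
    """Render the result in the F(df_between, df_within) = value, p = value style."""
    return f"F({df[0]}, {df[1]}) = {f_value:.2f}, p = {p_value:.3f}"

df_example = one_way_df(k=4, n_total=80)   # 4 groups of 20 patients
print(df_example)
print(format_f_report(6.42, (2, 87), 0.003))
```

Checking that the first df equals the number of groups minus one is a quick way to confirm that a reported result matches the design described in the stem.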

Diagnostic Workup — Post-Hoc Tests and Confirmatory Analysis

After a significant omnibus ANOVA, post-hoc tests identify which group pairs differ while controlling family-wise error:

Tukey's HSD (Honestly Significant Difference)

— Most commonly tested on Step 3

— Used when all pairwise comparisons are of interest

— Balances type I error and power well in balanced designs

Bonferroni correction

— Divides α by number of comparisons (α/m)

— Conservative; reduces power; simple to compute

— Best when only a few pre-planned comparisons

Scheffé's method

— Most conservative; allows complex contrasts (e.g., average of groups A+B vs C)

— Lower power, but flexible

Dunnett's test

— Specifically for comparing multiple treatments against a single control

— Ideal in placebo-controlled multi-arm RCTs

Holm-Bonferroni / FDR (Benjamini-Hochberg)

— Holm is a step-down procedure that controls family-wise error; Benjamini-Hochberg is a step-up procedure that controls the false discovery rate instead; both are more powerful than plain Bonferroni

For repeated-measures ANOVA, post-hoc tests are paired comparisons (paired t-tests with Bonferroni or Tukey adjustment)

Effect size reporting complements significance:

— η² (eta-squared) = SS_between / SS_total; proportion of variance explained

— Partial η² for repeated measures

— Cohen's f: small 0.10, medium 0.25, large 0.40

Confidence intervals for mean differences provide direction and magnitude beyond p-values, and are increasingly emphasized in modern biostats reporting

Key distinction: Bonferroni vs Tukey. Bonferroni is a generic multiple-comparison correction applicable to any set of tests; Tukey is specifically designed for all pairwise comparisons of group means after ANOVA and is typically more powerful for that exact purpose. On Step 3, if the stem says "researchers wanted to compare every group with every other group," Tukey is the cleanest answer.
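To make the Bonferroni and Holm adjustments concrete, here is a minimal sketch that converts raw pairwise p-values into adjusted p-values. The raw p-values are invented for illustration, and Tukey HSD is not shown because it needs the studentized range distribution.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down: multiply the i-th smallest p by (m - i), keep results monotone."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)   # adjusted p-values never decrease
        adjusted[idx] = running_max
    return adjusted

raw = [0.010, 0.020, 0.030, 0.040]   # hypothetical post-hoc p-values
print("Bonferroni:", [round(p, 3) for p in bonferroni(raw)])
print("Holm:      ", [round(p, 3) for p in holm(raw)])
```

Holm's adjusted values are never larger than Bonferroni's, which is why it is described as the more powerful of the two while still controlling family-wise error.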

Risk Stratification — Choosing the Right ANOVA Variant

Decision algorithm for ANOVA selection on Step 3:

Step 1: Is the outcome continuous and approximately normal?

— Yes → proceed with ANOVA family

— No (ordinal/skewed, small n) → Kruskal-Wallis (independent) or Friedman (repeated)

Step 2: How many independent variables (factors)?

— One factor with ≥3 levels → one-way ANOVA

— Two factors (e.g., drug × sex) → two-way ANOVA, which also tests interaction effects

— Three+ factors → factorial ANOVA

Step 3: Are subjects independent across groups, or measured repeatedly?

— Independent → one-way or two-way ANOVA

— Same subjects across conditions/time → repeated-measures ANOVA

— Mix of between- and within-subject factors → mixed-design ANOVA (e.g., drug A vs B, measured at 3 timepoints)

Step 4: Are there covariates to control?

— Add continuous covariates → ANCOVA (analysis of covariance); useful for adjusting for baseline values, age, BMI

Step 5: Multiple correlated outcomes simultaneously?

— MANOVA (multivariate ANOVA), e.g., simultaneous comparison of systolic BP, diastolic BP, and heart rate across groups

Power considerations:

— Sample size per group should typically be ≥10–20 for ANOVA to be robust

— Repeated measures designs gain power by removing between-subject variance, so fewer subjects are needed than in an equivalent one-way design

Board pearl: A two-way ANOVA's most powerful feature is the interaction term. If a stem asks "does the effect of the drug differ by sex?", the answer requires two-way ANOVA with interaction, not two separate one-way ANOVAs. The interaction p-value answers the differential-effect question directly.
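The five-step decision algorithm in this section can be sketched as a single function. The order in which the steps are checked (multiple outcomes, then covariates, then design, then factor count) is an assumption made for this example, and the function and argument names are hypothetical.

```python
def choose_anova_variant(parametric_ok, n_factors, design,
                         has_covariate=False, n_outcomes=1):
    """Map design features to a test name; design is 'independent', 'repeated' or 'mixed'."""
    if design not in ("independent", "repeated", "mixed"):
        raise ValueError("design must be 'independent', 'repeated' or 'mixed'")
    if not parametric_ok:                      # Step 1: ordinal/skewed or small n
        if design == "independent":
            return "Kruskal-Wallis test"
        if design == "repeated":
            return "Friedman test"
        raise ValueError("no rank-based option is listed for mixed designs")
    if n_outcomes > 1:                         # Step 5: several correlated outcomes
        return "MANOVA"
    if has_covariate:                          # Step 4: continuous covariate
        return "ANCOVA"
    if design == "mixed":                      # Step 3: subject structure
        return "mixed-design ANOVA"
    if design == "repeated":
        return "repeated-measures ANOVA"
    if n_factors == 1:                         # Step 2: number of factors
        return "one-way ANOVA"
    if n_factors == 2:
        return "two-way ANOVA"
    return "factorial ANOVA"

print(choose_anova_variant(True, 1, "independent"))
print(choose_anova_variant(True, 2, "independent"))
print(choose_anova_variant(False, 1, "repeated"))
```

Real designs can combine features (for example, a repeated-measures design with a covariate), so treat this as a mnemonic for the outline rather than a complete classifier.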

Pharmacotherapy — Computational Mechanics of One-Way ANOVA

Conceptual computation (Step 3 won't ask you to calculate by hand, but understanding the structure helps interpret output):

Grand mean (X̄_grand) = mean of all observations pooled

Group means (X̄_j) = mean within each group j

Sum of Squares Between (SS_B) = Σ n_j (X̄_j − X̄_grand)²

— Captures how far each group mean deviates from the grand mean, weighted by group size

Sum of Squares Within (SS_W) = Σ Σ (x_ij − X̄_j)²

— Captures residual variability within each group

Total SS = SS_B + SS_W

Mean Squares (MS) = SS / df

— MS_B = SS_B / (k − 1)

— MS_W = SS_W / (N − k)

F = MS_B / MS_W, compared to critical F at α with df (k−1, N−k)

Worked example:

— 3 drugs, 10 patients each, comparing mean LDL reduction

— Group means: 30, 45, 50 mg/dL; grand mean 41.67

— SS_B = 10 × [(30 − 41.67)² + (45 − 41.67)² + (50 − 41.67)²] ≈ 2167; suppose SS_W = 1620

— MS_B = 2167/2 ≈ 1083.3; MS_W = 1620/27 = 60

— F = 1083.3/60 ≈ 18.1; df = (2, 27); p < 0.001 → reject H₀

Interpretation: at least one drug produces significantly different mean LDL reduction; Tukey HSD then identifies which drug pair(s) differ

A small F (<1) means within-group noise exceeds between-group signal, so there is no evidence of true mean differences

Step 3 management: When given ANOVA output in a table, focus on three numbers: the F-statistic, the p-value, and the df pair. Confirm df_between = k−1 matches the number of groups described. A mismatch (e.g., stem says 4 drugs but df_between = 2) signals the analysis was misapplied or you're misreading the design.
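The computation steps above can be traced in code using the group means and sizes from the worked example. This sketch recomputes SS_B directly from the stated means via the SS_B definition and takes SS_W = 1620 as given; the function name is illustrative.

```python
def one_way_from_summary(group_means, group_sizes, ss_within):
    """Build the one-way ANOVA F-ratio from group summaries and a given SS_W."""
    k = len(group_means)
    n_total = sum(group_sizes)
    grand_mean = sum(n * m for n, m in zip(group_sizes, group_means)) / n_total
    # SS_B = sum over groups of n_j * (group mean - grand mean)^2
    ss_between = sum(n * (m - grand_mean) ** 2
                     for n, m in zip(group_sizes, group_means))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "grand_mean": grand_mean,
        "ss_between": ss_between,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
        "df": (k - 1, n_total - k),
    }

result = one_way_from_summary([30, 45, 50], [10, 10, 10], ss_within=1620)
for name, value in result.items():
    print(name, value if name == "df" else round(value, 2))
```

With these inputs SS_B comes out near 2167 and F near 18.1 on (2, 27) degrees of freedom.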

Procedures — Repeated-Measures ANOVA Mechanics and Sphericity

Repeated-measures ANOVA partitions variability differently than one-way:

— Total SS = Between-subjects SS + Within-subjects SS

— Within-subjects SS is further split into treatment/time SS and error SS

— Removing between-subject variability from the error term increases statistical power, a major advantage when subjects vary widely at baseline

Sphericity assumption is unique and crucial:

— Variances of the differences between every pair of repeated conditions must be approximately equal

— Violations are common when measurements taken closer together in time are more strongly correlated than those taken farther apart (autocorrelation)

— Mauchly's test screens for sphericity; p < 0.05 indicates violation

Corrections when sphericity is violated:

— Greenhouse-Geisser ε: multiplies df by ε (more conservative)

— Huynh-Feldt ε: used when GG ε > 0.75 (less conservative)

— Both shrink df, raising the critical F threshold and protecting type I error

Alternative: MANOVA approach, which treats the repeated measures as multiple correlated DVs and avoids sphericity entirely; preferred when sphericity is severely violated and sample size is adequate

Example design: Same 25 hypertensive patients measured at 0, 4, 8, and 12 weeks on a new ACE inhibitor

— DV: systolic BP; within-subject factor: time (4 levels)

— Output: F(3, 72) = 14.2, p < 0.001 → mean BP differs between at least one pair of timepoints

— Post-hoc paired comparisons with Bonferroni identify which timepoints differ from baseline

Mixed-design caveat: when both between-subjects (e.g., drug A vs B) and within-subjects (e.g., time) factors exist, interpret main effects AND interaction term carefully

CCS pearl: In longitudinal clinical research (post-MI ejection fraction at 1, 3, 6, 12 months in the same cohort), repeated-measures ANOVA, not separate one-way ANOVAs at each timepoint, is the correct framework. Separate analyses ignore within-subject correlation and waste power.
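The degrees of freedom in the example design, and how an ε correction shrinks them, can be sketched as follows. The formula df = (k − 1, (k − 1)(n − 1)) is the standard one for a single within-subject factor. The ε value of 0.70 is a made-up illustration, not an estimate from any data.

```python
def rm_anova_df(n_subjects, n_conditions, epsilon=1.0):
    """df pair for a one-factor repeated-measures ANOVA, scaled by a sphericity epsilon."""
    k = n_conditions
    lower_bound = 1 / (k - 1)                 # smallest value epsilon can take
    if not (lower_bound <= epsilon <= 1.0):
        raise ValueError("epsilon must lie between 1/(k-1) and 1")
    df_effect = (k - 1) * epsilon
    df_error = (k - 1) * (n_subjects - 1) * epsilon
    return df_effect, df_error

uncorrected = rm_anova_df(n_subjects=25, n_conditions=4)               # sphericity assumed
corrected = rm_anova_df(n_subjects=25, n_conditions=4, epsilon=0.70)   # hypothetical epsilon
print("uncorrected df:", uncorrected)
print("corrected df:  ", tuple(round(d, 1) for d in corrected))
```

The F value itself is unchanged by the correction; only the reference distribution moves, which is why the corrected test is harder to pass.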

Special Populations — Small Samples and Non-Normal Data

ANOVA is parametric and assumes normality; in small samples or skewed data, validity drops:

Small sample size (n < 10 per group):

— Central Limit Theorem cannot rescue normality assumption

— Shapiro-Wilk test on residuals helps confirm; visual inspection of Q-Q plots

— If non-normal → use Kruskal-Wallis test (rank-based, non-parametric one-way analog)

— Kruskal-Wallis tests whether medians/distributions differ across ≥3 independent groups; post-hoc via Dunn's test

Repeated-measures with non-normal/ordinal data:

— Use Friedman test (non-parametric repeated-measures analog)

— Post-hoc: Wilcoxon signed-rank pairwise comparisons with Bonferroni

Unequal variances (heteroscedasticity):

— Welch's ANOVA adjusts degrees of freedom and is robust to variance inequality

— Post-hoc: Games-Howell test

Unbalanced designs (very unequal group sizes):

— Type III sums of squares preferred over Type I in regression-based ANOVA frameworks

— Power skewed toward larger groups

Outliers:

— ANOVA F-test is sensitive to extreme values

— Investigate clinically (data entry error vs true biologic outlier)

— Sensitivity analyses excluding outliers can demonstrate robustness

Ordinal outcomes (e.g., pain VAS, Likert scales):

— Strictly, ANOVA is inappropriate; non-parametric alternatives or ordinal regression are correct

— Many published studies use ANOVA anyway when scales have ≥7 levels and look continuous, a common Step 3 critique

Key distinction: Kruskal-Wallis vs one-way ANOVA: same study question (≥3 group comparison), different data types. Kruskal-Wallis when outcome is ordinal, skewed, or n small. ANOVA when continuous, normal, and adequately powered. A stem describing "physician-rated symptom severity (mild/moderate/severe)" across 4 treatment arms should trigger Kruskal-Wallis, not ANOVA.
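Because Kruskal-Wallis is described here as a rank-based analog of one-way ANOVA, a short sketch of its H statistic may help show what "rank-based" means. This version omits the correction for ties, and the pain scores are invented for illustration. H is then referred to a chi-square distribution with k − 1 degrees of freedom.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), without tie correction."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Hypothetical pain scores in three analgesic arms
h_stat = kruskal_wallis_h([[2, 3, 5], [4, 6, 7], [8, 9, 10]])
print(f"H = {h_stat:.3f} on {3 - 1} degrees of freedom")
```

Only the ordering of the scores enters the statistic, which is why the test tolerates skew and ordinal scales that would distort group means.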

Special Populations — Clustered Data, Multi-Site Trials, and Hierarchical Designs

Standard one-way ANOVA assumes independent observations, which is violated in many real-world Step 3-relevant designs:

Cluster-randomized trials:

— Randomization at clinic/hospital level, not patient level (e.g., 10 clinics randomized to 1 of 3 quality-improvement strategies)

— Patients within a clinic are correlated → cannot treat each patient as independent

— Correct analysis: mixed-effects (multilevel) models with cluster as a random effect, or GEE with a working correlation structure; not naive one-way ANOVA

Family/twin studies, repeated organ measurements (left/right kidney):

— Intra-cluster correlation must be accounted for

Multi-site RCTs:

— Site can be a fixed or random effect; ignoring it inflates type I error if site effects exist

Pediatric longitudinal cohorts:

— Growth curves over time → mixed models or repeated-measures ANOVA preferred over cross-sectional one-way at each visit

Pregnancy studies:

— Pre/post-partum measurements on same women → repeated-measures framework

Elderly with missing data:

— Dropouts in long follow-up studies are a problem because classical repeated-measures ANOVA requires complete data for every subject

— Mixed-effects models handle missing-at-random data better and have largely replaced repeated-measures ANOVA in modern analyses

Crossover trials:

— Each patient receives multiple treatments in sequence → within-subject design → repeated-measures ANOVA or paired analyses

— Carryover and period effects must be assessed

Board pearl: When a Step 3 stem describes a cluster-randomized trial and analyzes outcomes with one-way ANOVA treating individual patients as independent units, the methodological flaw is ignoring within-cluster correlation, which falsely inflates effective sample size and increases type I error. The correct analysis uses mixed/multilevel models.
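One way to see why ignoring within-cluster correlation overstates the effective sample size is the standard design-effect formula, DEFF = 1 + (m − 1) × ICC, which is not part of the outline above and is added here as an illustration. The sketch assumes equal cluster sizes, and the trial size and ICC are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(cluster_size, icc)

# Hypothetical trial: 10 clinics x 50 patients, intra-cluster correlation 0.05
deff = design_effect(cluster_size=50, icc=0.05)
n_eff = effective_sample_size(n_total=500, cluster_size=50, icc=0.05)
print(f"design effect = {deff:.2f}; effective N is about {n_eff:.0f} of 500")
```

Even a small ICC cuts the information in the sample substantially when clusters are large, which is the error a naive one-way ANOVA on patients would hide.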

Complications — Misinterpretation and Statistical Errors

Common errors that show up as Step 3 "what's wrong with this analysis" stems:

Running multiple t-tests instead of ANOVA:

— Inflates family-wise type I error; 6 pairwise tests at α=0.05 → ~26% chance of at least one false positive

— Correct: omnibus ANOVA first, then post-hoc with correction

Concluding "Drug A is best" from a significant omnibus F alone:

— Omnibus only says "at least one differs"; without post-hoc, you cannot identify which

Ignoring sphericity in repeated measures:

— Without Greenhouse-Geisser correction when warranted → inflated false-positive rate

Applying one-way ANOVA to paired data:

— Violates independence; correct analysis is repeated-measures ANOVA

Confusing statistical significance with clinical significance:

— Large N can detect trivial mean differences (e.g., 1 mmHg BP across drugs); report effect sizes (η², Cohen's f) and confidence intervals

Multiple outcome inflation:

— Running ANOVA on 20 different outcomes without correction → some will be "significant" by chance

— Use MANOVA or Bonferroni-style adjustment across outcomes

Outcome dichotomization:

— Converting continuous data to categorical (e.g., HbA1c into controlled/uncontrolled) loses information and statistical power

Post-hoc cherry-picking:

— Running every possible subgroup ANOVA and reporting only significant ones = p-hacking (and HARKing when the result is then presented as a pre-planned hypothesis); misleading

Assuming causation from observational ANOVA:

— ANOVA across naturally occurring groups (e.g., smokers vs non-smokers vs ex-smokers) shows association, not causation

Step 3 management: When you see "researchers performed six independent t-tests to compare four treatment groups," the correct critique is family-wise type I error inflation, and the appropriate corrective analysis is one-way ANOVA followed by Tukey HSD (or equivalent). Recognize this pattern instantly.

When to Escalate — Switching to More Advanced Models

ANOVA is a workhorse but has limits; recognize when a more advanced model is required:

Add a continuous covariate → ANCOVA

— Example: comparing mean weight loss across 3 diets while adjusting for baseline weight

— Improves precision and reduces bias from baseline imbalances

Add a second categorical factor → Two-way (or factorial) ANOVA

— Tests main effects of each factor and their interaction

— Example: drug (3 levels) × sex (2 levels) on BP: does drug effect differ by sex?

Multiple correlated continuous outcomes → MANOVA

— Example: simultaneous comparison of SBP, DBP, HR across 4 antihypertensives

— Controls error across outcomes; follow significant MANOVA with univariate ANOVAs

Time-to-event outcome → Survival analysis (Kaplan-Meier, log-rank, Cox regression)

— Not ANOVA territory; censoring requires survival methods

Binary outcomes → Logistic regression / chi-square

Counts (e.g., # of ED visits) → Poisson or negative binomial regression

Clustered/hierarchical data → Mixed-effects models (multilevel modeling)

— Patients nested in clinics, repeated measures with missing data, longitudinal trajectories

Non-linear time trajectories → Growth curve modeling

Need to model dose-response → Linear regression with continuous predictor, not ANOVA forcing dose into categories

Heteroscedastic or non-normal → Welch's ANOVA, robust ANOVA, or non-parametric analogs

CCS pearl: When a study question involves longitudinal data with missing observations, patient-level random effects, or time-varying covariates, escalate from repeated-measures ANOVA to linear mixed-effects models (LMM). LMMs are the modern standard for longitudinal clinical research and are increasingly the "correct answer" on advanced biostats stems.

Key Differentials — Other Tests for Comparing Means

Other tests in the "comparing means" family; know when each applies:

One-sample t-test:

— Compares a single sample mean to a known/hypothesized value (e.g., is mean HbA1c in clinic ≠ 7.0%?)

Independent (two-sample) t-test:

— Compares means of exactly 2 independent groups with continuous, normally distributed outcome

— If only 2 groups → t-test, NOT ANOVA (though one-way ANOVA with 2 groups is mathematically equivalent: F = t²)

Paired t-test:

— Compares means of 2 measurements in the same subjects (before/after)

— Repeated-measures ANOVA with 2 timepoints is equivalent

Welch's t-test:

— Two-group comparison when variances are unequal

One-way ANOVA:

— ≥3 independent groups, continuous outcome

Two-way / factorial ANOVA:

— ≥2 factors; tests main effects and interactions

Repeated-measures ANOVA:

— ≥3 measurements in same subjects

Mixed-design ANOVA:

— Combines between-subject and within-subject factors

ANCOVA:

— ANOVA + continuous covariate adjustment

MANOVA:

— Multiple continuous DVs simultaneously

Quick decision table:

— 1 group vs known value → 1-sample t

— 2 independent groups → independent t-test

— 2 paired measurements → paired t-test

— ≥3 independent groups → one-way ANOVA

— ≥3 paired/repeated measurements → repeated-measures ANOVA

Key distinction: Why not just do an ANOVA on 2 groups? Mathematically it works (F = t²), but t-test is conventional and conceptually clearer. Step 3 expects t-test for 2 groups and ANOVA for ≥3. Choosing the wrong category (e.g., ANOVA when only 2 groups exist) is a flagged error pattern.
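The F = t² equivalence mentioned in this section can be checked numerically. The sketch below computes a pooled-variance (Student) t statistic and a one-way ANOVA F on the same two groups. The HbA1c values are made up for illustration, and the identity does not hold for Welch's t-test.

```python
from statistics import mean

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def one_way_f(groups):
    """One-way ANOVA F = MS_between / MS_within from raw data."""
    pooled = [x for g in groups for x in g]
    grand = mean(pooled)
    k, n_total = len(groups), len(pooled)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical HbA1c values (%) in two arms
arm_a = [7.1, 6.8, 7.4, 7.9, 6.5]
arm_b = [6.2, 6.0, 6.9, 5.8, 6.4]
t_stat = pooled_t(arm_a, arm_b)
f_stat = one_way_f([arm_a, arm_b])
print(f"t squared = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

The two statistics agree up to floating-point rounding, so the choice between them for 2 groups is a matter of convention rather than of results.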

Key Differentials — Non-Parametric and Categorical Alternatives

When ANOVA assumptions fail or data type changes, use these alternatives:

Kruskal-Wallis test:

— Non-parametric one-way ANOVA analog

— Use for: ordinal outcomes, skewed continuous data, small samples

— Tests whether distributions/medians differ across ≥3 independent groups

— Post-hoc: Dunn's test with Bonferroni

Friedman test:

— Non-parametric repeated-measures ANOVA analog

— ≥3 paired/repeated measurements with ordinal or non-normal data

— Post-hoc: pairwise Wilcoxon signed-rank with Bonferroni

Mann-Whitney U (Wilcoxon rank-sum):

— Non-parametric two-group comparison (analog of independent t-test)

Wilcoxon signed-rank:

— Non-parametric paired comparison (analog of paired t-test)

Chi-square test of independence:

— Categorical outcome × categorical predictor; e.g., proportion cured across 3 treatment arms

— NOT for continuous outcomes

Fisher's exact test:

— Chi-square alternative when expected cell counts <5

McNemar's test:

— Paired binary data (e.g., whether two diagnostic tests applied to the same patients differ in their positive rates)

Cochran's Q test:

— Extension of McNemar to ≥3 paired binary outcomes

Log-rank test:

— Compares survival curves across ≥2 groups

Linear regression:

— Models continuous outcome on continuous or categorical predictors; ANOVA is a special case of linear regression where predictors are categorical

Board pearl: "Proportion of patients achieving remission across 4 chemotherapy regimens" → chi-square, not ANOVA. ANOVA's outcome must be continuous. If the outcome is count of events, yes/no, cured/not cured, or survival time, you're outside ANOVA's domain. Distinguishing outcome type (continuous vs categorical vs time-to-event) is the single most testable biostats concept on Step 3.

Secondary Prevention — Reporting Standards and Best Practices

High-quality ANOVA reporting (CONSORT, STROBE-aligned) includes:

Pre-specification:

— Define hypotheses, primary outcome, and analysis plan before data collection

— Distinguish primary from secondary/exploratory analyses

— Prevents p-hacking and HARKing

Descriptive statistics first:

— Report group means, SDs, and Ns before inferential tests

— Include sample size per group; flag any imbalances

Test assumptions and report:

— Normality (Shapiro-Wilk, Q-Q plots)

— Homogeneity of variance (Levene's)

— Sphericity (Mauchly's) for repeated measures

— State which assumption tests passed/failed and any corrections applied

Report complete F-statistic format:

— F(df_between, df_within) = value, p = value, η² = value

— Confidence intervals for mean differences

Post-hoc tests:

— Specify which method (Tukey, Bonferroni, etc.) and why

— Report adjusted p-values

Effect sizes:

— η², partial η², or Cohen's f

— Distinguishes statistically significant from clinically meaningful

Sensitivity analyses:

— Run analyses with/without outliers, with/without missing data imputed

Transparent reporting of non-significant results:

— Avoid publication bias by reporting all pre-specified outcomes

Sample size justification:

— Pre-study power calculation: typical target 80% power at α=0.05 to detect a clinically meaningful effect

Step 3 management: When evaluating a published study using ANOVA, scan for: (1) clear group definitions, (2) assumption testing, (3) F-statistic with full df, (4) post-hoc method named, (5) effect size reported, (6) clinical significance discussed. Missing any of these is a quality flag worth noting in critical-appraisal stems.

Follow-Up — Power, Sample Size, and Effect Size Interpretation

Power analysis for ANOVA designs:

— Power = 1 − β = probability of detecting a true effect; target ≥80%

— Inputs: α (typically 0.05), effect size (Cohen's f), number of groups, desired power → outputs required N per group

— Cohen's f benchmarks: 0.10 small, 0.25 medium, 0.40 large

— Rule of thumb: medium effect, 4 groups, 80% power → ~45 per group

Repeated-measures ANOVA gains power by reducing error variance, so fewer subjects are needed than in an equivalent one-way design for the same effect

— High within-subject correlation amplifies this advantage

Effect size measures:

— η² (eta-squared) = SS_between / SS_total

— 0.01 small, 0.06 medium, 0.14 large

— Partial η² = SS_effect / (SS_effect + SS_error); preferred in factorial and repeated-measures designs

— ω² (omega-squared) = less biased alternative to η², especially in small samples

Confidence intervals on mean differences:

— Width reflects precision; narrower with larger N

— CIs that exclude 0 align with p < 0.05 for that pairwise comparison

Clinical vs statistical significance:

— Statistically significant (p < 0.05) ≠ clinically meaningful

— A 0.1% HbA1c reduction across 5000 patients may reach p < 0.001 but be clinically trivial

— Minimum clinically important difference (MCID) anchors interpretation

Underpowered studies:

— Risk of type II error (false negative) and inflated effect estimates if significant ("winner's curse")

Post-hoc power calculations are generally discouraged (uninformative); pre-study power planning is the standard

Board pearl: Always pair p-value interpretation with effect size and confidence intervals. A stem reporting F(2, 297) = 4.10, p = 0.018, η² = 0.027 shows statistical significance but a small effect size (~2.7% of variance explained); the finding is unlikely to be practice-changing despite the p-value. Step 3 increasingly rewards this nuanced reading.
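For a one-way design, η² can be recovered from a reported F and its df pair, because η² = (F × df_between) / (F × df_between + df_within). The sketch below applies that identity to the F(2, 297) = 4.10 example from this section and converts the result to Cohen's f. The "negligible" label for values below 0.01 is an assumption added for completeness.

```python
from math import sqrt

def eta_squared_from_f(f_value, df_between, df_within):
    """Proportion of variance explained, recovered from a one-way ANOVA F."""
    return (f_value * df_between) / (f_value * df_between + df_within)

def cohens_f(eta_sq):
    """Cohen's f = sqrt(eta^2 / (1 - eta^2))."""
    return sqrt(eta_sq / (1 - eta_sq))

def label_eta_squared(eta_sq):
    """Apply the 0.01 / 0.06 / 0.14 benchmarks."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"   # below the smallest benchmark (label assumed here)

eta_sq = eta_squared_from_f(4.10, 2, 297)
print(f"eta squared = {eta_sq:.3f} ({label_eta_squared(eta_sq)}), "
      f"Cohen's f = {cohens_f(eta_sq):.3f}")
```

The result is an η² of about 0.027 with a significant p-value, which shows how a statistically significant finding can still sit in the "small" band.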

Ethical, Legal, and Patient Safety Considerations

Statistical methods carry ethical weight in clinical research, and Step 3 tests recognition of these issues:

Pre-registration of analysis plans:

— Ethically, RCTs should pre-register hypotheses and primary outcomes (ClinicalTrials.gov)

— Post-hoc analyses cherry-picked from non-significant primary outcomes are misleading and threaten patient safety when they drive practice change

P-hacking and HARKing:

— Running multiple ANOVAs and selectively reporting significant ones inflates false-positive rates

— Constitutes research misconduct when intentional; misleads downstream clinicians

Informed consent in trials with multiple arms:

— Patients enrolling in 4-arm RCTs comparing treatments must understand they may receive placebo or older therapy

— Equipoise must exist among arms; if one is clearly superior pre-trial, randomization is unethical

Data and Safety Monitoring Boards (DSMBs):

— Pre-specified interim ANOVA analyses with stopping rules (e.g., O'Brien-Fleming boundaries) protect patients if early evidence shows harm or overwhelming benefit

— Unplanned peeks at data inflate type I error

Subgroup analyses (interaction in two-way ANOVA):

— Should be pre-specified; otherwise risk spurious findings that could harm specific populations if practice changes inappropriately

Reporting transparency:

— Negative and non-significant ANOVA results must be reported to avoid publication bias

— Selective reporting harms patient safety by skewing evidence base

Conflicts of interest:

— Industry-sponsored studies must disclose funding; analytic choices (e.g., post-hoc test selection, exclusion of outliers) should be transparent

Transition-of-care application:

— When new evidence from an RCT (analyzed via ANOVA) changes guideline-recommended therapy, ensure outpatient follow-up systems convey changes to discharging providers; failure to update care plans post-discharge is a recognized patient safety gap

Step 3 management: If a stem describes researchers performing unplanned post-hoc subgroup ANOVAs after a non-significant primary outcome and concluding efficacy in a subgroup, the correct critique is that post-hoc findings are hypothesis-generating, not confirmatory, and should not drive clinical practice without prospective replication. This is a direct patient safety concern.

High-Yield Associations and Rapid-Fire Clinical Facts

Board pearl: The single highest-yield trigger phrase for one-way ANOVA on Step 3 is "compared mean [continuous variable] across [3 or more] groups." For repeated-measures, it's "same patients measured at multiple timepoints." Pattern-match these and 80% of ANOVA stems are answered.

ANOVA = compares ≥3 group means on continuous outcome

F-statistic = MS_between / MS_within; large F → reject H₀

Significant omnibus F → need post-hoc test to localize difference

Tukey HSD = most common post-hoc for all pairwise comparisons

Bonferroni = conservative; α/m; best for a few pre-planned comparisons

Dunnett's = treatments vs single control

Scheffé = complex contrasts; most flexible, lowest power

One-way ANOVA = 1 factor, independent groups

Two-way ANOVA = 2 factors; tests main effects + interaction

Repeated-measures ANOVA = same subjects, ≥3 timepoints/conditions

Mixed-design ANOVA = between + within factors combined

ANCOVA = ANOVA + continuous covariate adjustment

MANOVA = multiple correlated continuous outcomes

Sphericity = repeated-measures assumption; Mauchly's test screens

Greenhouse-Geisser / Huynh-Feldt = corrections when sphericity violated

Kruskal-Wallis = non-parametric one-way analog (ordinal/skewed data)

Friedman test = non-parametric repeated-measures analog

Welch's ANOVA = robust to unequal variances

Levene's test = checks homogeneity of variance

Family-wise type I error: multiple t-tests inflate it to ~26% with 6 comparisons at α = 0.05 (1 − 0.95⁶ ≈ 0.26, assuming independent tests)

Effect sizes: η² small 0.01 / medium 0.06 / large 0.14

Cohen's f: small 0.10 / medium 0.25 / large 0.40

ANOVA is a special case of linear regression with categorical predictors

F = t² when comparing 2 groups

Independence violations → use mixed-effects models

Continuous outcome required; categorical outcome → chi-square instead

Time-to-event → log-rank, not ANOVA

df_between = k − 1; df_within = N − k

Larger N → more power but doesn't fix design flaws

Statistical significance ≠ clinical significance — always check effect size + CI
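Several of the facts above (F = MS_between / MS_within, df_between = k − 1, df_within = N − k, η², Cohen's f, and F = t² for two groups) can be checked by hand on a toy data set. The sketch below is illustrative only: the HbA1c-like values are made up, the function names `one_way_anova` and `pooled_t` are hypothetical helpers, and the conversion f = √(η² / (1 − η²)) is the standard relationship between the two effect sizes rather than something stated in these notes. It computes no p-value, since that needs the F distribution.

```python
from math import sqrt
from statistics import mean, variance


def one_way_anova(groups):
    """Hand-computed one-way ANOVA for independent groups (no p-value)."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # N, total observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: each group mean vs. the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: each observation vs. its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)

    eta_sq = ss_between / (ss_between + ss_within)   # share of variance explained
    cohens_f = sqrt(eta_sq / (1 - eta_sq))           # standard conversion from eta squared
    return {"F": f_stat, "df": (df_between, df_within),
            "eta_sq": eta_sq, "cohens_f": cohens_f}


def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


# Made-up values for three hypothetical treatment arms
arm_a = [6.0, 7.0, 8.0]
arm_b = [7.0, 8.0, 9.0]
arm_c = [10.0, 11.0, 12.0]

three_arm = one_way_anova([arm_a, arm_b, arm_c])
print(three_arm)   # F is about 13 on (2, 6) degrees of freedom

# With only two groups, the ANOVA F equals the squared pooled t statistic
two_arm = one_way_anova([arm_a, arm_b])
print(two_arm["F"], pooled_t(arm_a, arm_b) ** 2)
```

The toy groups are deliberately tiny so every sum of squares can be verified on paper; with real data, a statistics package would also report the p-value and check assumptions.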

Board Question Stem Patterns

Pattern 1 — "What is the most appropriate statistical test?"

— Stem: RCT, 4 antihypertensive arms, primary outcome = mean SBP change at 12 weeks

— Answer: One-way ANOVA

— Distractors: t-test (only 2 groups), chi-square (categorical), Kruskal-Wallis (non-parametric — only if outcome is ordinal/non-normal)

Pattern 2 — Repeated measurements

— Stem: Same 50 asthmatics' FEV1 measured at baseline, week 4, week 8, week 12 on new bronchodilator

— Answer: Repeated-measures ANOVA

— Distractor: one-way ANOVA (wrong — violates independence)

Pattern 3 — "What's wrong with this analysis?"

— Stem: Researchers ran 6 t-tests comparing 4 treatment arms pairwise

— Answer: Inflated family-wise type I error; should use one-way ANOVA with post-hoc correction

Pattern 4 — Interpreting significant omnibus F

— Stem: F(3, 96) = 5.21, p = 0.002; researcher concludes drug A is best

— Answer: Conclusion premature; omnibus only indicates ≥1 group differs; need post-hoc (Tukey) to identify which

Pattern 5 — Two-factor design with interaction

— Stem: Does drug efficacy differ by sex?

— Answer: Two-way ANOVA with interaction term

Pattern 6 — Non-normal/ordinal outcome

— Stem: Pain scores (1–10) across 3 analgesic groups, n=15 each, skewed distribution

— Answer: Kruskal-Wallis, not ANOVA

Pattern 7 — Sphericity violation

— Stem: Mauchly's test p = 0.01 in repeated-measures study

— Answer: Apply Greenhouse-Geisser correction

Pattern 8 — Effect size vs p-value

— Stem: p = 0.04, η² = 0.01 with N = 5000

— Answer: Statistically significant but clinically trivial; minimal variance explained

Pattern 9 — Cluster-randomized design

— Stem: 12 clinics randomized to 3 interventions; analyzed with one-way ANOVA on patient outcomes

— Answer: Ignores within-cluster correlation; use mixed-effects model

Step 3 management: Read every biostats stem with a checklist: (1) outcome type, (2) number of groups, (3) independent vs paired, (4) assumptions verified, (5) appropriate post-hoc. This 5-step scan resolves nearly all ANOVA stems.
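The arithmetic behind Pattern 3 and Pattern 4 is short enough to sketch. The snippet below assumes the pairwise tests are independent (the usual simplification behind the ~26% figure) and uses hypothetical helper names `familywise_error` and `decode_df`. It shows why 4 arms give 6 pairwise comparisons, how a Bonferroni threshold of α/m pulls the family-wise rate back under 0.05, and how the reported F(3, 96) implies 4 groups and 100 subjects.

```python
from math import comb


def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def decode_df(df_between, df_within):
    """Recover (k, N) from df_between = k - 1 and df_within = N - k."""
    k = df_between + 1
    n_total = df_within + k
    return k, n_total


alpha = 0.05
arms = 4
n_pairs = comb(arms, 2)                       # all pairwise comparisons among 4 arms

uncorrected = familywise_error(alpha, n_pairs)
bonferroni_alpha = alpha / n_pairs            # per-test threshold, alpha / m
corrected = familywise_error(bonferroni_alpha, n_pairs)

print(n_pairs)                 # 6
print(round(uncorrected, 3))   # 0.265
print(corrected < alpha)       # True
print(decode_df(3, 96))        # (4, 100)
```

The independence assumption is generous here, because pairwise comparisons among the same arms share data; the point of the sketch is only the direction and rough size of the inflation.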

One-Line Recap

ANOVA compares means of ≥3 groups on a continuous outcome via the F-ratio of between-group to within-group variance; choose one-way for independent groups, repeated-measures for the same subjects measured over time, and always follow a significant omnibus F with a post-hoc test (Tukey, Bonferroni, Dunnett) while verifying assumptions of normality, homogeneity of variance, and — for repeated measures — sphericity.

High-yield recap bullets:

— One-way ANOVA = 1 factor, ≥3 independent groups, continuous outcome → F-test → if significant, Tukey HSD post-hoc

— Repeated-measures ANOVA = same subjects, ≥3 timepoints/conditions; check sphericity (Mauchly's), correct with Greenhouse-Geisser if violated

— Non-parametric analogs: Kruskal-Wallis (one-way), Friedman (repeated-measures) — use when data are ordinal or markedly skewed, particularly with small N

— Avoid multiple t-tests across ≥3 groups (inflates family-wise type I error to ~26% with 6 comparisons); ANOVA preserves overall α at 0.05

— Always pair p-value with effect size (η², Cohen's f) and confidence intervals — statistical significance ≠ clinical significance

— Escalate to two-way ANOVA for interactions, ANCOVA for covariate adjustment, MANOVA for multiple correlated outcomes, mixed-effects models for clustered or missing-data longitudinal designs

Board pearl: The Step 3 ANOVA question reduces to three structural questions: (1) Is the outcome continuous? (2) How many groups? (3) Independent or repeated? Answer those and you've answered the stem — pattern recognition trumps computation every time on test day.
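The three structural questions in the board pearl (outcome type, number of groups, independent vs repeated) can be written as a small decision function. This is a study-aid sketch, not a validated algorithm: the function name `choose_test`, its argument names, and the `parametric_ok` flag are all hypothetical, and it covers only the tests named in these notes plus the paired t-test for two repeated measurements.

```python
def choose_test(outcome, n_groups, repeated, parametric_ok=True):
    """Map stem structure to a test name, following the recap in these notes.

    outcome: "continuous", "categorical", or "time-to-event"
    n_groups: number of groups, timepoints, or conditions being compared
    repeated: True if the same subjects are measured in every condition
    parametric_ok: False for ordinal or markedly skewed data
    """
    if outcome == "categorical":
        return "chi-square"
    if outcome == "time-to-event":
        return "log-rank"
    if outcome != "continuous":
        raise ValueError(f"unknown outcome type: {outcome!r}")
    if n_groups < 2:
        raise ValueError("need at least 2 groups or timepoints to compare")

    if n_groups == 2:
        if not parametric_ok:
            # Non-parametric two-group tests are outside the scope of this sketch
            raise ValueError("two-group non-parametric case not covered here")
        return "paired t-test" if repeated else "t-test"

    # Three or more groups, timepoints, or conditions
    if repeated:
        return "repeated-measures ANOVA" if parametric_ok else "Friedman"
    return "one-way ANOVA" if parametric_ok else "Kruskal-Wallis"


# Stems from the pattern list, expressed as function calls
print(choose_test("continuous", 4, repeated=False))                       # one-way ANOVA
print(choose_test("continuous", 4, repeated=True))                        # repeated-measures ANOVA
print(choose_test("continuous", 3, repeated=False, parametric_ok=False))  # Kruskal-Wallis
```

Two-factor, covariate-adjusted, and clustered designs (two-way ANOVA, ANCOVA, mixed-effects models) are deliberately left out, since they depend on design details beyond these three questions.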