Biostatistics & Population Health
Parametric vs nonparametric tests: when to use each
— Continuous outcome (BP, LDL, HbA1c, FEV1, length of stay if log-transformed)
— Sample size reasonably large (n ≥ 30 per group invokes the Central Limit Theorem, making the sampling distribution of the mean approximately normal even if raw data are not)
— Approximately symmetric distribution, few outliers, similar variances between groups
— Ordinal outcome (Likert pain scale, NYHA class, tumor stage, Apgar)
— Small sample (n < 15–30 per group) with skewed or unknown distribution
— Heavy outliers, censored data, or strongly skewed continuous variables (e.g., length of stay, hospital charges, cytokine levels)
— Data clearly non-normal on Shapiro–Wilk or visible right/left skew
— Parametric tests are more statistically powerful when assumptions are met (smaller n to detect the same effect)
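— Nonparametric tests are more robust but give up some efficiency when normality holds: about 5% for Mann–Whitney vs the t-test (asymptotic relative efficiency 3/π ≈ 0.955), with ~5–15% power loss quoted as a general rule of thumb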
Board pearl: A trial reporting median (IQR) instead of mean (SD) is signaling skewed data — expect a nonparametric test (Mann–Whitney/Wilcoxon) in the analysis. Conversely, mean ± SD with a normal histogram points to a t-test or ANOVA. Recognizing this descriptive-statistic clue often answers the test-selection question on its own.

— Continuous (height, BP, glucose, ejection fraction): candidates for t-test, ANOVA, Pearson r, linear regression — OR their nonparametric counterparts if assumptions fail
— Ordinal (pain 0–10, Glasgow Coma Scale, satisfaction Likert): default to nonparametric
— Nominal/categorical (alive/dead, MI yes/no, blood type): use chi-square or Fisher exact — these are categorical tests, NOT in the parametric/nonparametric dichotomy
— 1 sample vs known value: one-sample t-test (parametric) or one-sample Wilcoxon signed-rank
— 2 independent groups: independent t-test vs Mann–Whitney U (Wilcoxon rank-sum)
— 2 paired/matched: paired t-test vs Wilcoxon signed-rank
— ≥3 independent groups: one-way ANOVA vs Kruskal–Wallis
— ≥3 paired/repeated: repeated-measures ANOVA vs Friedman
— Look for vignette phrases: "skewed," "non-normally distributed," "median reported," "small pilot study," "highly variable" → nonparametric
— "Approximately normal," "n = 200," "mean ± SD" → parametric
Key distinction: Paired vs independent changes the test even within the same family. Paired = same subjects measured twice (pre/post drug, right vs left eye, twin studies, matched case-control). Independent = different subjects in each group. Missing this turns a paired t-test question into a wrong-answer independent t-test selection — a classic Step 3 biostats trap. Always scan the stem for "before and after," "same patients," or matched-pair language before choosing.
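
The mapping above (outcome type, number of groups, pairing, assumptions) can be sketched as a small lookup function. This is only an illustration of the decision logic in these notes; the function name, argument names, and string labels are hypothetical, and routing paired binary data to McNemar is an added assumption taken from the categorical-test notes rather than from this list.

```python
def select_test(outcome, n_groups, paired=False, assumptions_met=True):
    """Suggest a test from the questions used in these notes.

    outcome:         "continuous", "ordinal", or "binary" (hypothetical labels)
    n_groups:        1, 2, or any number >= 3
    paired:          True when the same or matched subjects appear in every group
    assumptions_met: True when a continuous outcome looks roughly normal
    """
    if outcome == "binary":
        # Categorical outcomes sit outside the parametric/nonparametric split.
        return "McNemar" if paired else "chi-square or Fisher exact"

    # Ordinal outcomes default to rank-based tests.
    parametric = outcome == "continuous" and assumptions_met

    if n_groups == 1:
        return "one-sample t-test" if parametric else "one-sample Wilcoxon signed-rank"
    if n_groups == 2:
        if paired:
            return "paired t-test" if parametric else "Wilcoxon signed-rank"
        return "independent t-test" if parametric else "Mann-Whitney U"
    if paired:
        return "repeated-measures ANOVA" if parametric else "Friedman"
    return "one-way ANOVA" if parametric else "Kruskal-Wallis"


# Pre/post HbA1c in the same patients, roughly normal change scores.
print(select_test("continuous", 2, paired=True))
# Likert satisfaction compared between two clinics.
print(select_test("ordinal", 2))
```

Note how the paired flag changes the answer within the same family, which is exactly the trap described above.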

— Histogram — look for bell shape vs skew, bimodality, or floor/ceiling effects
— Q-Q plot — points hugging the diagonal line = normal; systematic curvature = non-normal
— Box plot — long whiskers, extreme outliers, or asymmetric box (median not centered) flag skew
— Shapiro–Wilk (preferred for n < 50, sometimes up to n = 2000)
— Kolmogorov–Smirnov (with Lilliefors correction)
— A non-significant p-value (p > 0.05) means you fail to reject normality → parametric is reasonable, though with small n the test has little power to detect non-normality
— Caveat: with very large n, even trivial deviations become "significant" — rely on plots too
— Levene's test or Bartlett's test
— If unequal variances → use Welch's t-test (an adjusted parametric test) rather than jumping to nonparametric
— Stable (normal, equal variance, no outliers) → parametric
— Unstable (skewed, outliers, heteroscedastic) → transform (log, square-root) and recheck, OR go nonparametric
Board pearl: Log-transformation often "rescues" right-skewed biomarker data (CRP, troponin, viral load, triglycerides, hospital cost) into a near-normal distribution, allowing a t-test on log-values. Reporting geometric means is the giveaway that a log-transform was performed. If transformation fails or the variable is inherently ordinal, commit to nonparametric — don't force a parametric model onto rank data.
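
A minimal sketch of why a geometric mean signals a log-transform: averaging the logged values and exponentiating the result gives the geometric mean. The CRP values are invented for illustration.

```python
import math
import statistics

# Hypothetical right-skewed CRP values (mg/L), invented for illustration.
crp = [1.2, 1.8, 2.5, 3.1, 4.0, 6.5, 9.0, 48.0]

# Analysis would be run on the log scale.
log_crp = [math.log(x) for x in crp]

arithmetic_mean = statistics.mean(crp)
median = statistics.median(crp)
# Back-transforming the mean of the logs yields the geometric mean.
geometric_mean = math.exp(statistics.mean(log_crp))

print(f"arithmetic mean: {arithmetic_mean:.2f}")
print(f"median:          {median:.2f}")
print(f"geometric mean:  {geometric_mean:.2f}")
```

With right-skewed data the geometric mean falls below the arithmetic mean because the single large value is pulled in on the log scale.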

— Parametric: Independent (two-sample) t-test — compares means; assumes normality + equal variance
— Nonparametric: Mann–Whitney U test (a.k.a. Wilcoxon rank-sum) — compares distributions/medians via ranks
— Parametric: Paired t-test — analyzes mean of within-subject differences
— Nonparametric: Wilcoxon signed-rank test — ranks the absolute differences and sums signed ranks
— Parametric: One-way ANOVA (F-test), followed by post-hoc Tukey/Bonferroni if significant
— Nonparametric: Kruskal–Wallis H test, followed by Dunn's post-hoc
— Parametric: Repeated-measures ANOVA
— Nonparametric: Friedman test
— Parametric: Pearson correlation (r) — assumes linear, normal, no extreme outliers
— Nonparametric: Spearman rank correlation (ρ) — uses ranks; works with ordinal data or monotonic nonlinear relationships
— Parametric continuous outcome: linear regression
— Robust alternatives: quantile (median) regression, or transform the outcome
Step 3 management: When a vignette reports "satisfaction scores were compared between two clinics" using a Likert scale — the correct test is Mann–Whitney U, not a t-test, because Likert is ordinal. Recognize ordinal scales (NYHA, ASA class, mRS, Apgar, pain VAS) as automatic nonparametric triggers regardless of sample size.

— Chi-square test of independence — 2×2 or larger contingency tables; requires expected cell counts ≥ 5
— Fisher exact test — preferred when any expected cell count < 5 or sample is small (e.g., rare adverse events)
— McNemar test — the "paired chi-square" for matched binary data (e.g., same patients before/after intervention, sensitivity comparison between two diagnostic tests on the same patients)
— Cochran–Mantel–Haenszel — stratified analysis controlling for a confounder
— Log-rank test — compares Kaplan–Meier curves between groups (nonparametric)
— Cox proportional hazards regression — semiparametric; yields hazard ratios
— Welch's t-test — relaxes equal-variance assumption; default in many modern statistical packages
— Bootstrap/permutation tests — resampling-based; distribution-free, increasingly used in modern trials
— Generalized estimating equations (GEE) or mixed models — clustered/longitudinal data
— Bonferroni (conservative), Holm, Tukey HSD (after ANOVA), Dunn (after Kruskal–Wallis), Benjamini–Hochberg (false discovery rate, for genomics/screening)
Key distinction: McNemar vs chi-square — McNemar is for paired binary data (the same patient classified twice), while chi-square independence is for independent binary observations. A study comparing pre- and post-intervention smoking rates in the same cohort uses McNemar; comparing smoking rates between two different towns uses chi-square. Mixing these up is a classic Step 3 distractor.
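
As a sketch of the paired case, McNemar's statistic uses only the discordant pairs (patients whose classification changed). The version below omits the continuity correction and assumes enough discordant pairs for the chi-square approximation; the counts and the function name are hypothetical.

```python
import math


def mcnemar(b, c):
    """McNemar chi-square (1 df, no continuity correction).

    b: pairs positive before and negative after
    c: pairs negative before and positive after
    Concordant pairs do not enter the statistic.
    """
    statistic = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical cohort: 25 smokers quit after the intervention, 10 non-smokers started.
stat, p = mcnemar(b=25, c=10)
print(f"McNemar chi-square = {stat:.2f}, p = {p:.3f}")
```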

— Step 1: What is the outcome variable type?
— Step 2: How many groups, and are they paired?
— Step 3: Are parametric assumptions met?
— Using a t-test on heavily skewed small-sample data distorts the Type I error rate (often inflating false positives) because the test statistic's reference distribution is no longer valid
— Using nonparametric when parametric is appropriate sacrifices power → higher Type II error (false negatives, missed effects) and inflated sample-size requirements
— Nonparametric tests typically require ~5–15% more subjects for equivalent power when normality actually holds
— This matters in NIH grant review and FDA submission contexts that occasionally appear in Step 3 research vignettes
Board pearl: When a study uses a pilot design with n = 10 per arm and reports a t-test p-value, suspect inappropriate methodology — small samples cannot reliably establish normality, so Mann–Whitney U is the safer choice. Examiners love to flag this with a follow-up question about which test the investigators should have used.

— Null hypothesis: μ₁ = μ₂ (means equal)
— Test statistic: t = (x̄₁ − x̄₂) / SE of difference
— Degrees of freedom = n₁ + n₂ − 2 (Student) or the Welch–Satterthwaite approximation (Welch)
— Assumptions: independent samples, approximately normal in each group, equal variances (relaxed in Welch's)
— Output: mean difference + 95% CI + p-value
— Clinical example: comparing mean LDL reduction between atorvastatin 40 mg vs rosuvastatin 20 mg in a 6-month RCT with n = 200/arm
— Null hypothesis: the two distributions are identical (often interpreted as equal medians if shapes are similar)
— Procedure: pool all observations, rank them, sum ranks in each group, compute U statistic
— Assumptions: independent samples, ordinal or continuous data, similar distribution shapes (for median interpretation)
— Output: Hodges–Lehmann estimate of the location shift (the median of all pairwise between-group differences, not the difference of the two medians) + 95% CI + p-value
— Clinical example: comparing pain scores (0–10 VAS) between gabapentin vs placebo in postherpetic neuralgia; or comparing length of stay (highly right-skewed) between two surgical techniques in a small cohort
— Paired t-test analyzes the mean of within-pair differences
— Wilcoxon signed-rank ranks absolute differences and applies the sign — useful for pre/post designs with skewed change scores
Step 3 management: A vignette comparing HbA1c before and after initiating an SGLT2 inhibitor in 50 patients with approximately normal change scores uses the paired t-test. The same design in 12 patients with skewed change scores uses the Wilcoxon signed-rank test. Pairing + small n + skew → signed-rank, every time.
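
The two procedures described above can be sketched with the standard library alone. The first function computes the Welch t statistic and Welch–Satterthwaite degrees of freedom; the second follows the pool, rank, sum steps to get U. Function names and data are hypothetical, and in practice a statistics package would also supply the p-values.

```python
import math
import statistics


def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    se2_x = statistics.variance(x) / len(x)  # squared standard error, group x
    se2_y = statistics.variance(y) / len(y)  # squared standard error, group y
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2_x + se2_y)
    df = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (len(x) - 1) + se2_y ** 2 / (len(y) - 1)
    )
    return t, df


def midranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Pool, rank, sum the ranks of x, then convert the rank sum to U."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical LDL reductions (mg/dL) in two arms.
t, df = welch_t([38, 42, 45, 50, 55], [30, 33, 37, 41, 44])
# Hypothetical pain scores (0-10) on placebo vs gabapentin.
u_placebo, u_gabapentin = mann_whitney_u([3, 5, 6, 8], [1, 2, 4, 7])
print(f"Welch t = {t:.2f}, df = {df:.1f}")
print(f"U = {u_placebo}, {u_gabapentin}")
```

U for a group equals the number of between-group pairs in which that group's value is larger (ties count one half), which is why the test is insensitive to how extreme any single value is.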

— Tests whether at least one group mean differs
— F statistic = between-group variance / within-group variance
— Significant ANOVA requires post-hoc pairwise testing (Tukey HSD, Bonferroni-corrected t-tests)
— Assumptions: normality within each group, homoscedasticity (Levene's), independence
— Example: comparing mean systolic BP across three antihypertensive arms (ACEi vs ARB vs thiazide)
— Rank-based analog of one-way ANOVA
— Post-hoc: Dunn's test with Bonferroni adjustment
— Example: comparing NYHA class distributions across three heart-failure treatment arms
— Repeated-measures ANOVA: same patients measured at 3+ timepoints (baseline, 3 mo, 6 mo); requires sphericity (Mauchly's test) — if violated, apply Greenhouse–Geisser correction
— Friedman: ranks within each subject across timepoints; use for ordinal or non-normal repeated measures
— Pearson r — linear association between two continuous, normally distributed variables; ranges −1 to +1; r² = proportion of variance explained
— Spearman ρ — rank-based; captures monotonic (not just linear) relationships; robust to outliers; appropriate for ordinal data
— Example: BMI vs HbA1c → Pearson (if normal); tumor stage (I–IV) vs survival months → Spearman
Board pearl: Significant ANOVA without post-hoc is incomplete — ANOVA tells you a difference exists somewhere but not where. If the answer choices include "ANOVA followed by Tukey HSD," that combination is usually correct over a bare "ANOVA." Performing multiple t-tests instead of ANOVA inflates Type I error (family-wise α) — a frequent wrong-answer trap.
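
A minimal sketch of the F statistic as the ratio of between-group to within-group variance, using mean squares. The blood pressure values are invented and the function name is hypothetical; the p-value and post-hoc comparisons are left to a statistics package.

```python
import statistics


def one_way_anova_f(groups):
    """Return F and its degrees of freedom (between, within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, (k - 1, n - k)


# Hypothetical systolic BP (mmHg) in three antihypertensive arms.
acei = [131, 132, 133]
arb = [132, 133, 134]
thiazide = [135, 136, 137]

f_stat, dfs = one_way_anova_f([acei, arb, thiazide])
print(f"F = {f_stat:.2f} on {dfs} degrees of freedom")
```

A large F says only that the group means are not all equal; it does not identify which arm differs, which is why the post-hoc step is needed.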

— Central Limit Theorem no longer guarantees normal sampling distribution of the mean
— Must verify normality of raw data via Shapiro–Wilk + Q-Q plot
— If normality fails or cannot be assessed reliably → use nonparametric
— Exact versions of Mann–Whitney, Wilcoxon signed-rank, and Fisher exact are designed for small n
— Hospital length of stay (right-skewed, floor at 0)
— Healthcare costs/charges (extreme right tail)
— Triglycerides, CRP, hs-troponin, D-dimer, viral loads, cytokine levels
— Time-to-event data (use survival methods, not t-tests)
— Reaction times, wait times in ED throughput studies
— Outliers act like comorbidities — they exert leverage on means but not on medians/ranks
— Nonparametric tests are inherently outlier-resistant because they use ranks, not raw values
— A single extreme value can flip a t-test from significant to nonsignificant; the Mann–Whitney result barely budges
— NYHA, CCS angina class, mRS (modified Rankin), ECOG performance status, ASA physical status, Apgar, GCS, Likert satisfaction, VAS pain (debated — often treated as continuous if large n)
Step 3 management: When asked the best summary statistic for hospital length of stay in a quality-improvement project, choose median (IQR) over mean (SD). Reporting a mean LOS of "7.4 days" when one patient stayed 180 days misleads stakeholders; the median (e.g., 4 days) reflects typical patient experience and matches the appropriate nonparametric analytical approach.
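
A short sketch of how one extreme stay drags the mean but not the median. The length-of-stay values are invented for illustration.

```python
import statistics

# Hypothetical length of stay in days; one patient stayed 180 days.
los = [2, 3, 3, 4, 4, 4, 5, 5, 6, 180]

mean_los = statistics.mean(los)
median_los = statistics.median(los)
q1, _, q3 = statistics.quantiles(los, n=4)  # quartile cut points

# Same cohort without the outlier, for comparison.
mean_without = statistics.mean(los[:-1])
median_without = statistics.median(los[:-1])

print(f"mean = {mean_los:.1f}, median = {median_los} (IQR {q1}-{q3})")
print(f"without outlier: mean = {mean_without}, median = {median_without}")
```

Dropping the single long stay changes the mean several-fold while the median stays the same, which is the argument for reporting median (IQR).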

— Small available populations → routinely small n → nonparametric tests are often the default
— Examples: pediatric oncology trials (n = 20–40), orphan-drug studies, single-center surgical series
— Bayesian methods are increasingly used alongside nonparametric frequentist tests in these contexts
— Designed to estimate variability, refine procedures, not to confirm efficacy
— n typically 10–30 — too small for reliable normality testing
— Default to nonparametric for hypothesis-generating analyses; report effect sizes + CIs rather than emphasizing p-values
— N-of-1 trials (often used in pediatric ADHD, refractory epilepsy): paired analyses, usually Wilcoxon signed-rank
— Twin studies, matched case-control: McNemar (binary), Wilcoxon signed-rank (continuous/ordinal)
— Highly regulated, often small datasets — nonparametric and exact tests dominate
— Time-to-event endpoints (time to conception, gestational age at delivery) → survival analysis with log-rank, not t-test
— Observations within a cluster are not independent → standard t-test/ANOVA violates the independence assumption
— Use mixed-effects models or GEE; ignoring clustering inflates Type I error
Key distinction: A pilot study with n = 15 comparing a new pediatric anti-epileptic to placebo, reporting median seizure frequency, almost certainly used Mann–Whitney U (or Wilcoxon signed-rank if crossover). If the answer choices include a two-sample t-test, that is the distractor — small pediatric samples with count/skewed outcomes belong to the nonparametric world.
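
For very small pilot samples, an exact rank-sum p-value can be obtained by enumerating every way the ranks could have been split between the groups. The sketch below assumes no tied values and uses invented seizure counts; the function name is hypothetical.

```python
from itertools import combinations


def exact_rank_sum_p(x, y):
    """Two-sided exact p-value for the rank-sum test (assumes no ties)."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # valid only without ties
    n_x, n = len(x), len(pooled)

    observed = sum(rank[v] for v in x)
    expected = n_x * (n + 1) / 2  # rank sum expected under the null

    extreme = total = 0
    # Every subset of ranks of size n_x is equally likely under the null.
    for combo in combinations(range(1, n + 1), n_x):
        total += 1
        if abs(sum(combo) - expected) >= abs(observed - expected):
            extreme += 1
    return extreme / total


# Hypothetical monthly seizure counts: new drug vs placebo, 4 children per arm.
drug = [2, 3, 5, 6]
placebo = [9, 11, 14, 20]
p_exact = exact_rank_sum_p(drug, placebo)
print(f"exact two-sided p = {p_exact:.4f}")
```

Even with complete separation of the two arms, four per group cannot produce a p-value below 2/70, which illustrates how little a tiny pilot can establish.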

— Applying a t-test to severely skewed, small-sample data → actual α may exceed nominal 0.05
— Running multiple pairwise t-tests instead of ANOVA → family-wise α balloons (e.g., 5 comparisons → ~23% chance of at least one false positive)
— Failing to correct for multiple comparisons in genomic, biomarker, or subgroup analyses
— Using a nonparametric test when parametric assumptions are met loses 5–15% power
— Underpowered studies (small n, large variance) miss real differences
— Dichotomizing a continuous variable (e.g., BP into "high/normal") wastes information and lowers power
— Mann–Whitney U does not strictly compare medians — it compares the probability that a random value from group 1 exceeds one from group 2 (stochastic dominance)
— Median comparison interpretation requires similar distribution shapes
— Reporting "mean rank" without effect size is uninformative
— Treating paired data as independent ignores the within-pair correlation: the standard error is wrong, and power is usually lost
— Always check for repeated measurements, twins, matched pairs, or pre/post designs
— Heteroscedasticity → use Welch's correction or transform
— Severe outliers → investigate (data entry error?), transform, or use robust/nonparametric methods
— Non-independent observations (clustered/longitudinal) → use mixed models or GEE
Board pearl: A study comparing 5 treatment arms using 10 pairwise t-tests rather than ANOVA with post-hoc has a family-wise Type I error of ~40% (1 − 0.95¹⁰). The Step 3 examiner expects you to flag this as inflated false-positive risk and recommend either ANOVA + Tukey HSD or Bonferroni adjustment (α/10 = 0.005 per test).
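
The arithmetic in this pearl can be checked directly. The sketch treats the comparisons as independent, which is the simplification behind the 1 − 0.95¹⁰ figure; the function name is hypothetical.

```python
from math import comb


def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


arms = 5
pairwise = comb(arms, 2)             # number of pairwise comparisons
fwer = familywise_error(pairwise)    # family-wise Type I error
bonferroni_alpha = 0.05 / pairwise   # per-test threshold after Bonferroni

print(f"{pairwise} comparisons, family-wise error = {fwer:.3f}")
print(f"Bonferroni per-test alpha = {bonferroni_alpha}")
print(f"5 comparisons, family-wise error = {familywise_error(5):.3f}")
```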

— Multiple predictors → multivariable regression (linear, logistic, Cox)
— Clustered or longitudinal data → mixed-effects models, GEE
— Complex survival analysis with competing risks
— Adaptive trial designs, Bayesian frameworks, interim analyses
— Missing data > 5–10% → multiple imputation rather than complete-case analysis
— Propensity score matching for observational comparative effectiveness
— Single primary outcome, two groups, randomized → t-test or Mann–Whitney
— Pre/post within-subject change → paired t or Wilcoxon signed-rank
— Categorical 2×2 outcome → chi-square or Fisher exact
— Confounders that require adjustment → regression
— Repeated measurements over time → mixed model or repeated-measures ANOVA
— Time-to-event with censoring → Kaplan–Meier + log-rank, Cox
— Diagnostic test evaluation → ROC, sensitivity/specificity, AUC comparison (DeLong test)
— CONSORT for RCTs, STROBE for observational, PRISMA for systematic reviews — all require explicit statement of statistical methods, assumptions checked, and handling of missing data
— Pre-registration of analysis plan (clinicaltrials.gov, OSF) prevents p-hacking and selective reporting
Step 3 management: When the vignette describes a multicenter RCT with adjustment for baseline characteristics and a time-to-event primary endpoint, the correct analytical approach is Cox proportional hazards regression, not a t-test or Mann–Whitney U. Recognizing when you've moved beyond two-group bivariate comparison into multivariable territory is itself testable. Don't force a simple test onto a complex design.

— Mann–Whitney U (Wilcoxon rank-sum): two independent groups, ordinal or skewed continuous outcome
— Example: pain VAS in two parallel arms of an RCT
— Wilcoxon signed-rank: one sample vs reference value, OR two paired/matched measurements
— Example: pre/post change in symptom score in the same patients
— Kruskal–Wallis: ≥3 independent groups
— Example: tumor response (CR/PR/SD/PD as ordinal) across four chemotherapy regimens
— Friedman: ≥3 paired/repeated measurements in the same subjects
— Example: pain score at baseline, 1 month, 3 months, 6 months in the same cohort
— Spearman ρ: monotonic association between two ordinal or non-normal continuous variables
— Example: tumor grade (I–IV) vs survival in months
— Log-rank: time-to-event outcome between groups (Kaplan–Meier survival curves)
— Sign test: cruder paired alternative when even the magnitude of differences cannot be assumed meaningful; only direction matters
Key distinction: Wilcoxon rank-sum (= Mann–Whitney U) is for independent groups, while Wilcoxon signed-rank is for paired data. They share the Wilcoxon name and are constantly confused. Mnemonic: "Rank-Sum = Separate samples; Signed-Rank = Same subjects." Picking signed-rank for two independent groups (or vice versa) is one of the most common biostats errors on Step 3.
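
To make the contrast concrete, here is a minimal sketch of the signed-rank statistic: it works on within-subject differences, ranks their absolute values, and sums the ranks by sign. It assumes no tied absolute differences and drops zero differences; the scores and the function name are hypothetical.

```python
def signed_rank(before, after):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    # One difference per subject; zero differences carry no sign and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    # Rank by absolute size (assumes no ties among the absolute differences).
    ordered = sorted(diffs, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    w_minus = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)
    return w_plus, w_minus


# Hypothetical symptom scores in the same five patients.
before = [8, 7, 6, 9, 5]
after = [5, 6, 8, 4, 1]

w_plus, w_minus = signed_rank(before, after)
print(f"W+ = {w_plus}, W- = {w_minus}")
```

The input is one list of paired differences, whereas the rank-sum test pools two separate samples before ranking; that structural difference is the "same subjects vs separate samples" distinction.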

— Chi-square / Fisher exact — categorical outcomes (NOT a parametric vs nonparametric choice)
— McNemar — paired binary outcomes
— Log-rank / Cox — time-to-event
— Logistic regression — binary outcome with multiple predictors
Board pearl: When the question asks about comparing proportions (e.g., 30-day mortality 12% vs 18%) between two groups, the answer is chi-square (or Fisher exact if small expected counts), not a t-test. Proportions are categorical summaries — they don't enter the parametric/nonparametric continuous-data dichotomy. Recognizing the outcome as a proportion vs a mean immediately reroutes you to the correct test family.
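
A sketch of the chi-square calculation for the mortality example, assuming 200 patients per group (the group size is an assumption; the notes give only the percentages). It also returns the smallest expected count, which is the quantity that decides between chi-square and Fisher exact.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)] for i in range(2)]

    statistic = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    smallest_expected = min(min(row) for row in expected)
    return statistic, p_value, smallest_expected


# 12% vs 18% 30-day mortality, assuming 200 patients per group: 24 and 36 deaths.
stat, p, smallest = chi_square_2x2(24, 176, 36, 164)
print(f"chi-square = {stat:.2f}, p = {p:.3f}, smallest expected count = {smallest}")
```

If the smallest expected count came out below 5, Fisher exact would be the better choice.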

— Recognize the test used in any abstract or paper you read
— Verify that descriptive statistics (mean/SD vs median/IQR) match the inferential test family
— Check whether assumptions were stated and verified — most rigorous journals now require this
— Outcome variable type and distribution
— Normality assessment (Shapiro–Wilk, Q-Q plots)
— Test choice justified (parametric, nonparametric, regression)
— Multiple comparisons handled (Bonferroni, FDR)
— Missing data strategy
— Sample size/power calculation
— Translate mean differences into clinically meaningful terms (NNT, absolute risk reduction)
— Acknowledge that statistical significance ≠ clinical significance
— A trivial 1 mmHg systolic difference can be "p < 0.001" in a 50,000-patient trial — meaningless clinically
— Use median (IQR) for length-of-stay, charges, wait times
— Use Mann–Whitney for between-clinic comparisons of these skewed outcomes
— Use control charts (SPC) for ongoing monitoring rather than repeated hypothesis tests
— Parametric: Cohen's d, mean difference with 95% CI
— Nonparametric: Hodges–Lehmann median difference, rank-biserial correlation
Step 3 management: A QI project tracking ED door-to-balloon times before/after a process change should report median (IQR) and use a Wilcoxon signed-rank or Mann–Whitney test (depending on paired vs unpaired design), not a t-test. Time-based throughput metrics are nearly always right-skewed, and reporting means misrepresents typical performance to hospital leadership.
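
The nonparametric effect sizes listed above can be sketched for an unpaired before/after comparison (different patients in each period). The Hodges–Lehmann estimate is the median of all pairwise differences, and the rank-biserial correlation is the share of pairs favoring one group minus the share favoring the other. Times and function names are hypothetical.

```python
import statistics


def hodges_lehmann(x, y):
    """Median of all pairwise differences x_i - y_j (two-sample shift estimate)."""
    return statistics.median([a - b for a in x for b in y])


def rank_biserial(x, y):
    """Proportion of pairs with x > y minus proportion with x < y."""
    greater = sum(a > b for a in x for b in y)
    less = sum(a < b for a in x for b in y)
    return (greater - less) / (len(x) * len(y))


# Hypothetical door-to-balloon times (minutes), different patients in each period.
before = [95, 110, 120, 150]
after = [70, 80, 85, 130]

shift = hodges_lehmann(before, after)
effect = rank_biserial(before, after)
diff_of_medians = statistics.median(before) - statistics.median(after)

print(f"Hodges-Lehmann shift = {shift} min")
print(f"difference of medians = {diff_of_medians} min")
print(f"rank-biserial correlation = {effect}")
```

In this example the Hodges–Lehmann estimate and the simple difference of medians are not the same number, so a report should say which one it is quoting.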

— When reading a paper, predict the test before checking — calibrate your intuition
— Track which test types appear in your specialty's literature (cardiology favors Cox regression and Kaplan–Meier; surgery favors chi-square and Mann–Whitney; pharmacology favors ANOVA and mixed models)
— JAMA Guide to Statistics and Methods series
— BMJ Endgames — Statistical Question features (excellent Step 3 prep)
— Coursera/edX biostatistics modules
— R, Python, Stata — even basic familiarity helps interpret published code/methods
— Pre-specified primary outcome and primary analysis
— Interim analyses with stopping boundaries (O'Brien–Fleming, Pocock)
— Data Safety Monitoring Board oversight for ongoing trials
— Intent-to-treat vs per-protocol analyses
— Emphasize that test choice flows from data type, design, and assumptions — not from "which test do I know"
— Encourage early biostatistician consultation in research design, not after data collection
— Stress that transformation is a legitimate first response to skew before abandoning parametric methods
— Re-analyze with the appropriate test and transparently report both
— Many "p < 0.05 by t-test, p = 0.08 by Mann–Whitney" results indicate fragile findings that may not replicate
— Robustness checks (re-running with alternative tests) strengthen the credibility of conclusions
Board pearl: If parametric and nonparametric tests give discordant results on the same dataset, the finding is likely fragile — driven by outliers, skew, or small n. Reporting both analyses with appropriate hedging is the rigorous response; cherry-picking the favorable p-value is p-hacking and a research integrity violation.

— P-hacking — running multiple tests until one reaches p < 0.05 and reporting only that one; violates scientific honesty and biases the literature
— HARKing (Hypothesizing After Results are Known) — presenting post-hoc findings as if they were pre-specified
— Selective outcome reporting — defining primary endpoint after seeing data
— Remedy: pre-registration of trials (ClinicalTrials.gov, mandated by ICMJE for journal publication) and analysis plans
— Subjects must understand what statistical analyses will be performed and how their data contribute
— Genomic and biomarker studies require explicit consent for secondary analyses and data sharing
— IRB approval verifies statistical methods are appropriate to the question being asked
— A trial using the wrong test may report a false-positive efficacy signal → patients exposed to ineffective or harmful therapy
— A trial using an underpowered nonparametric test may miss a true benefit → patients denied effective care
— Example: early underpowered trials of bevacizumab in metastatic breast cancer; accelerated approval later withdrawn after rigorous larger trials
— Industry-sponsored trials with flexible statistical plans require independent review
— Disclose all analyses performed, not only those reaching significance
— When discharging a patient on a new medication, ensure the supporting trial's analysis is valid for your patient (external validity)
— Subgroup analyses from large trials are often underpowered or post-hoc; interpret cautiously before applying to elderly, pediatric, or comorbid patients underrepresented in the original sample
Step 3 management: If a sponsor pressures an investigator to "switch from Mann–Whitney to t-test because the t-test p-value is lower," this is research misconduct. The correct response is to refuse, report the request to the IRB, and document the original pre-specified analysis plan. Patient safety and scientific integrity supersede sponsor preference.

Board pearl: The single most efficient Step 3 biostats heuristic is to read the outcome variable description first, then count the groups and check if paired, then scan for "normal" vs "skewed/median" language. Three reads, three branches, correct test in under 30 seconds.

— "Investigators compared NYHA class between sacubitril/valsartan and enalapril arms..."
— Correct: Mann–Whitney U. Distractor: independent t-test (wrong — NYHA is ordinal)
— "The same 40 patients had HbA1c measured before and after 6 months of semaglutide..."
— Correct: paired t-test if normal change scores; Wilcoxon signed-rank if skewed. Distractor: independent t-test (wrong — same patients)
— "In a pilot study of 12 patients, hospital length of stay was compared..."
— Correct: Mann–Whitney U. Distractor: t-test (wrong — small n, skewed LOS)
— "Mean systolic BP was compared across three antihypertensive arms..."
— Correct: one-way ANOVA followed by Tukey HSD. Distractor: multiple pairwise t-tests (inflates Type I error)
— "30-day mortality (binary) was compared between PCI and CABG..."
— Correct: chi-square (or Fisher exact if small cells). Distractor: t-test (wrong — binary outcome)
— "Sensitivity of two diagnostic tests applied to the same 200 patients..."
— Correct: McNemar test. Distractor: chi-square (wrong — paired)
— "Median survival was compared between two chemotherapy regimens with censoring..."
— Correct: log-rank test (and/or Cox regression for adjusted HR). Distractor: t-test on survival times (wrong — ignores censoring)
— "The association between tumor grade (I–IV) and survival (months) was assessed..."
— Correct: Spearman correlation. Distractor: Pearson r (wrong — grade is ordinal)
Key distinction: Always reread the stem for three keywords: (1) the outcome variable (continuous? ordinal? binary? time-to-event?), (2) the number and pairing of groups, and (3) any distributional clues ("median," "skewed," "small sample"). These three items deterministically select the test in nearly every Step 3 biostats question.

Choose a parametric test (t-test, ANOVA, Pearson) when the outcome is continuous, approximately normally distributed, and sample size supports it; choose the rank-based nonparametric equivalent (Mann–Whitney, Wilcoxon signed-rank, Kruskal–Wallis, Friedman, Spearman) when the outcome is ordinal, skewed, small-sample, or contains outliers — and recognize that categorical and time-to-event data have their own families (chi-square/Fisher/McNemar, log-rank/Cox).
Board pearl: Three-step Step 3 algorithm — (1) identify outcome type, (2) count groups + check pairing, (3) check distribution/sample size — solves nearly every biostats test-selection question in under 30 seconds.

