Biostatistics & Population Health

Parametric vs nonparametric tests: when to use each

Clinical Overview and When to Suspect a Parametric vs Nonparametric Approach

Core concept: Parametric tests assume the underlying data distribution follows a known form (usually normal/Gaussian) and analyze means; nonparametric tests make few or no assumptions about the form of the distribution and typically analyze medians or ranks.

Why Step 3 cares: You will be asked to interpret a journal article, a QI project, or a research vignette and pick the correct statistical test — choosing the wrong family invalidates the p-value.

When to suspect a parametric test is appropriate:

— Continuous outcome (BP, LDL, HbA1c, FEV1, length of stay if log-transformed)

— Sample size reasonably large (n ≥ 30 per group invokes the Central Limit Theorem, making the sampling distribution of the mean approximately normal even if raw data are not)

— Approximately symmetric distribution, few outliers, similar variances between groups

When to suspect nonparametric is appropriate:

— Ordinal outcome (Likert pain scale, NYHA class, tumor stage, Apgar)

— Small sample (n < 15–30 per group) with skewed or unknown distribution

— Heavy outliers, censored data, or strongly skewed continuous variables (e.g., length of stay, hospital charges, cytokine levels)

— Data clearly non-normal on Shapiro–Wilk or visible right/left skew

Tradeoff:

— Parametric tests are more statistically powerful when assumptions are met (smaller n to detect the same effect)

— Nonparametric tests are more robust but lose ~5–15% power vs the equivalent parametric test when normality holds

Board pearl: A trial reporting median (IQR) instead of mean (SD) is signaling skewed data — expect a nonparametric test (Mann–Whitney/Wilcoxon) in the analysis. Conversely, mean ± SD with a normal histogram points to a t-test or ANOVA. Recognizing this descriptive-statistic clue often answers the test-selection question on its own.

Presentation Patterns and Key History — Recognizing Data Types in a Vignette

Step 1: Identify the outcome variable type — this is the single most important decision branch.

— Continuous (height, BP, glucose, ejection fraction): candidates for t-test, ANOVA, Pearson r, linear regression — OR their nonparametric counterparts if assumptions fail

— Ordinal (pain 0–10, Glasgow Coma Scale, satisfaction Likert): default to nonparametric

— Nominal/categorical (alive/dead, MI yes/no, blood type): use chi-square or Fisher exact — these are categorical tests, NOT in the parametric/nonparametric dichotomy

Step 2: Count the groups

— 1 sample vs known value: one-sample t-test (parametric) or one-sample Wilcoxon signed-rank

— 2 independent groups: independent t-test vs Mann–Whitney U (Wilcoxon rank-sum)

— 2 paired/matched: paired t-test vs Wilcoxon signed-rank

— ≥3 independent groups: one-way ANOVA vs Kruskal–Wallis

— ≥3 paired/repeated: repeated-measures ANOVA vs Friedman

Step 3: Check distribution and sample size

— Look for vignette phrases: "skewed," "non-normally distributed," "median reported," "small pilot study," "highly variable" → nonparametric

— "Approximately normal," "n = 200," "mean ± SD" → parametric

Key distinction: Paired vs independent changes the test even within the same family. Paired = same subjects measured twice (pre/post drug, right vs left eye, twin studies, matched case-control). Independent = different subjects in each group. Missing this turns a paired t-test question into a wrong-answer independent t-test selection — a classic Step 3 biostats trap. Always scan the stem for "before and after," "same patients," or matched-pair language before choosing.
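The three-step branch in this section (outcome type, number and pairing of groups, distribution check) can be sketched as a small lookup function. This is only a study aid under stated assumptions: the function name `choose_test`, its argument names, and the returned strings are hypothetical, and the `assumptions_met` flag stands in for whatever normality and variance checks the vignette describes.

```python
def choose_test(outcome, n_groups, paired=False, assumptions_met=False):
    """Return the test suggested by the three-step branch.

    outcome: "continuous", "ordinal", "nominal", or "time-to-event"
    n_groups: number of groups or repeated measurements (1, 2, or >= 3)
    paired: True when the same (or matched) subjects appear in every group
    assumptions_met: True when normality and equal variance look plausible
    """
    # Step 1: outcome type decides the family.
    if outcome == "time-to-event":
        return "log-rank test"
    if outcome == "nominal":
        return "McNemar test" if paired else "chi-square or Fisher exact test"
    if outcome not in ("continuous", "ordinal"):
        raise ValueError(f"unknown outcome type: {outcome}")

    # Step 3 folded in: ordinal outcomes never take the parametric branch.
    parametric = outcome == "continuous" and assumptions_met

    # Step 2: number of groups and pairing pick the test within the family.
    if n_groups == 1:
        return "one-sample t-test" if parametric else "one-sample Wilcoxon signed-rank test"
    if n_groups == 2:
        if paired:
            return "paired t-test" if parametric else "Wilcoxon signed-rank test"
        return "independent t-test" if parametric else "Mann-Whitney U test"
    if n_groups >= 3:
        if paired:
            return "repeated-measures ANOVA" if parametric else "Friedman test"
        return "one-way ANOVA" if parametric else "Kruskal-Wallis test"
    raise ValueError("n_groups must be a positive integer")


# Likert satisfaction scores compared between two clinics.
print(choose_test("ordinal", 2))
# HbA1c before and after a drug in the same patients, normal change scores.
print(choose_test("continuous", 2, paired=True, assumptions_met=True))
```

The function deliberately ignores adjustment for confounders and clustering, which move the question into regression or mixed-model territory.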

Physical Exam Findings — Distributional Diagnostics Before Picking a Test

Before applying any parametric test, you must "examine" the data — analogous to a physical exam.

Inspect the distribution:

— Histogram — look for bell shape vs skew, bimodality, or floor/ceiling effects

— Q-Q plot — points hugging the diagonal line = normal; systematic curvature = non-normal

— Box plot — long whiskers, extreme outliers, or asymmetric box (median not centered) flag skew

Formal normality tests:

— Shapiro–Wilk (preferred for n < 50, sometimes up to n = 2000)

— Kolmogorov–Smirnov (with Lilliefors correction)

— A non-significant p-value (p > 0.05) means you fail to reject normality → parametric OK

— Caveat: with very large n, even trivial deviations become "significant" — rely on plots too

Variance homogeneity (for t-tests/ANOVA):

— Levene's test or Bartlett's test

— If unequal variances → use Welch's t-test (an adjusted parametric test) rather than jumping to nonparametric

Hemodynamic analogy — assess the data's "stability":

— Stable (normal, equal variance, no outliers) → parametric

— Unstable (skewed, outliers, heteroscedastic) → transform (log, square-root) and recheck, OR go nonparametric

Board pearl: Log-transformation often "rescues" right-skewed biomarker data (CRP, troponin, viral load, triglycerides, hospital cost) into a near-normal distribution, allowing a t-test on log-values. Reporting geometric means is the giveaway that a log-transform was performed. If transformation fails or the variable is inherently ordinal, commit to nonparametric — don't force a parametric model onto rank data.
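The log-transform "rescue" and its link to the geometric mean can be illustrated with a few lines of standard-library Python. The CRP values below are invented for illustration only; the point of the sketch is that the mean and median sit far apart on the raw scale, move close together on the log scale, and that back-transforming the mean of the logs yields the geometric mean.

```python
import math
import statistics

# Hypothetical right-skewed CRP values in mg/L (illustrative, not real data).
crp = [1.2, 1.8, 2.1, 2.5, 3.0, 3.4, 4.1, 5.6, 9.8, 48.0]

raw_mean = statistics.mean(crp)
raw_median = statistics.median(crp)

# Log-transform, then summarize on the log scale.
log_crp = [math.log(x) for x in crp]
log_mean = statistics.mean(log_crp)
log_median = statistics.median(log_crp)

# Back-transforming the mean of the logs gives the geometric mean.
geometric_mean = math.exp(log_mean)

print(f"raw scale: mean {raw_mean:.2f} vs median {raw_median:.2f}")
print(f"log scale: mean {log_mean:.2f} vs median {log_median:.2f}")
print(f"geometric mean: {geometric_mean:.2f}")
```

A mean that sits well above the median is a quick numeric hint of right skew, but it does not replace the histogram and Q-Q plot.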

Diagnostic Workup — The Core Parametric/Nonparametric Test Pairings

Memorize these one-to-one mappings — they are the highest-yield biostats facts on Step 3:

Two independent groups, continuous outcome:

— Parametric: Independent (two-sample) t-test — compares means; assumes normality + equal variance

— Nonparametric: Mann–Whitney U test (a.k.a. Wilcoxon rank-sum) — compares distributions/medians via ranks

Two paired/matched measurements:

— Parametric: Paired t-test — analyzes mean of within-subject differences

— Nonparametric: Wilcoxon signed-rank test — ranks the absolute differences and sums signed ranks

Three or more independent groups:

— Parametric: One-way ANOVA (F-test), followed by post-hoc Tukey/Bonferroni if significant

— Nonparametric: Kruskal–Wallis H test, followed by Dunn's post-hoc

Three or more repeated/paired measures:

— Parametric: Repeated-measures ANOVA

— Nonparametric: Friedman test

Correlation between two continuous variables:

— Parametric: Pearson correlation (r) — assumes linear, normal, no extreme outliers

— Nonparametric: Spearman rank correlation (ρ) — uses ranks; works with ordinal data or monotonic nonlinear relationships

Regression:

— Parametric continuous outcome: linear regression

— Robust alternatives: quantile (median) regression, or transform the outcome

Step 3 management: When a vignette reports "satisfaction scores were compared between two clinics" using a Likert scale — the correct test is Mann–Whitney U, not a t-test, because Likert is ordinal. Recognize ordinal scales (NYHA, ASA class, mRS, Apgar, pain VAS) as automatic nonparametric triggers regardless of sample size.

Diagnostic Workup — Advanced Test Selection and Categorical Comparisons

Categorical (nominal) outcomes sit outside the parametric/nonparametric dichotomy but appear in the same question stems:

— Chi-square test of independence — 2×2 or larger contingency tables; requires expected cell counts ≥ 5

— Fisher exact test — preferred when any expected cell count < 5 or sample is small (e.g., rare adverse events)

— McNemar test — the "paired chi-square" for matched binary data (e.g., same patients before/after intervention, sensitivity comparison between two diagnostic tests on the same patients)

— Cochran–Mantel–Haenszel — stratified analysis controlling for a confounder

Time-to-event (survival) data:

— Log-rank test — compares Kaplan–Meier curves between groups (nonparametric)

— Cox proportional hazards regression — semiparametric; yields hazard ratios

When parametric assumptions are borderline:

— Welch's t-test — relaxes equal-variance assumption; default in many modern statistical packages

— Bootstrap/permutation tests — resampling-based; distribution-free, increasingly used in modern trials

— Generalized estimating equations (GEE) or mixed models — clustered/longitudinal data

Multiple comparisons correction (applies to both families):

— Bonferroni (conservative), Holm, Tukey HSD (after ANOVA), Dunn (after Kruskal–Wallis), Benjamini–Hochberg (false discovery rate, for genomics/screening)

Key distinction: McNemar vs chi-square — McNemar is for paired binary data (the same patient classified twice), while chi-square independence is for independent binary observations. A study comparing pre- and post-intervention smoking rates in the same cohort uses McNemar; comparing smoking rates between two different towns uses chi-square. Mixing these up is a classic Step 3 distractor.
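Of the multiple-comparison corrections listed in this section, Bonferroni and Holm are simple enough to compute by hand, and a short sketch shows why Holm is never more conservative than Bonferroni. The function names and the three example p-values are hypothetical; both functions return adjusted p-values to compare against the usual 0.05.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down: the smallest p is multiplied by m, the next by m - 1, ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, p_values[idx] * (m - rank))
        # Adjusted p-values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
print("Bonferroni:", [round(p, 3) for p in bonferroni(raw)])
print("Holm:      ", [round(p, 3) for p in holm(raw)])
```

Each Holm-adjusted value is less than or equal to its Bonferroni counterpart, which is why Holm is often preferred when the family-wise error rate still needs strict control.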

Risk Stratification — Choosing Between Parametric and Nonparametric in Practice

Build a decision tree before reading answer choices:

— Step 1: What is the outcome variable type?

  — Continuous → proceed to Step 2

  — Ordinal → nonparametric

  — Nominal/binary → chi-square family (not parametric/nonparametric)

  — Time-to-event → log-rank/Cox

— Step 2: How many groups, and are they paired?

  — Map to the matching pair in the Core Parametric/Nonparametric Test Pairings

— Step 3: Are parametric assumptions met?

  — Normality (Shapiro–Wilk, Q-Q plot, histogram)

  — Equal variance (Levene's)

  — Independence of observations

  — Sample size ≥ 30 per group (CLT cushion)

  — If all yes → parametric; if any fail → try a transformation and recheck, otherwise go nonparametric

Risk of choosing wrong:

— Using a t-test on heavily skewed small-sample data inflates Type I error (false positives) because the test statistic's reference distribution is no longer valid

— Using nonparametric when parametric is appropriate sacrifices power → higher Type II error (false negatives, missed effects) and inflated sample-size requirements

Sample size implications:

— Nonparametric tests typically require ~5–15% more subjects for equivalent power when normality actually holds

— This matters in NIH grant review and FDA submission contexts that occasionally appear in Step 3 research vignettes

Board pearl: When a study uses a pilot design with n = 10 per arm and reports a t-test p-value, suspect inappropriate methodology — small samples cannot reliably establish normality, so Mann–Whitney U is the safer choice. Examiners love to flag this with a follow-up question about which test the investigators should have used.
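The sample-size penalty quoted in this section can be made concrete with a rough calculation. A standard large-sample result puts the efficiency of the Mann–Whitney test relative to the t-test at 3/π (about 0.955) when the data are exactly normal, which corresponds to the low end of the quoted range. The sketch below simply divides a t-test sample size by an efficiency figure; the starting value of 64 per group and the helper name `inflate_sample_size` are assumptions for illustration, and the second call uses an efficiency chosen only to mirror the 15% upper figure.

```python
import math

# Asymptotic relative efficiency of Mann-Whitney vs the t-test under exact normality.
ARE_NORMAL = 3 / math.pi


def inflate_sample_size(n_parametric, efficiency=ARE_NORMAL):
    """Per-group n needed by the rank test to match the t-test's power (rough guide)."""
    return math.ceil(n_parametric / efficiency)


n_t = 64  # hypothetical per-group n from a t-test power calculation
n_low = inflate_sample_size(n_t)             # using 3/pi
n_high = inflate_sample_size(n_t, 1 / 1.15)  # assumed efficiency matching "15% more"

print(f"t-test n per group: {n_t}")
print(f"Mann-Whitney n per group, ARE 3/pi: {n_low}")
print(f"Mann-Whitney n per group, 15% penalty: {n_high}")
```

This is an asymptotic approximation, not a substitute for a formal power calculation; with skewed or heavy-tailed data the rank test can be more efficient than the t-test rather than less.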

Pharmacotherapy — The Two-Sample t-Test and Mann–Whitney U in Depth

Independent two-sample t-test:

— Null hypothesis: μ₁ = μ₂ (means equal)

— Test statistic: t = (x̄₁ − x̄₂) / SE of difference

— Degrees of freedom = n₁ + n₂ − 2 (Student) or the Welch–Satterthwaite approximation (Welch)

— Assumptions: independent samples, approximately normal in each group, equal variances (relaxed in Welch's)

— Output: mean difference + 95% CI + p-value

— Clinical example: comparing mean LDL reduction between atorvastatin 40 mg vs rosuvastatin 20 mg in a 6-month RCT with n = 200/arm

Mann–Whitney U / Wilcoxon rank-sum:

— Null hypothesis: the two distributions are identical (often interpreted as equal medians if shapes are similar)

— Procedure: pool all observations, rank them, sum ranks in each group, compute U statistic

— Assumptions: independent samples, ordinal or continuous data, similar distribution shapes (for median interpretation)

— Output: Hodges–Lehmann estimate of the location shift (the median of all pairwise between-group differences) + 95% CI + p-value

— Clinical example: comparing pain scores (0–10 VAS) between gabapentin vs placebo in postherpetic neuralgia; or comparing length of stay (highly right-skewed) between two surgical techniques in a small cohort

Paired versions:

— Paired t-test analyzes the mean of within-pair differences

— Wilcoxon signed-rank ranks absolute differences and applies the sign — useful for pre/post designs with skewed change scores

Step 3 management: A vignette comparing HbA1c before and after initiating an SGLT2 inhibitor in 50 patients with approximately normal change scores uses the paired t-test. The same design in 12 patients with skewed change scores uses the Wilcoxon signed-rank test. Pairing + small n + skew → signed-rank, every time.
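The two test statistics described in this section can be computed by hand to see what each one actually measures. The sketch below uses only the standard library: `welch_t` returns the t statistic with Welch–Satterthwaite degrees of freedom, and `mann_whitney_u` follows the pool, rank, sum-ranks procedure. Function names and the LDL values are hypothetical, and p-values are left out on purpose; a statistics package such as SciPy would normally supply those.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    se = math.sqrt(v1 / n1 + v2 / n2)
    t = (statistics.mean(x) - statistics.mean(y)) / se
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Pool, rank, sum the ranks of x, then convert the rank sum to U."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical LDL reductions (mg/dL) in two small arms.
arm_a = [38, 42, 35, 47, 40, 44]
arm_b = [31, 36, 29, 40, 33, 35]
print("Welch t, df:", welch_t(arm_a, arm_b))
print("U statistics:", mann_whitney_u(arm_a, arm_b))
```

With equal group sizes and equal variances the Welch degrees of freedom reduce to n₁ + n₂ − 2, the Student value.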

Procedures / Multi-Group and Correlation Analyses

One-way ANOVA (parametric, ≥3 groups):

— Tests whether at least one group mean differs

— F statistic = between-group variance / within-group variance

— Significant ANOVA requires post-hoc pairwise testing (Tukey HSD, Bonferroni-corrected t-tests)

— Assumptions: normality within each group, homoscedasticity (Levene's), independence

— Example: comparing mean systolic BP across three antihypertensive arms (ACEi vs ARB vs thiazide)

Kruskal–Wallis (nonparametric, ≥3 groups):

— Rank-based analog of one-way ANOVA

— Post-hoc: Dunn's test with Bonferroni adjustment

— Example: comparing NYHA class distributions across three heart-failure treatment arms

Repeated-measures ANOVA vs Friedman:

— Repeated-measures ANOVA: same patients measured at 3+ timepoints (baseline, 3 mo, 6 mo); requires sphericity (Mauchly's test) — if violated, apply Greenhouse–Geisser correction

— Friedman: ranks within each subject across timepoints; use for ordinal or non-normal repeated measures

Correlation:

— Pearson r — linear association between two continuous, normally distributed variables; ranges −1 to +1; r² = proportion of variance explained

— Spearman ρ — rank-based; captures monotonic (not just linear) relationships; robust to outliers; appropriate for ordinal data

— Example: BMI vs HbA1c → Pearson (if normal); tumor stage (I–IV) vs survival months → Spearman

Board pearl: Significant ANOVA without post-hoc is incomplete — ANOVA tells you a difference exists somewhere but not where. If the answer choices include "ANOVA followed by Tukey HSD," that combination is usually correct over a bare "ANOVA." Performing multiple t-tests instead of ANOVA inflates Type I error (family-wise α) — a frequent wrong-answer trap.
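The Pearson versus Spearman contrast in this section is easiest to see on data that are perfectly monotonic but not linear. In the sketch below Spearman is computed as Pearson applied to ranks, which is its usual definition; the dose and response values are invented, and the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical dose-response data: always increasing, but exponentially.
dose = [1, 2, 3, 4, 5, 6]
response = [math.exp(d) for d in dose]

print(f"Pearson r:    {pearson_r(dose, response):.3f}")
print(f"Spearman rho: {spearman_rho(dose, response):.3f}")
```

Spearman reaches its maximum because the ordering is perfectly preserved, while Pearson is pulled below 1 by the curvature.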

Special Populations — Small Samples and Skewed Biomedical Data

Small samples (n < 30 per group):

— Central Limit Theorem no longer guarantees normal sampling distribution of the mean

— Must verify normality of raw data via Shapiro–Wilk + Q-Q plot

— If normality fails or cannot be assessed reliably → use nonparametric

— Exact versions of Mann–Whitney, Wilcoxon signed-rank, and Fisher exact are designed for small n

Inherently skewed biomedical variables — usually require log-transform or nonparametric:

— Hospital length of stay (right-skewed, floor at 0)

— Healthcare costs/charges (extreme right tail)

— Triglycerides, CRP, hs-troponin, D-dimer, viral loads, cytokine levels

— Time-to-event data (use survival methods, not t-tests)

— Reaction times, wait times in ED throughput studies

Renal/hepatic-impairment analogy (data "comorbidities"):

— Outliers act like comorbidities — they exert leverage on means but not on medians/ranks

— Nonparametric tests are inherently outlier-resistant because they use ranks, not raw values

— A single extreme value can flip a t-test from significant to nonsignificant; the Mann–Whitney result barely budges

Ordinal clinical scales — always nonparametric:

— NYHA, CCS angina class, mRS (modified Rankin), ECOG performance status, ASA physical status, Apgar, GCS, Likert satisfaction, VAS pain (debated — often treated as continuous if large n)

Step 3 management: When asked the best summary statistic for hospital length of stay in a quality-improvement project, choose median (IQR) over mean (SD). Reporting a mean LOS of "7.4 days" when one patient stayed 180 days misleads stakeholders; the median (e.g., 4 days) reflects typical patient experience and matches the appropriate nonparametric analytical approach.
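The effect of a single extreme length-of-stay value on the mean, and its lack of effect on the median, can be checked directly. The ten values below are invented for illustration and do not reproduce the 7.4-day figure quoted in this section; the quartiles come from `statistics.quantiles` with its default method.

```python
import statistics

# Hypothetical length-of-stay values in days; one patient stayed 180 days.
los_days = [2, 3, 3, 4, 4, 4, 5, 5, 6, 180]

mean_los = statistics.mean(los_days)
median_los = statistics.median(los_days)
q1, _, q3 = statistics.quantiles(los_days, n=4)

# Drop the outlier and recompute to see which summary moves.
without_outlier = los_days[:-1]
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(f"with outlier:    mean {mean_los:.1f}, median {median_los} (IQR {q1}-{q3})")
print(f"without outlier: mean {mean_trimmed:.1f}, median {median_trimmed}")
```

The mean shifts by more than 17 days when one patient is removed, while the median does not change at all, which is the practical argument for reporting median (IQR).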

Special Populations — Pediatrics, Rare-Disease, and Pilot Studies

Pediatric and rare-disease research:

— Small available populations → routinely small n → nonparametric tests are often the default

— Examples: pediatric oncology trials (n = 20–40), orphan-drug studies, single-center surgical series

— Bayesian methods are increasingly used alongside nonparametric frequentist tests in these contexts

Pilot and feasibility studies:

— Designed to estimate variability, refine procedures, not to confirm efficacy

— n typically 10–30 — too small for reliable normality testing

— Default to nonparametric for hypothesis-generating analyses; report effect sizes + CIs rather than emphasizing p-values

Crossover and matched designs in special populations:

— N-of-1 trials (often used in pediatric ADHD, refractory epilepsy): paired analyses, usually Wilcoxon signed-rank

— Twin studies, matched case-control: McNemar (binary), Wilcoxon signed-rank (continuous/ordinal)

Pregnancy and reproductive research:

— Highly regulated, often small datasets — nonparametric and exact tests dominate

— Time-to-event endpoints (time to conception, gestational age at delivery) → survival analysis with log-rank, not t-test

Cluster-randomized trials (school-, clinic-, or village-level randomization):

— Observations within a cluster are not independent → standard t-test/ANOVA violates the independence assumption

— Use mixed-effects models or GEE; ignoring clustering inflates Type I error

Key distinction: A pilot study with n = 15 comparing a new pediatric anti-epileptic to placebo, reporting median seizure frequency, almost certainly used Mann–Whitney U (or Wilcoxon signed-rank if crossover). If the answer choices include a two-sample t-test, that is the distractor — small pediatric samples with count/skewed outcomes belong to the nonparametric world.

Complications — Misapplication and Its Consequences

Type I error inflation (false positives):

— Applying a t-test to severely skewed, small-sample data → actual α may exceed nominal 0.05

— Running multiple pairwise t-tests instead of ANOVA → family-wise α balloons (e.g., 5 comparisons → ~23% chance of at least one false positive)

— Failing to correct for multiple comparisons in genomic, biomarker, or subgroup analyses

Type II error inflation (false negatives, missed effects):

— Using a nonparametric test when parametric assumptions are met loses 5–15% power

— Underpowered studies (small n, large variance) miss real differences

— Dichotomizing a continuous variable (e.g., BP into "high/normal") wastes information and lowers power

Misinterpreting nonparametric results:

— Mann–Whitney U does not strictly compare medians — it compares the probability that a random value from group 1 exceeds one from group 2 (stochastic dominance)

— Median comparison interpretation requires similar distribution shapes

— Reporting "mean rank" without effect size is uninformative

Ignoring paired structure:

— Treating paired data as independent loses statistical efficiency and may bias estimates

— Always check for repeated measurements, twins, matched pairs, or pre/post designs

Distributional assumption violations not addressed:

— Heteroscedasticity → use Welch's correction or transform

— Severe outliers → investigate (data entry error?), transform, or use robust/nonparametric methods

— Non-independent observations (clustered/longitudinal) → use mixed models or GEE

Board pearl: A study comparing 5 treatment arms using 10 pairwise t-tests rather than ANOVA with post-hoc has a family-wise Type I error of ~40% (1 − 0.95¹⁰). The Step 3 examiner expects you to flag this as inflated false-positive risk and recommend either ANOVA + Tukey HSD or Bonferroni adjustment (α/10 = 0.005 per test).
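The family-wise error figures quoted in this section (about 23% for 5 comparisons, about 40% for 10) follow from 1 − (1 − α)^k, and the arithmetic is quick to verify. The formula treats the tests as independent, which pairwise comparisons sharing arms are not, so the results are best read as approximations; the function name is hypothetical.

```python
from math import comb


def family_wise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


arms = 5
pairwise_tests = comb(arms, 2)            # every pair of arms compared once
fwer = family_wise_error(pairwise_tests)
bonferroni_alpha = 0.05 / pairwise_tests  # per-test threshold after correction

print(f"{arms} arms -> {pairwise_tests} pairwise tests")
print(f"family-wise error at alpha 0.05: {fwer:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
print(f"5 comparisons: {family_wise_error(5):.3f}")
```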

When to Escalate — Consulting a Biostatistician and Advanced Methods

Escalate to a biostatistician when:

— Multiple predictors → multivariable regression (linear, logistic, Cox)

— Clustered or longitudinal data → mixed-effects models, GEE

— Complex survival analysis with competing risks

— Adaptive trial designs, Bayesian frameworks, interim analyses

— Missing data > 5–10% → multiple imputation rather than complete-case analysis

— Propensity score matching for observational comparative effectiveness

When simple parametric/nonparametric is enough:

— Single primary outcome, two groups, randomized → t-test or Mann–Whitney

— Pre/post within-subject change → paired t or Wilcoxon signed-rank

— Categorical 2×2 outcome → chi-square or Fisher exact

Red flags that you've outgrown a simple test:

— Confounders that require adjustment → regression

— Repeated measurements over time → mixed model or repeated-measures ANOVA

— Time-to-event with censoring → Kaplan–Meier + log-rank, Cox

— Diagnostic test evaluation → ROC, sensitivity/specificity, AUC comparison (DeLong test)

Reporting standards:

— CONSORT for RCTs, STROBE for observational, PRISMA for systematic reviews — all require explicit statement of statistical methods, assumptions checked, and handling of missing data

— Pre-registration of analysis plan (clinicaltrials.gov, OSF) prevents p-hacking and selective reporting

Step 3 management: When the vignette describes a multicenter RCT with adjustment for baseline characteristics and a time-to-event primary endpoint, the correct analytical approach is Cox proportional hazards regression, not a t-test or Mann–Whitney U. Recognizing when you've moved beyond two-group bivariate comparison into multivariable territory is itself testable. Don't force a simple test onto a complex design.

Key Differentials — Choosing Among Nonparametric Tests

A common Step 3 trap is picking the wrong nonparametric test within the family:

Mann–Whitney U (Wilcoxon rank-sum):

— Two independent groups, ordinal or skewed continuous outcome

— Example: pain VAS in two parallel arms of an RCT

Wilcoxon signed-rank:

— One sample vs reference value, OR two paired/matched measurements

— Example: pre/post change in symptom score in the same patients

Kruskal–Wallis:

— ≥3 independent groups

— Example: tumor response (CR/PR/SD/PD as ordinal) across four chemotherapy regimens

Friedman:

— ≥3 paired/repeated measurements in the same subjects

— Example: pain score at baseline, 1 month, 3 months, 6 months in the same cohort

Spearman correlation (ρ):

— Monotonic association between two ordinal or non-normal continuous variables

— Example: tumor grade (I–IV) vs survival in months

Log-rank test:

— Time-to-event outcome between groups (Kaplan–Meier survival curves)

Sign test:

— Cruder paired alternative when even the magnitude of differences cannot be assumed meaningful — only direction matters

Key distinction: Wilcoxon rank-sum (= Mann–Whitney U) is for independent groups, while Wilcoxon signed-rank is for paired data. They share the Wilcoxon name and are constantly confused. Mnemonic: "Rank-Sum = Separate samples; Signed-Rank = Same subjects." Picking signed-rank for two independent groups (or vice versa) is one of the most common biostats errors on Step 3.
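The difference between the Wilcoxon signed-rank test and the cruder sign test shows up clearly when both are computed on the same paired data: signed-rank weights each change by the rank of its size, while the sign test only counts directions. The sketch below computes the statistics only (no p-values); the symptom scores are invented, the function names are hypothetical, and dropping zero differences is one common convention rather than the only one.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def signed_rank_statistic(before, after):
    """Sum of ranks of positive and of negative within-pair changes."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero changes
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def sign_test_counts(before, after):
    """Number of pairs that increased and number that decreased."""
    plus = sum(1 for b, a in zip(before, after) if a > b)
    minus = sum(1 for b, a in zip(before, after) if a < b)
    return plus, minus


# Hypothetical symptom scores in the same six patients, pre and post treatment.
pre = [8, 7, 6, 9, 5, 7]
post = [5, 6, 6, 4, 6, 3]

print("signed-rank (W+, W-):", signed_rank_statistic(pre, post))
print("sign test (+, -):    ", sign_test_counts(pre, post))
```

The one patient who worsened did so by a single point, so that pair contributes little to the signed-rank statistic but counts fully in the sign test.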

Key Differentials — Parametric Tests and Their Look-Alikes

One-sample t-test: compares a sample mean to a known reference value (e.g., mean cholesterol in your clinic vs national average of 200 mg/dL).

Independent (two-sample) t-test: two unrelated groups, equal variances assumed (Student's) or not (Welch's).

Paired t-test: same subjects measured twice — analyzes the mean of within-subject differences, not the two raw means.

One-way ANOVA: ≥3 independent groups, one factor.

Two-way ANOVA: two factors (e.g., drug × dose, or treatment × sex) — also tests interaction.

ANCOVA: ANOVA with a continuous covariate adjustment (e.g., comparing post-treatment BP across groups while adjusting for baseline BP).

Repeated-measures ANOVA: same subjects across multiple timepoints or conditions.

Pearson correlation / linear regression: continuous-continuous linear relationships.

Look-alikes from other families that examiners use as distractors:

— Chi-square / Fisher exact — categorical outcomes (NOT a parametric vs nonparametric choice)

— McNemar — paired binary outcomes

— Log-rank / Cox — time-to-event

— Logistic regression — binary outcome with multiple predictors

Z-test: compares means or proportions when population variance is known or n is very large — rarely the right answer on Step 3 unless explicitly stated (more common in epidemiology of proportions).

Board pearl: When the question asks about comparing proportions (e.g., 30-day mortality 12% vs 18%) between two groups, the answer is chi-square (or Fisher exact if small expected counts), not a t-test. Proportions are categorical summaries — they don't enter the parametric/nonparametric continuous-data dichotomy. Recognizing the outcome as a proportion vs a mean immediately reroutes you to the correct test family.
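For the categorical look-alikes, the chi-square and McNemar statistics are simple enough to compute by hand, which also makes the "expected cell count ≥ 5" rule concrete. The sketch below assumes a hypothetical trial with 200 patients per arm and 12% vs 18% mortality, plus invented discordant-pair counts for the McNemar example. Both statistics are compared against 3.841, the chi-square critical value for 1 degree of freedom at α = 0.05, and no continuity correction is applied.

```python
CHI2_CRITICAL_DF1 = 3.841  # chi-square critical value, 1 df, alpha = 0.05


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], plus expected counts."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


def mcnemar_statistic(b, c):
    """McNemar chi-square from the two discordant-pair counts."""
    return (b - c) ** 2 / (b + c)


# Independent groups: 24/200 deaths (12%) vs 36/200 deaths (18%), hypothetical.
stat, expected = chi_square_2x2(24, 176, 36, 164)
print(f"chi-square = {stat:.3f}, smallest expected count = {min(expected):.0f}")
print("significant at 0.05:", stat > CHI2_CRITICAL_DF1)

# Paired binary data: 15 patients changed yes->no, 5 changed no->yes, hypothetical.
paired_stat = mcnemar_statistic(15, 5)
print(f"McNemar = {paired_stat:.1f}, significant:", paired_stat > CHI2_CRITICAL_DF1)
```

McNemar uses only the discordant pairs; patients classified the same way both times carry no information about change.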

Secondary Prevention — Building Statistical Literacy for Practice

Long-term skills for evidence-based practice:

— Recognize the test used in any abstract or paper you read

— Verify that descriptive statistics (mean/SD vs median/IQR) match the inferential test family

— Check whether assumptions were stated and verified — most rigorous journals now require this

Reading a Methods section — checklist:

— Outcome variable type and distribution

— Normality assessment (Shapiro–Wilk, Q-Q plots)

— Test choice justified (parametric, nonparametric, regression)

— Multiple comparisons handled (Bonferroni, FDR)

— Missing data strategy

— Sample size/power calculation

When discussing trial results with patients (Step 3 communication):

— Translate mean differences into clinically meaningful terms (NNT, absolute risk reduction)

— Acknowledge that statistical significance ≠ clinical significance

— A trivial 1 mmHg systolic difference can be "p < 0.001" in a 50,000-patient trial — meaningless clinically

For QI and practice-based research:

— Use median (IQR) for length-of-stay, charges, wait times

— Use Mann–Whitney for between-clinic comparisons of these skewed outcomes

— Use control charts (SPC) for ongoing monitoring rather than repeated hypothesis tests

Reporting effect sizes:

— Parametric: Cohen's d, mean difference with 95% CI

— Nonparametric: Hodges–Lehmann median difference, rank-biserial correlation

Step 3 management: A QI project tracking ED door-to-balloon times before/after a process change should report median (IQR) and use a Wilcoxon signed-rank or Mann–Whitney test (depending on paired vs unpaired design), not a t-test. Time-based throughput metrics are nearly always right-skewed, and reporting means misrepresents typical performance to hospital leadership.
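Two of the effect sizes named in this section can be computed in a few lines, which helps show what each one reports. The sketch below assumes Cohen's d with a pooled standard deviation and the two-sample Hodges–Lehmann estimator defined as the median of all pairwise between-group differences; the function names and the door-to-balloon times are hypothetical.

```python
import math
import statistics


def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = (
        (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    ) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)


def hodges_lehmann(x, y):
    """Median of all pairwise differences x_i - y_j (two-sample location shift)."""
    return statistics.median([a - b for a in x for b in y])


# Hypothetical door-to-balloon times (minutes) before and after a process change,
# measured in different patients; one pre-change value is an extreme outlier.
before = [72, 85, 90, 95, 110, 240]
after = [60, 66, 70, 75, 82, 88]

print(f"Cohen's d:      {cohens_d(before, after):.2f}")
print(f"Hodges-Lehmann: {hodges_lehmann(before, after):.1f} minutes")
print(f"mean difference: {statistics.mean(before) - statistics.mean(after):.1f} minutes")
```

The outlier inflates both the mean difference and the pooled standard deviation, while the Hodges–Lehmann estimate stays close to the typical between-group gap.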

Follow-Up, Monitoring, and Continued Learning

Monitor your own statistical reasoning:

— When reading a paper, predict the test before checking — calibrate your intuition

— Track which test types appear in your specialty's literature (cardiology favors Cox regression and Kaplan–Meier; surgery favors chi-square and Mann–Whitney; pharmacology favors ANOVA and mixed models)

Free tools and resources:

— JAMA Guide to Statistics and Methods series

— BMJ Endgames — Statistical Question features (excellent Step 3 prep)

— Coursera/edX biostatistics modules

— R, Python, Stata — even basic familiarity helps interpret published code/methods

Common monitoring parameters in trials:

— Pre-specified primary outcome and primary analysis

— Interim analyses with stopping boundaries (O'Brien–Fleming, Pocock)

— Data Safety Monitoring Board oversight for ongoing trials

— Intent-to-treat vs per-protocol analyses

Counseling colleagues/trainees:

— Emphasize that test choice flows from data type, design, and assumptions — not from "which test do I know"

— Encourage early biostatistician consultation in research design, not after data collection

— Stress that transformation is a legitimate first response to skew before abandoning parametric methods

Rehab — what to do if you ran the wrong test:

— Re-analyze with the appropriate test and transparently report both

— Many "p < 0.05 by t-test, p = 0.08 by Mann–Whitney" results indicate fragile findings that may not replicate

— Robustness checks (re-running with alternative tests) strengthen the credibility of conclusions

Board pearl: If parametric and nonparametric tests give discordant results on the same dataset, the finding is likely fragile — driven by outliers, skew, or small n. Reporting both analyses with appropriate hedging is the rigorous response; cherry-picking the favorable p-value is p-hacking and a research integrity violation.

Ethical, Legal, and Patient Safety Considerations in Statistical Practice

Research integrity issues:

— P-hacking — running multiple tests until one reaches p < 0.05 and reporting only that one; violates scientific honesty and biases the literature

— HARKing (Hypothesizing After Results are Known) — presenting post-hoc findings as if they were pre-specified

— Selective outcome reporting — defining primary endpoint after seeing data

— Remedy: pre-registration of trials (ClinicalTrials.gov, mandated by ICMJE for journal publication) and analysis plans

Informed consent for research participation:

— Subjects must understand what statistical analyses will be performed and how their data contribute

— Genomic and biomarker studies require explicit consent for secondary analyses and data sharing

— IRB approval verifies statistical methods are appropriate to the question being asked

Patient safety — clinical decisions based on misapplied statistics:

— A trial using the wrong test may report a false-positive efficacy signal → patients exposed to ineffective or harmful therapy

— A trial using an underpowered nonparametric test may miss a true benefit → patients denied effective care

— Example: early underpowered trials of bevacizumab in metastatic breast cancer; accelerated approval later withdrawn after rigorous larger trials

Conflicts of interest:

— Industry-sponsored trials with flexible statistical plans require independent review

— Disclose all analyses performed, not only those reaching significance

Transition-of-care risk — translating evidence to practice:

— When discharging a patient on a new medication, ensure the supporting trial's analysis is valid for your patient (external validity)

— Subgroup analyses from large trials are often underpowered or post hoc — interpret cautiously before applying to elderly, pediatric, or comorbid patients underrepresented in the original sample

Step 3 management: If a sponsor pressures an investigator to "switch from Mann–Whitney to t-test because the t-test p-value is lower," this is research misconduct. The correct response is to refuse, report the request to the IRB, and document the original pre-specified analysis plan. Patient safety and scientific integrity supersede sponsor preference.

High-Yield Associations and Rapid-Fire Clinical Facts

Board pearl: The single most efficient Step 3 biostats heuristic — read the outcome variable description first, then count the groups and check if paired, then scan for "normal" vs "skewed/median" language. Three reads, three branches, correct test in under 20 seconds.

Independent t-test ↔ Mann–Whitney U (= Wilcoxon rank-sum)

Paired t-test ↔ Wilcoxon signed-rank

One-way ANOVA ↔ Kruskal–Wallis

Repeated-measures ANOVA ↔ Friedman

Pearson correlation ↔ Spearman correlation

Chi-square ↔ Fisher exact (small expected cell counts)

Chi-square ↔ McNemar (paired binary)

Ordinal data triggers nonparametric: NYHA, Apgar, GCS, mRS, ECOG, ASA, Likert, pain VAS, tumor stage

Inherently skewed (consider log-transform or nonparametric): length of stay, costs, CRP, troponin, triglycerides, viral load, cytokines, reaction times

Sample size rule of thumb: n ≥ 30 per group invokes CLT, often justifying t-test even for non-normal raw data (if not too extreme)

Welch's t-test: robust to unequal variances; modern default

Bonferroni correction: α / number of comparisons; conservative but simple

Tukey HSD: post-hoc for ANOVA — controls family-wise error rate

Dunn's test: post-hoc for Kruskal–Wallis

Log-rank test: compares Kaplan–Meier curves; nonparametric survival comparison

Cox regression: hazard ratios, semiparametric (no assumption on baseline hazard shape, but assumes proportional hazards)

Effect sizes parametric: Cohen's d (small 0.2, medium 0.5, large 0.8); mean difference + 95% CI
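Cohen's d is the difference in means divided by the pooled standard deviation. A minimal sketch with hypothetical data follows; the labels reuse the 0.2 / 0.5 / 0.8 benchmarks above, and the "below small" label for d under 0.2 is an assumption added for completeness.

```python
import math
import statistics


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    n1, n2 = len(a), len(b)
    pooled_var = (
        (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    ) / (n1 + n2 - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


def label_effect(d):
    """Map |d| onto the conventional benchmarks (0.2, 0.5, 0.8)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"  # label assumed here; not part of the benchmarks


# Hypothetical outcome values in two arms
arm_a = [2, 4, 6, 8, 10]
arm_b = [1, 3, 5, 7, 9]

d = cohens_d(arm_a, arm_b)
print(f"d = {d:.3f} ({label_effect(d)})")
```

Here the means differ by 1 and the pooled SD is √10, so d ≈ 0.32, a small effect by the benchmarks.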

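Effect sizes nonparametric: Hodges–Lehmann estimator (the median of all pairwise between-group differences, which is not the same as the difference between the two group medians); rank-biserial correlation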
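Both nonparametric effect sizes can be computed by brute force over every between-group pair. This is a sketch for small samples with hypothetical data. The rank-biserial formula used here is the "favorable minus unfavorable pairs over total pairs" form, with ties counted as neither.

```python
import statistics
from itertools import product


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences x_i - y_j (two-sample location shift)."""
    return statistics.median([xi - yj for xi, yj in product(x, y)])


def rank_biserial(x, y):
    """(pairs where x > y minus pairs where x < y) / total pairs; ranges from -1 to 1."""
    pairs = list(product(x, y))
    wins = sum(1 for xi, yj in pairs if xi > yj)
    losses = sum(1 for xi, yj in pairs if xi < yj)
    return (wins - losses) / len(pairs)


# Hypothetical length-of-stay values (days) in two small groups
group_x = [5, 7, 9]
group_y = [1, 2, 8]

shift = hodges_lehmann_shift(group_x, group_y)
difference_of_medians = statistics.median(group_x) - statistics.median(group_y)
r = rank_biserial(group_x, group_y)

print(f"Hodges-Lehmann shift:  {shift}")
print(f"difference of medians: {difference_of_medians}")
print(f"rank-biserial r:       {r:.3f}")
```

In this toy example the Hodges–Lehmann estimate is 4 while the difference between the group medians is 5, which shows that the two quantities are distinct.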

Reporting tip: mean ± SD signals parametric analysis; median (IQR) signals nonparametric analysis — recognize the descriptive-inferential pairing in any abstract.

Board Question Stem Patterns

Pattern 1 — Ordinal outcome trap:

— "Investigators compared NYHA class between sacubitril/valsartan and enalapril arms..."

— Correct: Mann–Whitney U. Distractor: independent t-test (wrong — NYHA is ordinal)

Pattern 2 — Paired pre/post:

— "The same 40 patients had HbA1c measured before and after 6 months of semaglutide..."

— Correct: paired t-test if normal change scores; Wilcoxon signed-rank if skewed. Distractor: independent t-test (wrong — same patients)

Pattern 3 — Skewed continuous, small n:

— "In a pilot study of 12 patients, hospital length of stay was compared..."

— Correct: Mann–Whitney U. Distractor: t-test (wrong — small n, skewed LOS)

Pattern 4 — Three groups, normal data:

— "Mean systolic BP was compared across three antihypertensive arms..."

— Correct: one-way ANOVA followed by Tukey HSD. Distractor: multiple pairwise t-tests (inflates Type I error)

Pattern 5 — Categorical outcome:

— "30-day mortality (binary) was compared between PCI and CABG..."

— Correct: chi-square (or Fisher exact if small cells). Distractor: t-test (wrong — binary outcome)

Pattern 6 — Paired binary outcome:

— "Sensitivity of two diagnostic tests applied to the same 200 patients..."

— Correct: McNemar test. Distractor: chi-square (wrong — paired)

Pattern 7 — Time-to-event:

— "Median survival was compared between two chemotherapy regimens with censoring..."

— Correct: log-rank test (and/or Cox regression for adjusted HR). Distractor: t-test on survival times (wrong — ignores censoring)

Pattern 8 — Correlation:

— "The association between tumor grade (I–IV) and survival (months) was assessed..."

— Correct: Spearman correlation. Distractor: Pearson r (wrong — grade is ordinal)

Key distinction: Always reread the stem for three keywords: (1) the outcome variable (continuous? ordinal? binary? time-to-event?), (2) the number and pairing of groups, and (3) any distributional clues ("median," "skewed," "small sample"). These three items deterministically select the test in nearly every Step 3 biostats question.

One-Line Recap

Choose a parametric test (t-test, ANOVA, Pearson) when the outcome is continuous, approximately normally distributed, and sample size supports it; choose the rank-based nonparametric equivalent (Mann–Whitney, Wilcoxon signed-rank, Kruskal–Wallis, Friedman, Spearman) when the outcome is ordinal, skewed, small-sample, or contains outliers — and recognize that categorical and time-to-event data have their own families (chi-square/Fisher/McNemar, log-rank/Cox).

Board pearl: Three-step Step 3 algorithm — (1) identify outcome type, (2) count groups + check pairing, (3) check distribution/sample size — solves nearly every biostats test-selection question in under 30 seconds.
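The three-step algorithm can be written down as a small decision function. This is a study-aid sketch, not a substitute for checking assumptions on real data: the function name, argument names, and returned strings are all hypothetical, and it covers only the group-comparison tests named in this section (correlation, regression, and post-hoc tests are left out).

```python
def select_test(outcome, n_groups, paired, normal=False):
    """Pick a test family from outcome type, group structure, and distribution.

    outcome: 'continuous', 'ordinal', 'binary', or 'time-to-event'
    normal:  True only if the data are described as approximately normal
    """
    if outcome not in {"continuous", "ordinal", "binary", "time-to-event"}:
        raise ValueError(f"unknown outcome type: {outcome!r}")
    if n_groups < 2:
        raise ValueError("this sketch only covers comparisons of 2+ groups")

    # Step 1: outcome type. Time-to-event and binary have their own families.
    if outcome == "time-to-event":
        return "log-rank test"
    if outcome == "binary":
        if paired and n_groups > 2:
            raise ValueError("paired binary with >2 groups is not covered here")
        return "McNemar test" if paired else "chi-square (Fisher exact if small cells)"

    # Step 3 (decided early): parametric only if continuous AND roughly normal.
    parametric = outcome == "continuous" and normal

    # Step 2: number of groups and pairing.
    if n_groups == 2:
        if paired:
            return "paired t-test" if parametric else "Wilcoxon signed-rank"
        return "independent t-test" if parametric else "Mann-Whitney U"
    if paired:
        return "repeated-measures ANOVA" if parametric else "Friedman"
    return "one-way ANOVA" if parametric else "Kruskal-Wallis"


# NYHA class between two arms: ordinal, unpaired
print(select_test("ordinal", n_groups=2, paired=False))
# HbA1c before/after in the same patients, normal change scores
print(select_test("continuous", n_groups=2, paired=True, normal=True))
# Mean systolic BP across three arms, normal data
print(select_test("continuous", n_groups=3, paired=False, normal=True))
```

The `normal` flag defaults to False, so a stem that gives no reassurance about the distribution falls through to the rank-based test, which mirrors the conservative reading of small or skewed samples.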

High-yield recap 1: Independent t-test ↔ Mann–Whitney U; Paired t-test ↔ Wilcoxon signed-rank; ANOVA ↔ Kruskal–Wallis; Repeated-measures ANOVA ↔ Friedman; Pearson r ↔ Spearman ρ. Memorize these five pairings cold.

High-yield recap 2: For board purposes, ordinal scales (NYHA, Likert, mRS, ECOG, Apgar, GCS) call for nonparametric tests regardless of sample size. Mean ± SD signals parametric and median (IQR) signals nonparametric, so let the descriptive statistics tip you off to the inferential family.

High-yield recap 3: Categorical outcomes use chi-square (or Fisher exact for small cells, McNemar for paired binary); time-to-event uses log-rank and Cox regression — these sit outside the parametric/nonparametric dichotomy but are tested in the same question set.

High-yield recap 4: Choosing the wrong test is a research integrity and patient safety issue — it can produce false-positive efficacy signals (harming patients with ineffective therapies) or false-negative results (denying patients real benefits). Pre-register your analysis plan, verify assumptions before testing, correct for multiple comparisons, and consult a biostatistician for any analysis beyond a simple two-group bivariate comparison.