Biostatistics & Population Health
Mean, median, mode in skewed distributions
— Mean: arithmetic average; mathematically pulled toward extreme values (outliers)
— Median: the 50th percentile value; robust to outliers because it depends only on rank order
— Mode: the most frequently occurring value; reflects the peak of the distribution
— Right (positive) skew: order from low to high on the x-axis is mode < median < mean
— Mnemonic: "the mean chases the tail" — outliers on the right pull the mean rightward
— Classic examples: hospital length of stay, healthcare costs, income, serum CRP, viral load, waiting times, drug half-lives in a population that includes poor metabolizers
— Left (negative) skew: order from low to high is mean < median < mode
— Examples: age at death in a developed country, gestational age at delivery, test scores when most students do well

— "Hospital length of stay," "time to event," "healthcare expenditures," "wait times in the ED"
— "A few patients stayed >30 days while most were discharged within 3"
— "Mean serum ferritin was 450 ng/mL but median was 80 ng/mL" — large mean-median gap with mean > median
— Counts of rare events, parasite burdens, hospital charges, CD4 counts in untreated HIV
— Age at death in industrialized populations (most die old, a few die young)
— Gestational age at delivery in a healthy cohort (most ~39–40 weeks, tail toward preterm)
— Performance on an easy exam (ceiling effect)
— Mean > median → right skew
— Mean < median → left skew
— Mean ≈ median → approximately symmetric (but not proof of normality — could be bimodal)
— Example: age distribution of Hodgkin lymphoma (peaks at 20s and 60s), Crohn's onset
— Mean and median may both sit in the valley between peaks, misrepresenting "typical" patient

— Right skew: tall bars on the left, progressively shorter bars trailing to the right; peak (mode) sits on the left side
— Left skew: tall bars on the right, tail trailing left; peak sits on the right
— Symmetric: mirror-image bars around a central peak (bell curve if normal)
— Box = interquartile range (IQR), Q1 to Q3
— Line inside box = median (Q2)
— Whiskers extend to the most extreme data point within 1.5×IQR of the box edges (Tukey convention), or to min/max in simpler plots
— Dots beyond whiskers = outliers
— Right skew: median line sits closer to Q1 (lower edge of box); upper whisker longer; outliers on the high end
— Left skew: median sits closer to Q3; lower whisker longer; outliers on the low end
— Symmetric: median centered in box; whiskers equal length
— Normal Q-Q plot: points falling along the diagonal line = approximately normal distribution
— Upward curve at the right end = right skew
— Downward curve at the left end = left skew
— Side-by-side boxplots of LOS for two hospital units: median comparison is fair even if both are right-skewed; comparing means may be distorted by a single outlier ICU stay
— Right skew: mean line sits to the right of the median line
— Left skew: mean line sits to the left of the median line
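The boxplot reading rules above can be sketched numerically. This is a minimal illustration, assuming Tukey's 1.5×IQR convention and the "inclusive" quartile method from Python's standard library; the function name `tukey_summary` and the length-of-stay values are hypothetical, and other quartile conventions give slightly different cut points.

```python
import statistics


def tukey_summary(values):
    """Return quartiles, IQR, 1.5 x IQR fences, and flagged outliers."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population of interest;
    # this is one of several accepted quartile conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,
    }


# Hypothetical length-of-stay values in days (illustrative only).
los_days = [1, 2, 2, 3, 3, 3, 4, 5, 6, 35]
summary = tukey_summary(los_days)

# Right-skew signature: the median sits closer to Q1 than to Q3,
# and the flagged outlier is on the high end.
median_closer_to_q1 = (summary["median"] - summary["q1"]) < (summary["q3"] - summary["median"])
print(summary, median_closer_to_q1)
```

In this made-up sample the single 35-day stay falls above the upper fence, which is the "dot beyond the whisker" a boxplot would show.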

— Step 1: Is the variable continuous or categorical?
— Step 2: If continuous, is the distribution approximately symmetric or skewed?
— Step 3: Are there meaningful outliers?
— Examples: adult height, systolic BP in a healthy cohort, hemoglobin A1c in a screened population
— Examples: hospital LOS, healthcare cost per admission, time to ED triage, CRP, troponin, drug levels
— IQR (Q1–Q3) is the robust analog of standard deviation
— Examples: most common presenting symptom, predominant blood type, most frequent discharge diagnosis
— Mean and median are mathematically nonsensical for nominal data ("the mean blood type is B+" is meaningless)
— Examples: pain scale 0–10, NYHA class, Glasgow Coma Scale, satisfaction Likert scale
— Mean of ordinal data assumes equal spacing between categories, which is rarely true
— Dataset: 2, 3, 4, 5, 6 → mean=4, median=4
— Add an outlier: 2, 3, 4, 5, 6, 100 → mean=20, median=4.5
— The mean quintupled (4 → 20) while the median barely moved (4 → 4.5); this is the entire reason the median is preferred for skewed data
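The five-value example above can be checked directly with Python's standard library. This is only a sketch; the variable names are illustrative.

```python
import statistics

base = [2, 3, 4, 5, 6]
with_outlier = base + [100]

mean_before = statistics.mean(base)
median_before = statistics.median(base)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean moves from 4 to 20; the median moves from 4 to 4.5.
print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```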

• Skewness coefficient (numerical measure):
— Skewness = 0 → symmetric
— Skewness > 0 → right-skewed (positive skew)
— Skewness < 0 → left-skewed (negative skew)
— |Skewness| > 1 generally considered substantial skew
• Pearson's approximation: Skew ≈ 3(mean − median)/SD
— Quickly estimates direction and magnitude from summary statistics
— Confirms the rule: mean > median → positive skew
• Kurtosis: describes tail heaviness, not central tendency
— Leptokurtic = heavy tails (more outliers than normal)
— Platykurtic = light tails
— Not directly asked on Step 3 but may appear as a distractor
• Log transformation, the workhorse for right-skewed biomedical data:
— Taking the natural log (ln) of each value often converts a right-skewed distribution into an approximately normal one
— Common targets: viral load, antibody titers, cytokine levels, hospital costs, time-to-event data, microbial colony counts
— After log-transforming, you may legitimately use parametric tests (t-test, ANOVA) and report the geometric mean
• Geometric mean: nth root of the product of n values; equivalent to the antilog of the mean of the logged values
— Reported for antibody titers (e.g., GMT in vaccine trials), pharmacokinetic parameters (Cmax, AUC)
— Always ≤ arithmetic mean for positive data
• Parametric vs nonparametric test choice:
— Symmetric/normal continuous data → t-test, ANOVA, Pearson correlation (use mean)
— Skewed or ordinal data → Wilcoxon rank-sum, Mann-Whitney U, Kruskal-Wallis, Spearman correlation (use median)
• Key distinction: Sample size affects this choice via the central limit theorem: the sampling distribution of the mean approaches normal as n grows, even for skewed underlying data. With n > ~30, t-tests on means become robust. But the underlying data are still skewed, so for describing a typical patient, median remains preferred even when inferential statistics can use the mean
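Pearson's approximation and the geometric mean described above can be sketched in a few lines. The function names and the titer values are hypothetical, the sample SD (n − 1 denominator) is an assumption, and the geometric mean is computed as the antilog of the mean of the natural logs, as the notes define it.

```python
import math
import statistics


def pearson_median_skew(values):
    """Pearson's approximation: 3 * (mean - median) / SD, using the sample SD."""
    sd = statistics.stdev(values)
    return 3 * (statistics.mean(values) - statistics.median(values)) / sd


def geometric_mean_via_logs(values):
    """Antilog of the mean of the natural logs; needs strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    logs = [math.log(v) for v in values]
    return math.exp(statistics.fmean(logs))


# Hypothetical antibody titers (illustrative only).
titers = [10, 20, 40, 80, 160, 1280]

skew = pearson_median_skew(titers)
gmt = geometric_mean_via_logs(titers)
arithmetic = statistics.fmean(titers)

# Positive skew and a geometric mean below the arithmetic mean are expected here.
print(f"Pearson skew ~ {skew:.2f}, GMT ~ {gmt:.1f}, arithmetic mean = {arithmetic:.1f}")
```

The single high titer inflates the arithmetic mean far more than the geometric mean, which is why the latter is the usual summary for titer data.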

— Mean per-patient annual cost in a population with a few catastrophic cases can be 3–10× the median
— Policy decisions based on mean misallocate resources; median + distribution percentiles give a fairer picture
— High-cost users (top 5%) account for ~50% of US healthcare spending → classic extreme right skew
— Comparing hospitals by mean LOS penalizes tertiary centers that admit complex outliers
— Risk-adjusted median LOS is the fairer quality metric
— Half-life data are typically right-skewed (a few slow metabolizers)
— Using mean half-life can underestimate accumulation risk in the slow-metabolizer tail
— Many biomarkers (ferritin, ALT, TSH, CRP, IgE) are right-skewed in healthy populations
— Reference intervals use the central 95% (2.5th–97.5th percentile), not mean ± 2 SD, because the underlying data are skewed
— Survival times are typically right-skewed; median survival is the standard reported metric in oncology (e.g., "median OS 14 months")
— Mean survival is rarely meaningful because the tail of long survivors distorts it
— Door-to-balloon time, door-to-needle time, sepsis bundle compliance time — all right-skewed; report median and 90th percentile

— Sum all values, divide by n
— Sensitive to every value, especially extremes
— Formula: x̄ = Σxᵢ / n
— Step 1: Sort all values in ascending order
— Step 2: If n is odd, median = middle value (position (n+1)/2)
— Step 3: If n is even, median = average of two middle values (positions n/2 and n/2+1)
— Values: 2, 4, 4, 7, 9, 12, 100 (n=7)
— Sorted already; middle position = 4th value = 7
— Mean = 138/7 ≈ 19.7 → mean (19.7) >> median (7) → strongly right-skewed
— Values: 3, 5, 5, 8, 10, 12 (n=6)
— Middle positions = 3rd and 4th values = 5 and 8 → median = (5+8)/2 = 6.5
— Mean = 43/6 ≈ 7.17 → mean ≈ median → roughly symmetric
— Identify the most frequently occurring value(s)
— Unimodal: one mode (e.g., 2, 3, 3, 4, 5 → mode=3)
— Bimodal: two modes (e.g., 2, 3, 3, 5, 5, 7 → modes=3 and 5)
— No mode: all values appear equally often
— For continuous data, mode is the peak of the histogram (often estimated, not computed)
— Mean − median > 0 and large relative to SD → right skew
— Mean − median < 0 → left skew
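The hand calculations above (sort, pick the middle position or average the two middle positions, then find the most frequent values) can be reproduced with a short sketch. `median_by_position` is a hypothetical helper written to mirror the three steps; `statistics.multimode` is used for the modes.

```python
import statistics


def median_by_position(values):
    """Median via the position rules: (n+1)/2 if n is odd, else n/2 and n/2+1."""
    data = sorted(values)                    # Step 1: sort ascending
    n = len(data)
    if n % 2 == 1:                           # Step 2: odd n, single middle value
        return data[(n + 1) // 2 - 1]        # convert 1-based position to 0-based index
    lower = data[n // 2 - 1]                 # Step 3: even n, two middle values
    upper = data[n // 2]
    return (lower + upper) / 2


odd_example = [2, 4, 4, 7, 9, 12, 100]
even_example = [3, 5, 5, 8, 10, 12]

odd_median = median_by_position(odd_example)
even_median = median_by_position(even_example)
odd_mean = statistics.fmean(odd_example)

# multimode returns every value tied for the highest frequency.
unimodal = statistics.multimode([2, 3, 3, 4, 5])
bimodal = statistics.multimode([2, 3, 3, 5, 5, 7])

print(odd_median, round(odd_mean, 1), even_median, unimodal, bimodal)
```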

— Independent t-test: compares means of 2 independent groups (e.g., BP in drug vs placebo)
— Paired t-test: compares means of paired measurements (pre/post)
— ANOVA: compares means across ≥3 groups
— Pearson correlation: linear association between two continuous normal variables
— Linear regression: predicts continuous outcome from predictors (assumes normal residuals)
— Wilcoxon rank-sum (Mann-Whitney U): nonparametric analog of independent t-test → compares medians/distributions
— Wilcoxon signed-rank: nonparametric analog of paired t-test
— Kruskal-Wallis: nonparametric ANOVA
— Spearman rank correlation: monotonic association, robust to skew
— Sign test: simplest paired comparison
— Continuous + symmetric/normal + n adequate → parametric
— Continuous + skewed + small n → nonparametric or log-transform then parametric
— Ordinal (Likert, NYHA, pain scale) → nonparametric
— Categorical → chi-square or Fisher exact (separate family)
— "Mean (SD)" → assumes normality
— "Median (IQR)" or "Median [Q1, Q3]" → signals skew or robust reporting
— "n (%)" → categorical
— Spotting these in Table 1 of a paper tells you what authors assumed about the data
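The selection rules above can be written as a small lookup. This is a sketch of the rules of thumb in these notes, not a substitute for a statistician's judgment; the function name and argument names are hypothetical, and paired designs with more than two groups are deliberately left out because the notes do not cover them.

```python
def suggest_summary_and_test(variable_type, skewed=False, groups=2, paired=False):
    """Return (summary statistic, candidate test) per the rules of thumb above."""
    if variable_type == "nominal":
        return ("n (%)", "chi-square or Fisher exact")
    if variable_type not in ("ordinal", "continuous"):
        raise ValueError("variable_type must be nominal, ordinal, or continuous")
    if groups < 2:
        raise ValueError("need at least 2 groups or measurements to compare")
    if paired and groups != 2:
        raise ValueError("paired designs with more than 2 groups are outside this sketch")

    # Ordinal data and skewed continuous data both go to rank-based methods.
    if variable_type == "ordinal" or skewed:
        summary = "median (IQR)"
        if paired:
            test = "Wilcoxon signed-rank"
        elif groups == 2:
            test = "Wilcoxon rank-sum (Mann-Whitney U)"
        else:
            test = "Kruskal-Wallis"
    else:
        summary = "mean (SD)"
        if paired:
            test = "paired t-test"
        elif groups == 2:
            test = "independent t-test"
        else:
            test = "ANOVA"
    return (summary, test)


print(suggest_summary_and_test("continuous", skewed=True))   # e.g., LOS in two units
print(suggest_summary_and_test("continuous", groups=3))      # e.g., BP across three arms
print(suggest_summary_and_test("ordinal", paired=True))      # e.g., pre/post pain score
```

For skewed continuous data the notes also allow log-transforming and then using the parametric branch; the sketch returns only the rank-based option to keep the mapping simple.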

— Statistical definition: value beyond 1.5×IQR from Q1 or Q3 (boxplot rule), or beyond 3 SD from mean (z-score rule, only valid if data are normal)
— Clinical outliers may be real biological extremes (e.g., genuine super-responder), measurement errors, or data-entry mistakes
— Never delete outliers solely because they are extreme — investigate first; document any exclusions transparently
— Mean: highly sensitive (a single extreme value can shift it dramatically)
— Median: robust (changes only if outlier crosses the middle position)
— Mode: completely unaffected unless the outlier is duplicated
— SD: highly sensitive; IQR: robust
— With n < ~15, normality is hard to verify; default to nonparametric methods or report the full data
— Mean and median can diverge substantially in small samples even from a truly symmetric population (sampling variability)
— Patients who haven't experienced the event by study end are "right-censored"
— Mean survival cannot be calculated until all patients have events
— Median survival is reportable as soon as 50% of patients have had the event — the standard oncology metric
— Skewed lab values (ferritin, TSH, IgE) often use percentile-based reference intervals rather than mean ± 2 SD
— Pediatric growth charts use percentiles (median, 5th, 95th) precisely because growth is mildly skewed and percentiles communicate clinical meaning
— Drug clearance in CKD/cirrhosis populations is often right-skewed with long tails of slow clearance
— Median clearance + IQR better guides starting doses than mean ± SD
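The point about right-censoring and median survival can be made concrete with a bare-bones Kaplan-Meier estimate. This is a teaching sketch under stated assumptions: the function name and the follow-up data are hypothetical, patients censored at a given time are counted as still at risk for events at that same time, and no confidence interval is computed.

```python
from fractions import Fraction


def km_median_survival(times, events):
    """Smallest time at which the Kaplan-Meier survival estimate is <= 0.5.

    times  : follow-up time for each patient
    events : True if the event was observed, False if right-censored
    Returns None when the estimate never reaches 0.5 (median not reached).
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = Fraction(1)            # exact arithmetic avoids float edge cases at 0.5
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= Fraction(at_risk - deaths, at_risk)
            if survival <= Fraction(1, 2):
                return t
        at_risk -= leaving            # events and censored patients both leave the risk set
    return None


# Hypothetical follow-up in months; False marks a right-censored patient.
months = [3, 5, 8, 12, 14, 20, 24, 30]
observed = [True, True, False, True, True, False, True, False]
median_os = km_median_survival(months, observed)

# A cohort where most patients are censored: the median is not yet reached.
not_reached = km_median_survival([2, 4, 6, 8], [True, False, False, False])
print(median_os, not_reached)
```

The median is estimable here even though three patients never had the event, whereas a plain arithmetic mean of the follow-up times would treat censored times as if they were event times.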

— Height, weight, head circumference, BMI plotted on percentile curves — the median (50th percentile) is the reference
— "Failure to thrive" defined by percentile crossing, not deviation from mean
— Birth weight distribution is mildly left-skewed (tail of very-low-birth-weight infants)
— Strongly left-skewed: peak at 39–40 weeks, tail of preterm births
— Median > mean in this case
— Reporting mean GA misrepresents typical delivery
— Ordinal (0–10), typically left-skewed in healthy newborns (most score 8–10)
— Report median, not mean
— Household income is the textbook right-skewed distribution
— Median household income is the standard reported metric (US Census); mean would be inflated by billionaires
— Social determinants of health data (income, education years, housing cost burden) → report medians
— Parasite egg counts, viral loads, CD4 counts in untreated HIV — strongly right-skewed
— Geometric mean or median used in surveillance reports
— Antibody titers are approximately log-normally distributed
— Geometric mean titer (GMT) is the standard reported metric — never arithmetic mean
— Seroconversion rates (proportions) reported separately
— Bimodal for several diseases (Hodgkin lymphoma, IBD, certain leukemias) — mean and median both misleading; report distribution or modes

— "Average household income in this neighborhood is $180,000" when median is $55,000 → masks economic reality, misguides resource allocation
— "Mean LOS = 12 days" when median = 4 → suggests the hospital is inefficient when actually a few outliers drive the number
— t-test on cost data with extreme right skew → inflated variance, reduced power, potentially false negatives
— Solution: log-transform or use Wilcoxon
— The mean ± 2 SD rule assumes a roughly symmetric (normal) distribution; applied to skewed data it can produce negative lower bounds for variables that can't be negative (LOS, cost)
— If your "mean − 2 SD" reference range goes below zero, the data are skewed
— Drug A mean LOS = 8 days, Drug B mean LOS = 12 days → looks like Drug A is better
— But Drug B's mean was driven by one patient who stayed 60 days due to unrelated complications
— Median comparison: Drug A = 4 days, Drug B = 4 days → no real difference
— Deleting outliers without justification is data manipulation
— Sensitivity analyses (results with and without outliers) are the ethical approach
— Mean survival is undefined when censored data exist; reporting it falsely implies all patients have died
— Mode of "presenting complaint" may be "headache," but the diagnostically critical complaint may be the less common "thunderclap onset"
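The "mean − 2 SD goes below zero" warning sign above is easy to demonstrate. The length-of-stay values below are hypothetical, and the percentile interval uses the standard library's "inclusive" quantile method as one reasonable convention.

```python
import statistics

# Hypothetical length of stay in days (illustrative only); none can be negative.
los_days = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 9, 14, 45]

mean = statistics.fmean(los_days)
sd = statistics.stdev(los_days)
naive_lower = mean - 2 * sd
naive_upper = mean + 2 * sd

# Percentile-based interval: cut points at 2.5% steps, take the first and last.
cuts = statistics.quantiles(los_days, n=40, method="inclusive")
pct_low, pct_high = cuts[0], cuts[-1]

print(f"mean +/- 2 SD: {naive_lower:.1f} to {naive_upper:.1f} days")
print(f"2.5th-97.5th percentile: {pct_low:.1f} to {pct_high:.1f} days")
print(f"median: {statistics.median(los_days)} days")
```

The symmetric interval dips below zero days, which is impossible for this variable, while the percentile interval stays within the observed range.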

— Highly skewed data that resist simple log transformation
— Multiple outliers without obvious data-entry explanation
— Censored data (survival analyses)
— Repeated measures with non-normal residuals
— Small sample sizes (n < 30) where CLT cannot be relied upon
— Mixed-distribution data (e.g., zero-inflated cost data)
— IRB review for any research using patient data
— Pre-specified analysis plan in protocols to prevent p-hacking via post-hoc choice of mean vs median
— Transparent reporting per CONSORT, STROBE, or PRISMA guidelines requires disclosure of how central tendency was chosen
— Journals increasingly require both mean (SD) and median (IQR) for continuous variables
— Reviewers should flag mean reporting for obviously skewed variables (LOS, cost, time)
— EHR-based dashboards that display "average" metrics may mislead clinicians and administrators
— Push for median + IQR + percentile displays for time-based and cost-based metrics
— Total resource calculation: mean LOS × number of patients = total bed-days (useful for budgeting)
— Mean is the right tool when summed totals matter; median is the right tool when typical patient experience matters
— Pull the data
— Plot the distribution (histogram)
— Report median + IQR + 90th percentile
— Investigate the long tail for system failures
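The reporting step in the workflow above (median + IQR + 90th percentile) could look like the following sketch. The helper name `qi_summary` and the door-to-balloon times are hypothetical; a histogram would normally come first, and is omitted here to keep the example dependency-free.

```python
import statistics


def qi_summary(minutes):
    """Median, IQR bounds, 90th percentile, and mean for a time-based metric."""
    q1, median, q3 = statistics.quantiles(minutes, n=4, method="inclusive")
    deciles = statistics.quantiles(minutes, n=10, method="inclusive")
    return {
        "median": median,
        "iqr": (q1, q3),
        "p90": deciles[8],            # ninth decile = 90th percentile
        "mean": statistics.fmean(minutes),
    }


# Hypothetical door-to-balloon times in minutes (illustrative only).
door_to_balloon = [48, 52, 55, 58, 60, 62, 65, 70, 75, 88, 140]
report = qi_summary(door_to_balloon)
print(report)

# Cases above the 90th percentile are the long tail worth investigating.
long_tail = [m for m in door_to_balloon if m > report["p90"]]
print(long_tail)
```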

— Range: max − min; most sensitive to outliers, least informative
— IQR (Q3 − Q1): middle 50% spread; robust to outliers; pairs with median
— SD: root-mean-square deviation from mean; pairs with mean; assumes symmetry
— Variance: SD²; expressed in squared units, so it is harder to interpret than SD, which is in the original units
— Percentile: value below which a given % of observations fall (median = 50th percentile)
— Quartiles: 25th (Q1), 50th (Q2 = median), 75th (Q3)
— Quintiles: 20th, 40th, 60th, 80th — used in socioeconomic stratification
— Z-score: (value − mean)/SD; can be computed for any data, but converting it to a percentile is valid only for approximately normal data
— Percentile: rank-based; robust to distribution shape
— Pediatric growth uses percentiles, not z-scores, in clinical practice (though research uses both)
— Skewness: asymmetry (left vs right tail)
— Kurtosis: tail heaviness (peaked vs flat)
— Both deviate from normal in different ways
— For discrete data: mode = single most common value
— For continuous data: modal class = histogram bin with the highest frequency
— Trimmed mean: drop top and bottom X% and average the rest
— Compromise between mean (efficiency under normality) and median (robustness)
— Used in Olympic scoring (drop highest and lowest judges)
— Weighted mean: each value is multiplied by a weight (e.g., sample size in meta-analysis), and the weighted sum is divided by the total weight
— Used in pooled estimates (random-effects and fixed-effects meta-analysis)
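The trimmed mean and weighted mean described above can be sketched as follows. Function names and data are hypothetical; the weights are sample sizes to match the example in the notes, although published meta-analyses commonly use inverse-variance weights instead.

```python
import statistics


def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of values (each side), average the rest."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)   # number of values dropped from each end
    kept = data[k:len(data) - k]
    return statistics.fmean(kept)


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical length-of-stay values in days (illustrative only).
los_days = [1, 2, 2, 3, 3, 3, 4, 5, 6, 35]
plain = statistics.fmean(los_days)
trimmed = trimmed_mean(los_days, 0.10)     # drops one value from each end
middle = statistics.median(los_days)

# Hypothetical study means pooled by sample size.
pooled = weighted_mean([5.0, 6.0, 8.0], [100, 50, 50])
print(plain, trimmed, middle, pooled)
```

In this sample the 10% trimmed mean lands between the median and the plain mean, which is the compromise the notes describe.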

— Central tendency (mean, median, mode) describes "center"
— Spread (SD, IQR, range) describes "scatter"
— Shape (skewness, kurtosis) describes asymmetry and tails
— All three are needed to characterize a distribution
— Sample mean (x̄) estimates population mean (μ)
— Sample SD (s) estimates population SD (σ)
— Standard error (SEM) = SD/√n — describes precision of the sample mean, NOT spread of data
— Confusing SD with SEM is a classic Step 3 trap: SEM is always smaller than SD (for n > 1) and shrinks with n
— 95% CI of mean: range likely to contain the true mean (based on SEM)
— 95% reference range: range containing 95% of individual values (based on SD or percentiles)
— These are commonly confused; reference range >> CI for the same data
— Incidence: new cases per person-time
— Prevalence: existing cases at a point in time
— Duration of disease shifts the prevalence-to-incidence ratio; chronic diseases have prevalence >> incidence
— In RCTs with normal outcomes, report mean difference (95% CI)
— In RCTs with skewed outcomes (LOS, cost, time-to-event), report median difference or hazard ratio
— Effect size measures (Cohen's d, Hedges' g) assume normality
— Pearson r assumes bivariate normality
— Spearman ρ uses ranks, robust to skew
— Regression on skewed outcomes may need log-transformation or generalized linear models
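The SD versus SEM and confidence interval versus reference range distinctions above can be checked numerically. This sketch assumes approximately normal data and uses the 1.96 normal critical value; a small sample like this one would more properly use a t critical value. The function name and the hemoglobin values are hypothetical.

```python
import math
import statistics


def precision_vs_spread(values):
    """Contrast spread of individual values (SD) with precision of the mean (SEM)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)          # sample SD, n - 1 denominator
    sem = sd / math.sqrt(n)
    z = 1.96                               # normal approximation (assumption)
    return {
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "ci95_mean": (mean - z * sem, mean + z * sem),    # where the true mean likely lies
        "ref_range95": (mean - z * sd, mean + z * sd),    # where ~95% of individuals lie
    }


# Hypothetical hemoglobin values in g/dL (illustrative only).
hemoglobin = [13.1, 13.8, 14.0, 14.2, 14.5, 14.7, 15.0, 15.2, 15.6, 16.1]
result = precision_vs_spread(hemoglobin)

ci_width = result["ci95_mean"][1] - result["ci95_mean"][0]
ref_width = result["ref_range95"][1] - result["ref_range95"][0]
print(result, ci_width, ref_width)
```

The reference range is wider than the confidence interval by a factor of √n, which is why the two should not be read interchangeably.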

— For continuous variables, always inspect distribution before choosing summary statistic
— Report median (IQR) for time, cost, biomarkers with wide ranges, and any clearly skewed variable
— Report mean (SD) only after verifying approximate symmetry
— Report mode (with frequencies) for nominal categorical variables; report median for ordinal variables
— Check Table 1 of any clinical paper: are continuous variables described by mean (SD), median (IQR), or both?
— If only mean (SD) is reported for cost, LOS, or time variables → potential misrepresentation
— Look for histograms, boxplots, or distribution plots in the supplement
— Advocate for median + IQR + 90th percentile displays for time-based metrics
— Push for percentile-based goals (e.g., "90% of patients receive antibiotics within 60 minutes") rather than mean-based goals
— "Median survival" is more honest and clinically meaningful than "average survival"
— Use percentiles for growth, BP, weight discussions ("you're in the 75th percentile for...")
— Avoid "average" language for highly variable outcomes
— Lab reference ranges should use percentile-based limits for skewed analytes
— Trend graphs should show patient values against percentile bands, not mean ± SD
— Pre-specify analysis methods (mean vs median; parametric vs nonparametric) in study protocols
— Power calculations for skewed outcomes often require simulation or assume transformation
— Most physicians underestimate how often biomedical data are skewed
— Annual biostatistics refresher embedded in QI and journal club is high-yield

— Track median and IQR of process measures over time (run charts, control charts)
— A widening IQR suggests increasing variability — investigate causes
— A shifting median over months signals a real process change, separable from outlier-driven noise
— Re-examine distributions at each follow-up wave; treatment or aging may change skewness
— Example: cholesterol distributions shift left with widespread statin use; what was once right-skewed becomes more symmetric
— Pre-specify primary analysis (parametric vs nonparametric) in the SAP
— Conduct sensitivity analyses with alternative methods
— Report both mean and median for transparency
— Annual reports of median household income, median home value, median life expectancy — track inequality through gap between mean and median (or between top and bottom quintiles)
— Widening mean-median gap signals growing inequality
— Patient-reported outcomes (pain, fatigue, function) are ordinal — track median trajectory, not mean
— Lab trends: when comparing serial values, use the patient's own baseline distribution, not population mean
— Explain "median survival" honestly: half of patients live longer, half shorter; your individual outcome cannot be predicted from the median alone
— Use percentile language for growth and developmental milestones
— When transferring care or writing referral letters with quantitative data, specify which summary statistic is reported
— "Cohort median A1c 7.2%" tells a different story than "cohort mean A1c 7.8%"

— Quoting "average survival" for a cancer with highly skewed survival can be misleading
— Ethically, clinicians should disclose median survival plus the range or percentiles (e.g., "median 14 months, with 10% of patients alive at 5 years")
— Patient autonomy requires accurate information; framing skewed data as a single "average" undermines autonomous decision-making
— Selectively reporting mean vs median based on which yields a "better" or significant p-value = p-hacking
— Pre-registered analysis plans (e.g., on ClinicalTrials.gov) prevent this
— IRB protocols should specify summary statistics in advance
— Deleting outliers without documented justification can constitute research misconduct
— Reporting only the mean for income, health outcomes, or access metrics masks disparities at the tails
— Median + percentile gaps (e.g., 10th vs 90th percentile life expectancy by ZIP code) reveal inequity that means hide
— Ethically, public health reporting should include distributional measures, not just averages
— Handoffs that summarize a patient's recent course with "average vital signs" can hide dangerous outlier episodes
— Safer practice: communicate range, trend, and any outlier events (e.g., "BP median 130/80, but one episode of 200/110 yesterday")
— Missed outlier events in handoffs are a documented patient safety hazard
— CMS publicly reports hospital metrics; risk-adjusted medians are fairer to tertiary centers than means
— Hospitals serving sicker populations are unfairly penalized by mean-based comparisons
— Outbreak surveillance uses incidence rates, not "average cases" — Step 3 reportable disease questions require count-based and rate-based metrics, not means

— Right skew → mode < median < mean → tail right → use median
— Left skew → mean < median < mode → tail left → use median
— Symmetric (unimodal) → mean ≈ median ≈ mode → use mean
— Bimodal → two modes → mean and median may sit in valley → report distribution
— Hospital length of stay
— Healthcare costs, charges, expenditures
— Wait times (ED, OR, clinic)
— Viral load (HIV, HBV, HCV before treatment)
— Antibody titers (use geometric mean)
— CRP, ferritin, IgE, troponin
— Drug half-lives across populations
— Time to event (survival, time-to-readmission)
— Household income
— Parasite egg counts
— Age at death in developed countries
— Gestational age at delivery (healthy cohort)
— Apgar scores in healthy newborns
— Performance on easy exams (ceiling effect)
— Adult height
— Birth weight (mild left skew but often treated as normal)
— Systolic BP in healthy young adults
— Hemoglobin in healthy adults
— IQ scores (by design)
— Age of onset: Hodgkin lymphoma, IBD, some leukemias
— Anything mixing two populations (men + women heights, treated + untreated patients)
— Mean ↔ SD ↔ t-test, ANOVA, Pearson r ↔ normal distribution
— Median ↔ IQR ↔ Wilcoxon, Kruskal-Wallis, Spearman ρ ↔ skewed data
— Mode ↔ frequencies/proportions ↔ chi-square, Fisher exact ↔ categorical
— Right-skewed positive data → log transform → often normal
— Proportions → logit or arcsine transform
— Counts → square-root transform
— Antibody titers → geometric mean titer (GMT)
— Survival → median survival + Kaplan-Meier curve
— Cost → median (IQR) + sometimes mean for total resource estimates
— Growth → percentiles

— Stem: gives data on hospital LOS, cost, or time with an obvious outlier
— Answer: median
— Distractors: mean (sensitive to outlier), mode (not informative for continuous data), range (not central tendency)
— Stem: mean > median by a lot, or histogram with tail to the right described
— Answer: right (positive) skew
— Trap: students reverse direction because tail vs bulk confusion
— Stem: compares two groups on a skewed outcome (LOS, cost)
— Answer: Wilcoxon rank-sum (Mann-Whitney U) or log-transform then t-test
— Distractors: t-test (assumes normality), chi-square (categorical), Pearson (correlation)
— Stem: pain scale, NYHA class, Likert satisfaction scores
— Answer: median
— Distractors: mean (assumes equal intervals), mode (only if specifically asked for "most common")
— Stem: gives a small dataset, then adds an extreme value
— Answer: mean changes substantially, median barely changes, mode unchanged
— Stem: describes a symmetric but bimodal distribution
— Answer: NOT normal; cannot apply t-test directly
— Stem: oncology trial with censored data
— Answer: report median overall survival, not mean
— Stem: ferritin, CRP, viral load reference range
— Answer: use percentile-based reference (2.5th–97.5th), not mean ± 2 SD
— Stem: door-to-balloon time, sepsis bundle compliance
— Answer: median + 90th percentile, not mean
— Stem: vaccine immunogenicity, antibody titers
— Answer: geometric mean titer (GMT)
— Stem: household income across ZIP codes, healthcare spending distribution
— Answer: median + percentile gap, not mean

In skewed distributions, the mean is pulled toward the tail while the median stays at the rank-based center and the mode marks the peak — so right-skewed data follow mode < median < mean, left-skewed data follow mean < median < mode, and the median (with IQR) is the preferred summary for skewed, ordinal, or outlier-prone biomedical data.
— Direction of skew is named for the tail: right skew has a long right tail, mean > median; left skew has a long left tail, mean < median
— Default to median (IQR) for hospital LOS, healthcare cost, wait times, biomarkers (CRP, ferritin, viral load), survival times, antibody titers, and any ordinal scale (pain, NYHA, Apgar)
— Use mean (SD) only for approximately symmetric continuous data (height, BP in healthy adults, hemoglobin); use mode for categorical/nominal data
— Match test to data: parametric (t-test, ANOVA, Pearson) for normal; nonparametric (Wilcoxon, Kruskal-Wallis, Spearman) for skewed; consider log transformation for right-skewed positive data, geometric mean for antibody titers, and Kaplan-Meier with median survival for time-to-event data

