Biostatistics & Population Health
Chi-square test: when to use and interpretation
— Both exposure and outcome are categorical (nominal or ordinal collapsed to categories)
— Data are presented as counts in a contingency table (2×2, 2×3, R×C), not means or medians
— Question asks: "Is there an association between X and Y?" where X and Y are groupings (smoker vs nonsmoker; cured vs not cured; drug A vs B vs C)
— Chi-square test of independence: two variables in one sample (e.g., smoking status × lung cancer status)
— Chi-square test of homogeneity: same categorical variable across multiple populations (e.g., complication rate across 3 hospitals)
— Chi-square goodness-of-fit: observed distribution vs theoretical/expected (e.g., does observed allele frequency match Hardy-Weinberg)
— Comparing proportions cured between treatment arms
— Adverse event rates across drug groups in an RCT
— Screening test uptake by demographic category
— Categorical quality improvement outcomes (readmitted yes/no by unit)
Board pearl: If you see a contingency table of counts and the question stem asks whether two categorical variables are associated or whether proportions differ between groups, the answer is chi-square (or Fisher exact if expected cell counts are small). Means → t-test/ANOVA; counts/proportions → chi-square.

— "Proportion of patients who…"
— "Percentage with the outcome…"
— "Number cured vs not cured…"
— "Compared the rate of [binary event] between groups"
— Investigators enroll patients with condition X
— Randomize or stratify into 2 or more discrete groups (drug A vs placebo; intervention vs usual care)
— Outcome measured as yes/no, cured/not cured, alive/dead, readmitted/not
— Results presented in a 2×2 or R×C table
— If outcome is blood pressure in mmHg across 2 groups → t-test, not chi-square
— If outcome is HbA1c across 3 drug arms → ANOVA
— If two continuous variables compared (LDL vs BMI) → correlation/regression
— If time-to-event with censoring → log-rank / Kaplan-Meier, not chi-square
— Independent samples required for standard chi-square; if data are paired (before/after same patient, matched case-control on binary outcome), use McNemar test
— Small samples (any expected cell count < 5 in a 2×2) → use Fisher exact test
— If the categorical outcome is ordered (mild/moderate/severe) and you want to detect a trend → chi-square test for trend (Cochran-Armitage) is more powerful than generic chi-square
Key distinction: Chi-square assumes independent observations and adequate expected counts. The two most common Step 3 traps are (1) confusing paired data (use McNemar) with independent data, and (2) ignoring the expected cell count < 5 rule, which mandates Fisher exact in small 2×2 tables.

• Every chi-square question hinges on reading the contingency table correctly. Standard 2×2 layout:
| | Outcome + | Outcome − | Row total |
| Exposure + | a | b | a+b |
| Exposure − | c | d | c+d |
| Column total | a+c | b+d | N |
• Observed (O) values are the cell counts a, b, c, d
• Expected (E) under the null (no association) for each cell:
— E = (row total × column total) / grand total
— Example: E for cell a = (a+b)(a+c)/N
• Chi-square statistic: χ² = Σ (O − E)² / E across all cells
• Degrees of freedom (df) for a contingency table:
— df = (rows − 1) × (columns − 1)
— 2×2 table → df = 1
— 2×3 table → df = 2
— 3×4 table → df = 6
• Critical values worth memorizing (α = 0.05):
— df = 1: χ² > 3.84 is significant
— df = 2: χ² > 5.99
— df = 3: χ² > 7.81
• Effect measures derivable from the same 2×2:
— Odds ratio (OR) = (a×d) / (b×c), used in case-control studies
— Relative risk (RR) = [a/(a+b)] / [c/(c+d)], used in cohort studies and RCTs
— Chi-square tells you whether there is an association; OR/RR tell you how strong and in which direction
Step 3 management (of test selection): Given a 2×2 table, first verify all expected counts ≥ 5. If yes → chi-square with df=1, reject null if χ² > 3.84 or p < 0.05. If any expected count < 5 → Fisher exact. If observations are paired → McNemar, using only the discordant pairs (b and c cells).
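As a rough illustration of the formulas in this section, the Python sketch below computes expected counts, χ², and df for any R×C table of counts, plus OR and RR for a 2×2. The function names and the example counts (30/70 vs 15/85) are hypothetical and chosen only for illustration.

```python
def chi_square(table):
    """Return (statistic, df, expected) for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E = (row total x column total) / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))


# Critical values at alpha = 0.05, as listed in the notes
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81}

# Hypothetical 2x2: exposed 30 with outcome / 70 without; unexposed 15 / 85
a, b, c, d = 30, 70, 15, 85
stat, df, expected = chi_square([[a, b], [c, d]])
significant = stat > CRITICAL_05[df]
print(f"chi2={stat:.2f}, df={df}, significant={significant}")
print(f"OR={odds_ratio(a, b, c, d):.2f}, RR={relative_risk(a, b, c, d):.2f}")
```

The same `chi_square` helper accepts larger tables (for example a 2×3), in which case df and the critical value change accordingly.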

— Step 1: Confirm outcome variable is categorical (not continuous). If continuous, exit to t-test/ANOVA/regression
— Step 2: Determine number of groups and whether independent
— Step 3: Check expected cell counts and pairing
— Step 4: Choose ordinal trend test if categories are ordered
— 2 independent groups, binary outcome, expected ≥ 5 → chi-square (2×2) or equivalently z-test for two proportions
— 2 independent groups, binary outcome, expected < 5 in any cell → Fisher exact test
— Paired/matched binary data (same person before/after; matched pairs) → McNemar test
— ≥3 independent groups, binary or nominal outcome → chi-square R×C
— ≥3 groups, ordinal outcome with hypothesized trend → Cochran-Armitage trend test
— Stratified 2×2 tables (controlling for confounder) → Cochran-Mantel-Haenszel
— Repeated measures on same subjects across ≥3 time points (binary) → Cochran Q
— Chi-square vs z-test for two proportions: mathematically equivalent for 2×2; either is acceptable
— Chi-square vs logistic regression: chi-square = bivariate association; logistic = adjusts for multiple covariates
— Chi-square vs t-test: t-test is for means of continuous data, never for proportions
Board pearl: When a stem mentions "adjusted for age, sex, and comorbidities" with a categorical outcome, the test is logistic regression, not chi-square. Chi-square is unadjusted; once confounders enter the analysis, you move to multivariable models. This is a frequent Step 3 distractor among the answer choices.
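The selection steps in this section can be sketched as a small decision function. This is only a mnemonic aid under stated assumptions: `choose_test` and its parameter names are hypothetical, and the branch order (adjustment, then stratification, then pairing, then group count, then expected counts) is one reasonable reading of the list above rather than a formal algorithm.

```python
def choose_test(outcome_categorical, groups=2, paired=False,
                min_expected=5.0, ordered_trend=False,
                stratified=False, adjusted=False):
    """Map a stem's data structure to a test name (hypothetical helper).

    groups       : number of groups, or of repeated measurements if paired
    min_expected : smallest expected cell count in the table
    """
    # Step 1: continuous outcomes leave the chi-square family entirely
    if not outcome_categorical:
        return "t-test / ANOVA / regression"
    # Multiple covariates -> multivariable model
    if adjusted:
        return "logistic regression"
    # Single stratifying confounder -> pooled stratified analysis
    if stratified:
        return "Cochran-Mantel-Haenszel"
    # Steps 2-3: pairing
    if paired:
        return "McNemar test" if groups == 2 else "Cochran Q"
    # Step 4: three or more independent groups
    if groups >= 3:
        if ordered_trend:
            return "Cochran-Armitage trend test"
        return "chi-square (R x C)"
    # Two independent groups: check expected counts
    if min_expected < 5:
        return "Fisher exact test"
    return "chi-square (2 x 2)"


print(choose_test(True, groups=2, min_expected=3.5))   # small cells
print(choose_test(True, groups=2, paired=True))        # before/after
print(choose_test(True, groups=3, ordered_trend=True)) # ordered arms
```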

• Worked example: RCT of new antibiotic vs placebo for cellulitis cure at 7 days
| | Cured | Not cured | Total |
| Drug | 80 | 20 | 100 |
| Placebo | 60 | 40 | 100 |
| Total | 140 | 60 | 200 |
• Compute expected values (row total × column total / N):
— E(Drug, Cured) = (100 × 140)/200 = 70
— E(Drug, Not cured) = (100 × 60)/200 = 30
— E(Placebo, Cured) = (100 × 140)/200 = 70
— E(Placebo, Not cured) = (100 × 60)/200 = 30
• Compute χ² = Σ (O−E)²/E:
— (80−70)²/70 = 100/70 = 1.43
— (20−30)²/30 = 100/30 = 3.33
— (60−70)²/70 = 100/70 = 1.43
— (40−30)²/30 = 100/30 = 3.33
— Total χ² = 9.52
• Interpretation:
— df = (2−1)(2−1) = 1
— Critical value at α=0.05, df=1 = 3.84
— 9.52 > 3.84 → reject the null; p < 0.05 (actually p ≈ 0.002)
— Conclude: cure proportions differ significantly between drug and placebo
• Pair with effect size:
— Absolute risk reduction = 80% − 60% = 20%
— Number needed to treat = 1/0.20 = 5
— Relative risk = 0.80/0.60 = 1.33 (33% relative increase in cure)
Board pearl: Knowing χ²crit = 3.84 at df=1 lets you eyeball whether a 2×2 result is significant without a calculator. If the question gives χ² and df, compare directly. The exam rarely demands the full calculation; usually it tests interpretation of a given χ² and p-value.
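The arithmetic in the worked example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the p-value line relies on the identity that, for df = 1 only, the upper-tail χ² probability equals erfc(√(χ²/2)). Variable names are illustrative.

```python
import math

# Observed counts from the worked example: rows = drug, placebo;
# columns = cured, not cured
observed = [[80, 20], [60, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected count under the null
        chi2 += (o - e) ** 2 / e

# Upper-tail p-value; this shortcut is valid for df = 1 only
p_value = math.erfc(math.sqrt(chi2 / 2))

cure_drug = 80 / 100
cure_placebo = 60 / 100
arr = cure_drug - cure_placebo   # absolute difference in cure proportion
nnt = 1 / arr                    # number needed to treat
rr = cure_drug / cure_placebo    # relative risk (of cure)

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print(f"ARR = {arr:.0%}, NNT = {nnt:.0f}, RR = {rr:.2f}")
```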

— Probability of observing the data (or more extreme) assuming the null hypothesis is true
— NOT the probability that the null is true; NOT the probability the result is due to chance alone
— p < 0.05 → "statistically significant"; reject null
— p ≥ 0.05 → fail to reject; does not prove the null
— "The probability the drug doesn't work is 5%" → wrong
— "There is a 95% chance the drug works" → wrong
— "The result will be replicated 95% of the time" → wrong
— Correct: "If the drug truly had no effect, we'd see results this extreme < 5% of the time"
— Large N can render trivial differences statistically significant
— A χ² with p = 0.001 but ARR of 0.5% may not warrant practice change
— Always pair p-value with effect size (RR, OR, ARR, NNT) and 95% CI
— A 95% CI for the risk difference or OR that excludes 0 (for differences) or 1 (for ratios) implies p < 0.05
— CI gives precision; the chi-square p-value supports only a binary reject/fail-to-reject decision
Key distinction: Chi-square answers "is there an association?" — it is a hypothesis test. Odds ratio, relative risk, and risk difference (with confidence intervals) answer "how strong is the association?" — these are effect measures. Step 3 questions love to ask which is appropriate when a clinician needs to counsel a patient on magnitude of benefit: the answer is the effect estimate, not the p-value.
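To make the point about large samples concrete, the sketch below runs the same 2×2 chi-square on one fixed 0.5-percentage-point difference (10.5% vs 10.0%) at two sample sizes. All counts are hypothetical; the helper name `two_by_two_p` is illustrative, and the p-value uses the df = 1 identity p = erfc(√(χ²/2)).

```python
import math


def two_by_two_p(events_a, n_a, events_b, n_b):
    """Chi-square statistic and p-value (df = 1) for two independent arms."""
    table = [[events_a, n_a - events_a], [events_b, n_b - events_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(2)
        for j in range(2)
    )
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Same absolute difference (10.5% vs 10.0%), very different sample sizes
chi2_small, p_small = two_by_two_p(105, 1_000, 100, 1_000)
chi2_large, p_large = two_by_two_p(21_000, 200_000, 20_000, 200_000)

print(f"n=1,000 per arm:   chi2={chi2_small:.2f}, p={p_small:.3f}")
print(f"n=200,000 per arm: chi2={chi2_large:.2f}, p={p_large:.2e}")
```

The effect size is identical in both runs; only the p-value changes, which is why the notes insist on reporting ARR or NNT alongside it.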

• Core assumptions (memorize for the recognition stem):
— Independence of observations: each subject contributes to only one cell
— Mutually exclusive categories: no subject counted twice
— Random sampling from the population of interest
— Adequate expected frequencies: all expected counts ≥ 5 (more lenient: ≥80% of cells with E ≥ 5 and no cell with E < 1 in larger tables)
— Counts, not percentages or proportions, must be the input data
• Violations and their fixes:
— Paired data → McNemar test
— Small expected counts in 2×2 → Fisher exact test
— Small expected counts in R×C → Fisher-Freeman-Halton extension or collapse categories
— Repeated measures on same subjects → Cochran Q or GEE
— Need to adjust for confounders → logistic regression or Mantel-Haenszel stratified analysis
— Ordered categories with trend → Cochran-Armitage trend test
• Common Step 3 trap (using percentages):
— A table presenting "60% vs 80% cure" without underlying N cannot be tested; you need the actual counts
— If only percentages are given, the question often expects you to identify that insufficient data are presented for a valid chi-square
• Continuity correction:
— Yates correction subtracts 0.5 from |O−E| before squaring; reduces type I error in small 2×2 samples
— Now considered overly conservative; many modern texts skip it
— Step 3 rarely tests the math but may mention it as a refinement
Board pearl: When any expected cell count is < 5, switch to Fisher exact test. This shows up most often in small RCTs, rare adverse events, and pilot studies. The exam loves a 2×2 with cells like 1, 2, 3, 4; that is a Fisher exact stem, not chi-square.
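A short sketch of two ideas from this section: checking the expected-count rule and applying the Yates continuity correction. Function names are hypothetical. Clamping |O−E| − 0.5 at zero is an added guard (an assumption of this sketch, not something stated in the notes) so the correction cannot overshoot when |O−E| is below 0.5.

```python
def expected_counts(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def expected_counts_adequate(table):
    """2x2: every E >= 5. Larger tables: >= 80% of cells E >= 5, none < 1."""
    cells = [e for row in expected_counts(table) for e in row]
    if len(table) == 2 and len(table[0]) == 2:
        return min(cells) >= 5
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return share_at_least_5 >= 0.8 and min(cells) >= 1


def chi_square_stat(table, yates=False):
    """Pearson chi-square; with yates=True subtract 0.5 from |O - E| first."""
    stat = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)  # guard: never go below zero
            stat += diff ** 2 / e
    return stat


trial = [[80, 20], [60, 40]]   # drug vs placebo, cured vs not cured
sparse = [[1, 9], [0, 10]]     # rare adverse event, 10 patients per arm

print(expected_counts_adequate(trial), expected_counts_adequate(sparse))
print(f"uncorrected = {chi_square_stat(trial):.2f}, "
      f"Yates = {chi_square_stat(trial, yates=True):.2f}")
```

On the 80/20 vs 60/40 table the corrected statistic is smaller than the uncorrected one, which is the conservative behavior the notes describe.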

• McNemar test (paired binary data):
— Used for pre/post on the same subject or matched pairs
— Examples: pre/post-intervention symptom status; matched case-control with binary exposure; diagnostic test comparison (test A vs test B on same patients)
— Uses only discordant pairs (b and c in the paired 2×2): χ² = (b−c)²/(b+c), df=1
— Concordant pairs (both yes or both no) contribute no information about change
• Fisher exact test:
— Used when expected counts < 5 in 2×2 tables
— Calculates exact probability rather than approximating with χ² distribution
— Common in small surgical series, rare disease studies, pilot trials
• Cochran-Mantel-Haenszel (CMH) test:
— Tests association between exposure and outcome adjusting for a stratifying variable (confounder)
— Produces a pooled OR across strata
— Example: smoking → lung cancer, stratified by age decade
— If strata are heterogeneous (Breslow-Day test significant), a single pooled OR is not appropriate; report stratum-specific estimates or use logistic regression with an interaction term
• Cochran-Armitage trend test:
— Detects linear trend across ordered categorical exposure levels (e.g., never/former/current smoker → MI risk)
— More powerful than generic R×C chi-square when trend is plausible
• Chi-square goodness-of-fit:
— Compares one sample's distribution to a theoretical one
— Hardy-Weinberg equilibrium, Mendelian ratios, expected demographic distributions
Step 3 management: Identify the data structure first. Paired = McNemar; small cells = Fisher; stratified = Mantel-Haenszel; ordered trend = Cochran-Armitage; goodness-of-fit to expected distribution = standard χ² goodness-of-fit. Memorize this five-item map and most biostatistics test-selection stems become trivial.
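For the paired case, the McNemar statistic uses only the two discordant cells. The sketch below applies χ² = (b−c)²/(b+c) to hypothetical counts (15 patients who changed in one direction, 5 in the other) and converts it to a p-value with the df = 1 identity p = erfc(√(χ²/2)). The function name is illustrative, and no continuity correction is applied.

```python
import math


def mcnemar(b, c):
    """Uncorrected McNemar chi-square and p-value from discordant pairs.

    b = pairs positive before and negative after
    c = pairs negative before and positive after
    """
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # df = 1 only
    return stat, p_value


# Hypothetical pre/post study: concordant pairs are ignored entirely
stat, p_value = mcnemar(15, 5)
print(f"McNemar chi2 = {stat:.2f}, p = {p_value:.3f}")
```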

— The test approximates the discrete sampling distribution of the χ² statistic with a continuous χ² distribution; with small N, expected counts drop and the approximation becomes unreliable, so the nominal type I error rate no longer holds
— A "significant" χ² in a study with N = 20 may be a statistical artifact
— Any expected (not observed) cell count < 5 in a 2×2 table
— In larger R×C tables: any expected < 1, or > 20% of cells with expected < 5
— Fisher exact test — preferred for small 2×2; computationally exact
— Collapse categories — combine sparse columns (e.g., merge "severe" and "very severe") to boost expected counts; must be clinically justifiable, not data-dredged
— Exact tests for R×C — Fisher-Freeman-Halton extension
— Bayesian methods — used in modern small-trial design but rarely on Step 3
— Phase 1/2 trials with 20-40 patients
— Surgical morbidity reviews of uncommon procedures
— Outbreak investigations of small clusters
— Pediatric rare disease cohorts
— Power affects type II error (missing a real effect)
— Small expected cell counts affect type I error validity of the test itself
— Both can co-exist: a tiny study may use a valid Fisher test but still miss real effects
Board pearl: A stem showing a 2×2 like (1, 9 / 0, 10) — say, 0 vs 1 adverse event in two arms of 10 patients — is a Fisher exact question, not chi-square. The answer choice "chi-square test" is the distractor; recognizing expected counts < 5 is the testing point.
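For intuition about what "exact" means here, this sketch computes a two-sided Fisher exact p-value directly from the hypergeometric distribution with the margins held fixed. The function name is hypothetical, and the two-sided rule used (sum the probabilities of all tables no more likely than the observed one) is one common convention, stated here as an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    low = max(0, col1 - row2)
    high = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the one observed
    return sum(
        p for p in (prob(x) for x in range(low, high + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# The stem from the pearl: 1 vs 0 adverse events in two arms of 10
print(fisher_exact_two_sided(1, 9, 0, 10))
# A hypothetical small table with a strong imbalance
print(round(fisher_exact_two_sided(8, 2, 1, 9), 4))
```

With one event among twenty patients there are only two possible tables, each equally likely under the null, so the first p-value is 1: the data carry essentially no evidence either way.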

— A significant R×C χ² tells you at least one pair of proportions differs — not which
— Performing pairwise 2×2 chi-squares inflates type I error
— With k = 3 groups, 3 pairwise tests at α=0.05 → family-wise error ≈ 14%
— Bonferroni: divide α by number of comparisons (3 comparisons → α=0.0167 each); conservative
— Holm-Bonferroni: stepwise, less conservative
— Benjamini-Hochberg (FDR): controls false discovery rate; common in genomics
— A trial with overall p = 0.06 that finds "significant" benefit in women (p=0.04) is performing multiple comparisons
— Always interpret subgroup χ² results as hypothesis-generating, not confirmatory
— Pre-specified subgroups (in protocol) carry more weight than post-hoc ones
— Formal way to ask whether effect differs across strata
— Tested with an interaction term in regression, not a chi-square comparison of subgroup p-values
Key distinction: A significant R×C chi-square means "the groups differ somewhere"; it does not identify which groups. Follow-up pairwise comparisons require multiplicity adjustment. On Step 3, when a stem lists three drug arms with one p-value, recognize that the omnibus test is significant but pairwise conclusions require additional analysis with corrected α.
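The multiplicity arithmetic can be sketched as follows. The family-wise error formula 1 − (1 − α)^m assumes independent tests; the three p-values are hypothetical, and the function names are illustrative.

```python
def family_wise_error(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical pairwise p-values for arms A-B, A-C, B-C
pairwise_p = [0.010, 0.020, 0.040]

print(f"FWER for 3 tests: {family_wise_error(0.05, 3):.3f}")
print("Bonferroni:", bonferroni_reject(pairwise_p))
print("Holm:      ", holm_reject(pairwise_p))
```

On these illustrative values Bonferroni rejects only the smallest p-value while Holm rejects all three, which shows why Holm is described as less conservative.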

• Statistical vs clinical significance:
— Huge studies (N=50,000) can detect trivial differences
— A χ² with p<0.001 and ARR of 0.2% may not justify treatment
• Association is not causation:
— Chi-square shows two variables co-vary, nothing more
— Need study design (RCT > cohort > case-control) and Bradford Hill considerations to infer causation
• Simpson's paradox:
— An association in the aggregated 2×2 reverses direction when stratified by a third variable
— Classic UC Berkeley admissions example
— Mantel-Haenszel or logistic regression unmasks this
• Ecological fallacy:
— Group-level associations may not hold at the individual level
— Common in public health data
• Relative vs absolute risk:
— "50% relative reduction" sounds dramatic but may be 2% → 1% absolute
— Chi-square p-value gives no sense of magnitude
• Missing data:
— Subjects with missing outcome are dropped from chi-square; this can bias results if missingness is informative
— Intention-to-treat analysis, combined with a principled approach to missing outcomes (e.g., multiple imputation or sensitivity analysis), limits this in RCTs
• Reporting bias:
— Publication bias and selective subgroup reporting inflate apparent effects
Step 3 management: When a stem reports a "significant chi-square," reflexively ask:
— Was the outcome clinically important (NNT, ARR)?
— Are there confounders unaccounted for?
— Is this a subgroup or post-hoc finding?
— Is the CI wide, suggesting imprecision?
— Could chance, bias, or confounding still explain the result?
Statistical significance is necessary but not sufficient to change practice.
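The reversal that occurs when an aggregated 2×2 is stratified can be shown numerically. In the sketch below the counts are illustrative only, chosen so that treatment A looks worse in the pooled table yet better within each severity stratum. The pooled estimate uses the Mantel-Haenszel odds ratio, Σ(a·d/n) / Σ(b·c/n); the names are hypothetical.

```python
# Each stratum: (A success, A failure, B success, B failure); illustrative counts
strata = {
    "mild":   (81, 6, 234, 36),
    "severe": (192, 71, 55, 25),
}


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Pooled odds ratio across strata: sum(a*d/n) / sum(b*c/n)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Collapse the strata into one crude 2x2 by summing cell by cell
crude = tuple(sum(cells) for cells in zip(*strata.values()))

crude_or = odds_ratio(*crude)
mh_or = mantel_haenszel_or(strata.values())

for name, table in strata.items():
    print(f"{name}: OR = {odds_ratio(*table):.2f}")
print(f"crude OR = {crude_or:.2f}, Mantel-Haenszel OR = {mh_or:.2f}")
```

The crude odds ratio falls below 1 while both stratum-specific and the pooled Mantel-Haenszel odds ratios sit above 1, because treatment A was given disproportionately to the severe stratum.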

— Need to adjust for confounders (age, sex, comorbidities, baseline severity)
— Multiple simultaneous predictors of interest
— Interest in interaction effects between variables
— Time-varying covariates
— Repeated measures or clustered data (patients within hospitals)
— Binary outcome + multiple predictors → logistic regression (yields adjusted OR)
— Time-to-event outcome → Cox proportional hazards (yields HR)
— Count outcome → Poisson or negative binomial regression
— Continuous outcome + predictors → linear regression
— Clustered/repeated data → GEE or mixed-effects models
— Adjusted odds ratios with 95% CI
— Ability to handle continuous and categorical predictors together
— Tests for effect modification (interaction terms)
— Predicted probability for individual patients
— Observational study finds smoking → MI via χ² (p<0.001)
— But smokers are older, more hypertensive, more diabetic
— Logistic regression adjusting for these covariates estimates the association between smoking and MI independent of the measured confounders (unmeasured confounding can remain)
— Chi-square alone is inadequate for causal inference in observational data
CCS pearl: In an outpatient research-interpretation stem, when the question asks "what additional analysis is needed" after a significant unadjusted association, the correct upgrade is almost always multivariable logistic regression (for binary outcomes) or Cox regression (for time-to-event). Chi-square is the screening test; regression is the definitive analysis.

• z-test for two proportions:
— Mathematically equivalent to 2×2 chi-square
— Yields a z-score; z² = χ² (without continuity correction)
— Use either; results identical
• Fisher exact test:
— For small expected counts
— Computes exact p-value via hypergeometric distribution
— More conservative than chi-square in small samples
• McNemar test:
— Paired binary data
— Tests whether discordant pair proportions differ
— Common for diagnostic test comparison and pre/post studies
• Cochran Q test:
— Extension of McNemar to ≥3 repeated binary measures on same subjects
— Example: same patient evaluated by 3 different imaging modalities (positive/negative)
• Cochran-Mantel-Haenszel test:
— Stratified 2×2 analysis; pooled OR adjusting for one confounder
— Used when sparse data prevent full logistic regression
• Cochran-Armitage trend test:
— Ordered exposure categories; tests for linear trend in proportions
• Likelihood-ratio chi-square (G-test):
— Alternative formulation; same df and critical values
— Used in log-linear models for higher-dimensional contingency tables
• Log-rank test:
— Not a contingency-table chi-square per se, but produces a χ² statistic
— Compares survival curves between groups
— Different from chi-square: handles censored time-to-event data
Key distinction: Most of these tests yield a χ² statistic and a p-value (Fisher exact gives an exact p-value with no χ² statistic), but they answer different questions. Step 3 stems exploit this: read the data structure (paired? stratified? ordered? censored?) to pick the right test, even when several answer choices "feel like" chi-square.
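The stated equivalence between the two-proportion z-test and the 2×2 chi-square can be verified numerically. The sketch below uses the pooled-variance z statistic and the shortcut formula χ² = N(ad − bc)² / [(a+b)(c+d)(a+c)(b+d)], both without continuity correction, on an 80/100 vs 60/100 table. Function names are illustrative.

```python
import math


def z_two_proportions(x1, n1, x2, n2):
    """Pooled-variance z statistic for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(a, b, c, d):
    """Shortcut form of the Pearson chi-square for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


z = z_two_proportions(80, 100, 60, 100)
chi2 = chi_square_2x2(80, 20, 60, 40)
print(f"z = {z:.3f}, z^2 = {z ** 2:.3f}, chi2 = {chi2:.3f}")
```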

• Continuous outcome, 2 groups:
— Normally distributed → Student's t-test (independent samples)
— Paired/repeated on same subject → paired t-test
— Non-normal or ordinal → Mann-Whitney U (independent) or Wilcoxon signed-rank (paired)
• Continuous outcome, ≥3 groups:
— Normally distributed → one-way ANOVA
— Non-normal → Kruskal-Wallis test
— Repeated measures → repeated-measures ANOVA or Friedman test
• Two continuous variables:
— Linear association → Pearson correlation (parametric) or Spearman (rank-based)
— Predictive model → linear regression
• Time-to-event outcome:
— Kaplan-Meier curves for visualization
— Log-rank test for comparison
— Cox proportional hazards regression for adjusted hazard ratios
• Diagnostic test evaluation:
— Sensitivity, specificity, PPV, NPV from 2×2 tables (descriptive, not chi-square)
— ROC curve / AUC for discrimination
— Likelihood ratios for clinical decision-making
• Same trial, two framings of the outcome:
— Stem reports "mean HbA1c was 7.2 vs 7.8": continuous, not categorical → t-test, not chi-square
— Stem reports "60% vs 75% achieved goal HbA1c <7.0": binary → chi-square
— Same trial can be analyzed both ways depending on how outcome is defined
Board pearl: Decide test by outcome data type first, then number of groups, then independence. Means → t-test/ANOVA. Proportions → chi-square. Time-to-event → log-rank. Correlation between continuous variables → Pearson/Spearman. This four-bucket scheme handles ~90% of Step 3 biostats test-selection stems.

— Hand hygiene compliance before/after intervention → McNemar if same observers/units across time; chi-square if independent samples
— Readmission rates across hospital units → R×C chi-square
— Door-to-balloon time within 90 min: yes/no across two years → 2×2 chi-square
— Vaccination rates across clinics → chi-square
— Disease prevalence by demographic group → chi-square
— Outbreak investigation: exposed vs unexposed, ill vs not ill → 2×2 chi-square (often expanded with attack rates)
— Screening uptake by insurance status → chi-square
— Categorical outcome (received guideline-concordant care, yes/no) across race/ethnicity → chi-square for unadjusted; logistic regression for adjusted
— Pre/post payment-model change in proportion meeting quality benchmark → chi-square or McNemar depending on whether same providers
— Adherence to a new protocol across hospital wards → chi-square
— Use run charts and statistical process control (SPC) for longitudinal trends rather than repeated chi-squares
Step 3 management: For a QI project measuring a binary process or outcome metric (compliance, readmission, complication, immunization uptake) across two or more independent groups, the default analysis is chi-square. If measuring the same group over time (paired pre/post), use McNemar. If trending across many time points, use SPC charts or interrupted time-series analysis rather than chi-square.

— Test name (and any variant: Fisher exact, McNemar, etc.)
— χ² value
— Degrees of freedom
— Exact p-value (not just "p<0.05")
— Effect size: OR, RR, ARR, or risk difference with 95% CI
— Sample sizes in each group
— "Cure rates differed between groups (80/100 [80%] vs 60/100 [60%]; χ²=9.52, df=1, p=0.002; risk difference 20% [95% CI 7–33%]; NNT 5)"
— Only "p<0.05" without χ² or CI → insufficient
— Percentages without denominators → cannot reproduce
— No effect size → reader cannot judge clinical importance
— CONSORT for RCT reporting requires both significance testing and effect estimates with CIs
— STROBE for observational studies similar
— Translate chi-square + effect size into shared decision-making language for patients
— "Out of every 100 patients treated, 20 more are cured than with placebo" beats "p = 0.002"
— Number needed to treat (NNT) and absolute risk reduction (ARR) are the patient-facing translations
— Patients value absolute risk reduction over relative risk reduction; literature shows informed-consent quality improves with absolute terms
CCS pearl: When asked to counsel a patient or family about a study result, do not quote the chi-square p-value. Instead, translate to absolute risk reduction and NNT, frame the time horizon, and acknowledge uncertainty (the confidence interval). This is the Step 3 communication standard for evidence-based shared decision-making.
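A reporting line like the cellulitis example can be assembled programmatically. The sketch below computes the risk difference with a simple Wald 95% confidence interval (an assumption of this sketch; other interval methods give slightly different bounds) and the NNT for 80/100 vs 60/100. It yields roughly 7.6% to 32.4%, in line with the interval quoted in the example. Names are illustrative.

```python
import math


def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Risk difference with a Wald confidence interval (z = 1.96 for 95%)."""
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z * se, rd + z * se


rd, ci_low, ci_high = risk_difference_ci(80, 100, 60, 100)
nnt = 1 / rd

# Patient-facing and reader-facing summaries built from the same numbers
report = (
    f"80/100 (80%) vs 60/100 (60%); risk difference {rd:.0%} "
    f"(95% CI {ci_low:.1%} to {ci_high:.1%}); NNT {nnt:.0f}"
)
counseling = (
    f"Out of every 100 patients treated, about {rd * 100:.0f} more are cured "
    f"than with placebo."
)
print(report)
print(counseling)
```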

— Patients must understand magnitude and uncertainty of benefits/risks, not just statistical significance
— Quoting a p-value without ARR or NNT may violate the spirit of informed consent
— Studies show patients overestimate benefit when only relative risk reduction is presented — present absolute numbers
— Running many chi-squares and reporting only the significant findings is p-hacking — a research integrity violation
— Pre-registration of analyses (e.g., on ClinicalTrials.gov) is now standard for RCTs
— Post-hoc subgroup analyses must be labeled as hypothesis-generating
— Discharge medications and follow-up plans should be grounded in evidence with adequate effect size, not borderline significance from small studies
— A new drug with χ² p=0.04 but ARR of 0.5% may not warrant prescribing in a frail elderly patient at discharge
— Chi-square stratified by race/ethnicity reveals disparities but must be reported responsibly — avoid attributing differences to biology when social determinants are operative
— Outbreak investigations using chi-square (attack rate by exposure) often trigger public health reporting to local/state health departments
— Clinicians must recognize when a research-style 2×2 is actually a reportable cluster
— Industry-sponsored trials may emphasize relative over absolute risk; readers must demand both
Step 3 management: When counseling a patient on entering a clinical trial or accepting a new therapy, present (1) absolute risk reduction, (2) number needed to treat, (3) number needed to harm, and (4) uncertainty (95% CI). Reliance on p-values alone is ethically inadequate for informed consent and shared decision-making — a recurring Step 3 communication theme.

Board pearl: If you only memorize one cutoff for Step 3 biostatistics, make it χ² > 3.84 with df=1 → p < 0.05. Combined with "counts → chi-square, means → t-test, time-to-event → log-rank," you can crack the majority of biostatistics test-selection items.

— Stem provides cure rates in two groups as counts
— Asks: "Which is the most appropriate statistical test?"
— Answer: chi-square (or Fisher exact if expected counts < 5)
— Distractors: t-test, ANOVA, Pearson correlation, log-rank
— Stem reports χ²=12.3, df=1, p=0.0005
— Asks: "What does this mean?"
— Answer: groups differ significantly in the categorical outcome; reject null
— Distractors: "drug works in 99.95% of patients" (wrong interpretation of p)
— Stem describes "before and after the intervention in the same 50 patients"
— Asks for test choice
— Answer: McNemar, not chi-square
— Stem shows 2×2 with cells like 1, 9, 2, 8
— Answer: Fisher exact, not chi-square
— Three drug arms, binary outcome, one p-value reported
— Asks what additional analysis is needed
— Answer: post-hoc pairwise tests with multiplicity correction (Bonferroni)
— Observational study shows unadjusted χ² association
— Asks for next step
— Answer: multivariable logistic regression for adjusted OR
— Stem gives a significant χ² and asks how to counsel patient
— Answer: present absolute risk reduction and NNT, not p-value alone
— Genetics scenario comparing observed to Mendelian ratios
— Answer: chi-square goodness-of-fit
— Aggregate result differs from stratified result
— Answer: Cochran-Mantel-Haenszel or logistic regression with confounder
Step 3 management: For every biostatistics stem, identify in order: (1) outcome type (categorical/continuous/time-to-event), (2) number of groups, (3) independence/pairing, (4) sample adequacy, (5) need for adjustment. This five-step triage solves the majority of biostats items.

The chi-square test compares observed versus expected counts to determine whether two categorical variables are associated, and its correct use depends on independent observations, adequate expected cell counts (≥5), and pairing the p-value with an effect size for clinically meaningful interpretation.

