Biostatistics & Population Health

Chi-square test: when to use and interpretation

Clinical Overview and When to Suspect Chi-Square Use

Chi-square (χ²) test: nonparametric test of association between two categorical variables, comparing observed vs expected frequencies under a null hypothesis of independence

When to suspect chi-square is the right test on Step 3:
— Both exposure and outcome are categorical (nominal or ordinal collapsed to categories)
— Data are presented as counts in a contingency table (2×2, 2×3, R×C), not means or medians
— Question asks: "Is there an association between X and Y?" where X and Y are groupings (smoker vs nonsmoker; cured vs not cured; drug A vs B vs C)

Three flavors tested:
— Chi-square test of independence: two variables in one sample (e.g., smoking status × lung cancer status)
— Chi-square test of homogeneity: same categorical variable across multiple populations (e.g., complication rate across 3 hospitals)
— Chi-square goodness-of-fit: observed distribution vs theoretical/expected (e.g., does observed allele frequency match Hardy-Weinberg)

Clinical research scenarios where chi-square dominates:
— Comparing proportions cured between treatment arms
— Adverse event rates across drug groups in an RCT
— Screening test uptake by demographic category
— Categorical quality improvement outcomes (readmitted yes/no by unit)

Why this matters at Step 3 level: ambulatory and QI questions increasingly present 2×2 tables of outcomes and ask which test to apply, or give a p-value and ask for its interpretation. Picking chi-square vs t-test vs ANOVA is a recurring stem trap.

Board pearl: If you see a contingency table of counts and the question stem asks whether two categorical variables are associated or whether proportions differ between groups, the answer is chi-square (or Fisher exact if expected cell counts are small). Means → t-test/ANOVA; counts/proportions → chi-square.

Presentation Patterns and Key History — Recognizing the Setup

Step 3 stems disguise chi-square questions in clinical research vignettes. Learn the linguistic cues:
— "Proportion of patients who…"
— "Percentage with the outcome…"
— "Number cured vs not cured…"
— "Compared the rate of [binary event] between groups"

Typical vignette skeleton:
— Investigators enroll patients with condition X
— Randomize or stratify into 2 or more discrete groups (drug A vs placebo; intervention vs usual care)
— Outcome measured as yes/no, cured/not cured, alive/dead, readmitted/not
— Results presented in a 2×2 or R×C table

Distinguish from neighboring stems:
— If outcome is blood pressure in mmHg across 2 groups → t-test, not chi-square
— If outcome is HbA1c across 3 drug arms → ANOVA
— If two continuous variables compared (LDL vs BMI) → correlation/regression
— If time-to-event with censoring → log-rank / Kaplan-Meier, not chi-square

Sample size and matching cues:
— Independent samples required for standard chi-square; if data are paired (before/after same patient, matched case-control on binary outcome), use McNemar test
— Small samples (any expected cell count < 5 in a 2×2) → use Fisher exact test

Ordinal outcomes:
— If the categorical outcome is ordered (mild/moderate/severe) and you want to detect a trend → chi-square test for trend (Cochran-Armitage) is more powerful than generic chi-square

Key distinction: Chi-square assumes independent observations and adequate expected counts. The two most common Step 3 traps are (1) confusing paired data (use McNemar) with independent data, and (2) ignoring the expected cell count < 5 rule, which mandates Fisher exact in small 2×2 tables.

Structural Anatomy of the 2×2 Table — Exam Setup

• Every chi-square question hinges on reading the contingency table correctly. Standard 2×2 layout:
                Outcome +    Outcome −    Row total
Exposure +      a            b            a+b
Exposure −      c            d            c+d
Column total    a+c          b+d          N
• Observed (O) values are the cell counts a, b, c, d
• Expected (E) under the null (no association) for each cell:
— E = (row total × column total) / grand total
— Example: E for cell a = (a+b)(a+c)/N
• Chi-square statistic: χ² = Σ (O − E)² / E across all cells
• Degrees of freedom (df) for a contingency table:
— df = (rows − 1) × (columns − 1)
— 2×2 table → df = 1
— 2×3 table → df = 2
— 3×4 table → df = 6
• Critical values worth memorizing (α = 0.05):
— df = 1: χ² > 3.84 is significant
— df = 2: χ² > 5.99
— df = 3: χ² > 7.81
• Effect measures derivable from the same 2×2:
— Odds ratio (OR) = (a×d) / (b×c) — case-control studies
— Relative risk (RR) = [a/(a+b)] / [c/(c+d)] — cohort/RCT
— Chi-square tells you whether there is an association; OR/RR tell you how strong and in which direction
Step 3 management (of test selection): Given a 2×2 table, first verify all expected counts ≥ 5. If yes → chi-square with df=1, reject null if χ² > 3.84 or p < 0.05. If any expected count < 5 → Fisher exact. If observations are paired → McNemar, using only the discordant pairs (b and c cells).
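The formulas in this section can be checked numerically. What follows is a minimal Python sketch under stated assumptions, not a reference implementation: the function name `analyze_2x2` and the example counts are hypothetical, chosen only to mirror the a, b, c, d layout of the table.

```python
def analyze_2x2(a, b, c, d):
    """Expected counts, Pearson chi-square, OR and RR for a 2x2 table.

    Layout follows the table in the text: rows are exposure +/-,
    columns are outcome +/-. Assumes no zero cells.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    # E = (row total x column total) / grand total
    expected = tuple(
        tuple(r * k / n for k in col_totals) for r in row_totals
    )
    # chi-square = sum over the four cells of (O - E)^2 / E
    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    odds_ratio = (a * d) / (b * c)
    relative_risk = (a / (a + b)) / (c / (c + d))
    return expected, chi2, odds_ratio, relative_risk


# Hypothetical counts: 30/100 exposed and 10/100 unexposed have the outcome.
expected, chi2, odds_ratio, relative_risk = analyze_2x2(30, 70, 10, 90)
print(expected)                 # ((20.0, 80.0), (20.0, 80.0))
print(round(chi2, 2))           # 12.5
print(round(odds_ratio, 2))     # 3.86
print(round(relative_risk, 2))  # 3.0
```

With df = 1, the statistic of 12.5 exceeds the 3.84 critical value, while the OR and RR describe the direction and size of the association.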

Choosing Chi-Square vs Alternatives — Diagnostic Algorithm

Step-by-step test selection when faced with categorical outcome data:
— Step 1: Confirm outcome variable is categorical (not continuous). If continuous, exit to t-test/ANOVA/regression
— Step 2: Determine number of groups and whether independent
— Step 3: Check expected cell counts and pairing
— Step 4: Choose ordinal trend test if categories are ordered

Decision tree:
— 2 independent groups, binary outcome, expected ≥ 5 → chi-square (2×2) or equivalently z-test for two proportions
— 2 independent groups, binary outcome, expected < 5 in any cell → Fisher exact test
— Paired/matched binary data (same person before/after; matched pairs) → McNemar test
— ≥3 independent groups, binary or nominal outcome → chi-square R×C
— ≥3 ordered groups, binary outcome with hypothesized trend → Cochran-Armitage trend test
— Stratified 2×2 tables (controlling for confounder) → Cochran-Mantel-Haenszel
— Repeated measures on same subjects across ≥3 time points (binary) → Cochran Q

Common Step 3 confusions:
— Chi-square vs z-test for two proportions: mathematically equivalent for 2×2; either is acceptable
— Chi-square vs logistic regression: chi-square = bivariate association; logistic = adjusts for multiple covariates
— Chi-square vs t-test: t-test is for means of continuous data, never for proportions

Board pearl: When a stem mentions "adjusted for age, sex, and comorbidities" with a categorical outcome, the test is logistic regression, not chi-square. Chi-square is unadjusted; once confounders enter the analysis, you move to multivariable models. This is a frequent Step 3 distractor among the answer choices.
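The decision tree in this section can be written as a small lookup function. This is only an illustrative sketch: `choose_test` and all of its parameter names are hypothetical, and the branch order is one reasonable reading of the tree rather than a formal algorithm.

```python
def choose_test(outcome_categorical, paired=False, groups=2,
                min_expected=5.0, ordered_trend=False, stratified=False):
    """Return the test name suggested by the decision tree in the text.

    groups: number of independent groups, or number of repeated
            measurements when paired=True.
    min_expected: smallest expected cell count in the table.
    """
    if not outcome_categorical:
        return "t-test/ANOVA/regression"
    if stratified:
        return "Cochran-Mantel-Haenszel"
    if paired:
        # Two paired measurements -> McNemar; three or more -> Cochran Q
        return "McNemar" if groups == 2 else "Cochran Q"
    if ordered_trend and groups >= 3:
        return "Cochran-Armitage trend test"
    if groups == 2:
        return "chi-square (2x2)" if min_expected >= 5 else "Fisher exact"
    return "chi-square (RxC)"


print(choose_test(True, groups=2, min_expected=12))    # chi-square (2x2)
print(choose_test(True, groups=2, min_expected=1.5))   # Fisher exact
print(choose_test(True, paired=True))                  # McNemar
print(choose_test(True, groups=3, ordered_trend=True)) # Cochran-Armitage trend test
```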

Worked Calculation — From Table to p-Value

• Worked example: RCT of new antibiotic vs placebo for cellulitis cure at 7 days
           Cured    Not cured    Total
Drug       80       20           100
Placebo    60       40           100
Total      140      60           200
• Compute expected values (row total × column total / N):
— E(Drug, Cured) = (100 × 140)/200 = 70
— E(Drug, Not cured) = (100 × 60)/200 = 30
— E(Placebo, Cured) = (100 × 140)/200 = 70
— E(Placebo, Not cured) = (100 × 60)/200 = 30
• Compute χ² = Σ (O−E)²/E:
— (80−70)²/70 = 100/70 = 1.43
— (20−30)²/30 = 100/30 = 3.33
— (60−70)²/70 = 100/70 = 1.43
— (40−30)²/30 = 100/30 = 3.33
— Total χ² = 9.52
• Interpretation:
— df = (2−1)(2−1) = 1
— Critical value at α=0.05, df=1 = 3.84
— 9.52 > 3.84 → reject the null; p < 0.05 (actually p ≈ 0.002)
— Conclude: cure proportions differ significantly between drug and placebo
• Pair with effect size:
— Absolute risk reduction = 80% − 60% = 20%
— Number needed to treat = 1/0.20 = 5
— Relative risk = 0.80/0.60 = 1.33 (33% relative increase in cure)
Board pearl: Knowing χ²crit = 3.84 at df=1 lets you eyeball whether a 2×2 result is significant without a calculator. If the question gives χ² and df, compare directly. The exam rarely demands the full calculation — usually it tests interpretation of a given χ² and p-value.
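The worked example can be reproduced in a few lines. The sketch below uses only the Python standard library; the helper name `chi2_p_value_df1` is hypothetical, and it relies on the fact that a chi-square variable with df = 1 is the square of a standard normal variable, so it applies to df = 1 only.

```python
import math


def chi2_p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with df = 1.

    With one degree of freedom, P(X > chi2) = P(|Z| > sqrt(chi2)),
    which equals erfc(sqrt(chi2 / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: drug, placebo. Columns: cured, not cured.
observed = [[80, 20], [60, 40]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected count
        chi2 += (o - e) ** 2 / e

p_value = chi2_p_value_df1(chi2)
print(round(chi2, 2), round(p_value, 3))  # 9.52 0.002
```

Passing 3.84 to the same helper returns roughly 0.05, which is why 3.84 serves as the df = 1 cutoff.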

Interpreting the p-Value and Confidence Intervals

p-value definition (must be airtight for Step 3):
— Probability of observing the data (or more extreme) assuming the null hypothesis is true
— NOT the probability that the null is true; NOT the probability the result is due to chance alone

Conventional thresholds:
— p < 0.05 → "statistically significant"; reject null
— p ≥ 0.05 → fail to reject; does not prove the null

Common Step 3 misinterpretations to recognize and eliminate:
— "The probability the drug doesn't work is 5%" → wrong
— "There is a 95% chance the drug works" → wrong
— "The result will be replicated 95% of the time" → wrong
— Correct: "If the drug truly had no effect, we'd see results this extreme < 5% of the time"

Statistical vs clinical significance:
— Large N can render trivial differences statistically significant
— A χ² with p = 0.001 but ARR of 0.5% may not warrant practice change
— Always pair p-value with effect size (RR, OR, ARR, NNT) and 95% CI

Confidence intervals complement chi-square:
— A 95% CI for the risk difference or OR that excludes 0 (for differences) or 1 (for ratios) implies p < 0.05
— CI gives precision; the chi-square p-value supports only a reject / fail-to-reject decision

Key distinction: Chi-square answers "is there an association?" — it is a hypothesis test. Odds ratio, relative risk, and risk difference (with confidence intervals) answer "how strong is the association?" — these are effect measures. Step 3 questions love to ask which is appropriate when a clinician needs to counsel a patient on magnitude of benefit: the answer is the effect estimate, not the p-value.
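To see how a confidence interval carries the same significance information as the p-value while adding precision, the cellulitis counts (80/100 vs 60/100 cured) can be run through standard Wald-type intervals. This is a sketch under assumptions: the function names are hypothetical, and the normal-approximation formulas used here are one common choice among several interval methods.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def risk_difference_ci(events_1, n_1, events_2, n_2):
    """Risk difference with a Wald 95% confidence interval."""
    p1, p2 = events_1 / n_1, events_2 / n_2
    se = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    diff = p1 - p2
    return diff, diff - Z_95 * se, diff + Z_95 * se


def relative_risk_ci(events_1, n_1, events_2, n_2):
    """Relative risk with a 95% confidence interval on the log scale."""
    rr = (events_1 / n_1) / (events_2 / n_2)
    se_log = math.sqrt(1 / events_1 - 1 / n_1 + 1 / events_2 - 1 / n_2)
    low = math.exp(math.log(rr) - Z_95 * se_log)
    high = math.exp(math.log(rr) + Z_95 * se_log)
    return rr, low, high


rd, rd_low, rd_high = risk_difference_ci(80, 100, 60, 100)
rr, rr_low, rr_high = relative_risk_ci(80, 100, 60, 100)
print(round(rd, 3), round(rd_low, 3), round(rd_high, 3))  # 0.2 0.076 0.324
print(round(rr, 2), round(rr_low, 2), round(rr_high, 2))  # 1.33 1.11 1.61
```

The risk-difference interval excludes 0 and the relative-risk interval excludes 1, consistent with p < 0.05, and the width of each interval shows how precisely the effect is estimated.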

Assumptions of the Chi-Square Test — and When They Break

• Core assumptions (memorize for the recognition stem):
— Independence of observations — each subject contributes to only one cell
— Mutually exclusive categories — no subject counted twice
— Random sampling from the population of interest
— Adequate expected frequencies — all expected counts ≥ 5 (more lenient: ≥80% of cells with E ≥ 5 and no cell with E < 1 in larger tables)
— Counts, not percentages or proportions, must be the input data
• Violations and their fixes:
— Paired data → McNemar test
— Small expected counts in 2×2 → Fisher exact test
— Small expected counts in R×C → Fisher-Freeman-Halton extension or collapse categories
— Repeated measures on same subjects → Cochran Q or GEE
— Need to adjust for confounders → logistic regression or Mantel-Haenszel stratified analysis
— Ordered categories with trend → Cochran-Armitage trend test
• Common Step 3 trap — using percentages:
— A table presenting "60% vs 80% cure" without underlying N cannot be tested — you need the actual counts
— If only percentages are given, the question often expects you to identify that insufficient data are presented for a valid chi-square
• Continuity correction:
— Yates correction subtracts 0.5 from |O − E| before squaring; reduces type I error in small 2×2 samples
— Now considered overly conservative; many modern texts skip it
— Step 3 rarely tests the math but may mention it as a refinement
Board pearl: When any expected cell count is < 5, switch to Fisher exact test. This shows up most often in small RCTs, rare adverse events, and pilot studies. The exam loves a 2×2 with cells like 1, 2, 3, 4 — that is a Fisher exact stem, not chi-square.
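The expected-count rule and the Yates correction described above can be sketched in Python as follows. The function names are hypothetical, and the thresholds simply encode the rules as stated in this section (all expected counts ≥ 5 for a 2×2; for larger tables, at least 80% of cells with E ≥ 5 and none below 1).

```python
def expected_counts(table):
    """Expected counts under independence for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def approximation_ok(table):
    """Apply the expected-count rules listed in the text."""
    cells = [e for row in expected_counts(table) for e in row]
    if len(table) == 2 and len(table[0]) == 2:
        return min(cells) >= 5
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_at_least_5 >= 0.8


def yates_chi2(table):
    """Chi-square for a 2x2 table with the Yates continuity correction."""
    expected = expected_counts(table)
    # Subtract 0.5 from |O - E| (never going below zero) before squaring.
    return sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


print(approximation_ok([[80, 20], [60, 40]]))      # True
print(approximation_ok([[1, 9], [2, 8]]))          # False
print(round(yates_chi2([[80, 20], [60, 40]]), 2))  # 8.6
```

For the cellulitis table the corrected statistic (about 8.6) is smaller than the uncorrected 9.52, which illustrates why the correction is described as conservative.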

Specialized Variants — McNemar, Mantel-Haenszel, Fisher

McNemar test — paired binary data:
— Used for pre/post on the same subject or matched pairs
— Examples: pre/post-intervention symptom status; matched case-control with binary exposure; diagnostic test comparison (test A vs test B on same patients)
— Uses only discordant pairs (b and c in 2×2): χ² = (b−c)²/(b+c), df=1
— Concordant pairs (both yes or both no) contribute no information about change

Fisher exact test:
— Used when expected counts < 5 in 2×2 tables
— Calculates exact probability rather than approximating with χ² distribution
— Common in small surgical series, rare disease studies, pilot trials

Cochran-Mantel-Haenszel (CMH):
— Tests association between exposure and outcome adjusting for a stratifying variable (confounder)
— Produces a pooled OR across strata
— Example: smoking → lung cancer, stratified by age decade
— If strata are heterogeneous (Breslow-Day test significant), a single pooled OR is not appropriate; report stratum-specific estimates or use logistic regression with an interaction term

Cochran-Armitage trend test:
— Detects linear trend across ordered categorical exposure levels (e.g., never/former/current smoker → MI risk)
— More powerful than generic R×C chi-square when trend is plausible

Chi-square goodness-of-fit:
— Compares one sample's distribution to a theoretical one
— Hardy-Weinberg equilibrium, Mendelian ratios, expected demographic distributions

Step 3 management: Identify the data structure first. Paired = McNemar; small cells = Fisher; stratified = Mantel-Haenszel; ordered trend = Cochran-Armitage; goodness-of-fit to expected distribution = standard χ² goodness-of-fit. Memorize this five-item map and most biostatistics test-selection stems become trivial.
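Two of the variants in this section reduce to one-line formulas, sketched here in Python. The function names and all counts are hypothetical illustrations: 15 and 5 stand for discordant pairs in a pre/post study, and 290:110 stands for an observed split compared against a Mendelian 3:1 ratio.

```python
CRITICAL_DF1 = 3.84  # chi-square cutoff at alpha = 0.05, df = 1


def mcnemar_chi2(b, c):
    """McNemar statistic from the two discordant-pair counts (df = 1)."""
    return (b - c) ** 2 / (b + c)


def goodness_of_fit_chi2(observed, expected_proportions):
    """Pearson goodness-of-fit statistic; df = number of categories - 1."""
    n = sum(observed)
    expected = [n * p for p in expected_proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


mcnemar = mcnemar_chi2(15, 5)
gof = goodness_of_fit_chi2([290, 110], [0.75, 0.25])

print(mcnemar, mcnemar > CRITICAL_DF1)    # 5.0 True
print(round(gof, 2), gof > CRITICAL_DF1)  # 1.33 False
```

In this made-up data the paired change is significant at df = 1, while the observed split is compatible with the 3:1 expectation.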

Special Population — Small Samples and Rare Events

Why small samples wreck chi-square:
— The χ² statistic is discrete, but its p-value is taken from a continuous χ² distribution; with small N, expected counts drop and the approximation fails, which can inflate type I error
— A "significant" χ² in a study with N = 20 may be a statistical artifact

Defining "small":
— Any expected (not observed) cell count < 5 in a 2×2 table
— In larger R×C tables: any expected < 1, or > 20% of cells with expected < 5

Solutions hierarchy:
— Fisher exact test — preferred for small 2×2; computationally exact
— Collapse categories — combine sparse columns (e.g., merge "severe" and "very severe") to boost expected counts; must be clinically justifiable, not data-dredged
— Exact tests for R×C — Fisher-Freeman-Halton extension
— Bayesian methods — used in modern small-trial design but rarely on Step 3

Rare adverse event scenarios common in Step 3 stems:
— Phase 1/2 trials with 20-40 patients
— Surgical morbidity reviews of uncommon procedures
— Outbreak investigations of small clusters
— Pediatric rare disease cohorts

Don't confuse small sample with low power:
— Power affects type II error (missing a real effect)
— Small expected cell counts affect type I error validity of the test itself
— Both can co-exist: a tiny study may use a valid Fisher test but still miss real effects

Board pearl: A stem showing a 2×2 like (1, 9 / 0, 10) — say, 0 vs 1 adverse event in two arms of 10 patients — is a Fisher exact question, not chi-square. The answer choice "chi-square test" is the distractor; recognizing expected counts < 5 is the testing point.
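For small 2×2 tables the exact p-value comes from the hypergeometric distribution. The sketch below is one common two-sided definition (summing the probabilities of all tables no more likely than the observed one); the function name is hypothetical, and the second table is invented purely to show a significant result.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row_1, row_2, col_1 = a + b, c + d, a + c
    n = row_1 + row_2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        return comb(row_1, k) * comb(row_2, col_1 - k) / comb(n, col_1)

    low, high = max(0, col_1 - row_2), min(row_1, col_1)
    p_observed = prob(a)
    total = sum(
        prob(k)
        for k in range(low, high + 1)
        if prob(k) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, total)


# The (1, 9 / 0, 10) table from the board pearl: no evidence of a difference.
print(round(fisher_exact_two_sided(1, 9, 0, 10), 3))  # 1.0
# Hypothetical small table with a strong association.
print(round(fisher_exact_two_sided(8, 2, 1, 9), 4))   # 0.0055
```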

Special Population — Multiple Groups and Multiple Testing

R×C chi-square handles ≥3 groups or ≥3 outcome categories in one omnibus test

The multiple comparisons problem:
— A significant R×C χ² tells you at least one pair of proportions differs — not which
— Performing pairwise 2×2 chi-squares inflates type I error
— With k = 3 groups, 3 pairwise tests at α=0.05 → family-wise error ≈ 14%

Corrections for post-hoc pairwise testing:
— Bonferroni: divide α by number of comparisons (3 comparisons → α=0.0167 each); conservative
— Holm-Bonferroni: stepwise, less conservative
— Benjamini-Hochberg (FDR): controls false discovery rate; common in genomics

Subgroup analyses — a Step 3 favorite trap:
— A trial with overall p = 0.06 that finds "significant" benefit in women (p=0.04) is performing multiple comparisons
— Always interpret subgroup χ² results as hypothesis-generating, not confirmatory
— Pre-specified subgroups (in protocol) carry more weight than post-hoc ones

Interaction testing:
— Formal way to ask whether effect differs across strata
— Tested with an interaction term in regression, not a chi-square comparison of subgroup p-values

Key distinction: A significant R×C chi-square means "the groups differ somewhere"; it does not identify which groups. Follow-up pairwise comparisons require multiplicity adjustment. On Step 3, when a stem lists three drug arms with one p-value, recognize that the omnibus test is significant but pairwise conclusions require additional analysis with corrected α.
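The arithmetic behind the multiple-comparisons figures can be sketched as follows. The function names and the three p-values are hypothetical, and the family-wise error formula assumes the tests are independent, which pairwise comparisons sharing a group only approximate.

```python
def family_wise_error(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


pairwise_p = [0.010, 0.020, 0.040]  # hypothetical pairwise chi-square p-values
print(round(family_wise_error(0.05, 3), 3))  # 0.143
print(bonferroni_reject(pairwise_p))         # [True, False, False]
print(holm_reject(pairwise_p))               # [True, True, True]
```

The same three p-values lead to one rejection under Bonferroni and three under Holm, which is what "less conservative" means in practice.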

Complications — Misinterpretation and Statistical Pitfalls

Pitfall 1: Statistical vs clinical significance
— Huge studies (N=50,000) can detect trivial differences
— A χ² with p<0.001 and ARR of 0.2% may not justify treatment

Pitfall 2: Confusing association with causation
— Chi-square shows two variables co-vary, nothing more
— Need study design (RCT > cohort > case-control) and Bradford Hill considerations to infer causation

Pitfall 3: Simpson's paradox
— Aggregated 2×2 reverses direction when stratified by a third variable
— Classic UC Berkeley admissions example
— Mantel-Haenszel or logistic regression unmasks this

Pitfall 4: Ecological fallacy
— Group-level associations may not hold at the individual level
— Common in public health data

Pitfall 5: Misusing relative vs absolute measures
— "50% relative reduction" sounds dramatic but may be 2% → 1% absolute
— Chi-square p-value gives no sense of magnitude

Pitfall 6: Ignoring missing data
— Subjects with missing outcome dropped from chi-square — can bias if missingness is informative
— Intention-to-treat analyses handle this in RCTs

Pitfall 7: Reporting only the significant findings
— Publication bias and selective subgroup reporting inflate apparent effects

Step 3 management: When a stem reports a "significant chi-square," reflexively ask:
— Was the outcome clinically important (NNT, ARR)?
— Are there confounders unaccounted for?
— Is this a subgroup or post-hoc finding?
— Is the CI wide, suggesting imprecision?
— Could chance, bias, or confounding still explain the result?

Statistical significance is necessary but not sufficient to change practice.
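Simpson's paradox (Pitfall 3) is easiest to appreciate with numbers. The counts below are entirely made up for illustration, and the function names are hypothetical; the sketch compares the crude odds ratio from the collapsed table with the Mantel-Haenszel pooled odds ratio across two severity strata.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled OR across a list of (a, b, c, d) strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts per stratum:
# (treated cured, treated not cured, control cured, control not cured)
mild = (90, 10, 800, 200)
severe = (300, 700, 20, 80)

# Collapse the strata into one aggregated 2x2 table.
crude = tuple(m + s for m, s in zip(mild, severe))

print(round(odds_ratio(*crude), 2))                 # 0.19
print(round(odds_ratio(*mild), 2))                  # 2.25
print(round(odds_ratio(*severe), 2))                # 1.71
print(round(mantel_haenszel_or([mild, severe]), 2)) # 1.91
```

Treatment looks harmful in the aggregated table (OR below 1) yet beneficial within each stratum, because in this invented data treated patients were mostly severe cases; the stratified estimate recovers the within-stratum direction.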

When to Escalate — Moving from Chi-Square to Regression

Chi-square is bivariate and unadjusted. Real-world clinical research almost always requires multivariable adjustment. Recognize the escalation triggers:

Triggers to abandon chi-square in favor of regression:
— Need to adjust for confounders (age, sex, comorbidities, baseline severity)
— Multiple simultaneous predictors of interest
— Interest in interaction effects between variables
— Time-varying covariates
— Repeated measures or clustered data (patients within hospitals)

Test-selection upgrade map:
— Binary outcome + multiple predictors → logistic regression (yields adjusted OR)
— Time-to-event outcome → Cox proportional hazards (yields HR)
— Count outcome → Poisson or negative binomial regression
— Continuous outcome + predictors → linear regression
— Clustered/repeated data → GEE or mixed-effects models

What logistic regression adds over chi-square:
— Adjusted odds ratios with 95% CI
— Ability to handle continuous and categorical predictors together
— Tests for effect modification (interaction terms)
— Predicted probability for individual patients

Step 3 scenario:
— Observational study finds smoking → MI via χ² (p<0.001)
— But smokers are older, more hypertensive, more diabetic
— Logistic regression adjusting for these covariates yields an estimate of the independent effect of smoking, adjusted for the measured confounders
— Chi-square alone is inadequate for causal inference in observational data

CCS pearl: In an outpatient research-interpretation stem, when the question asks "what additional analysis is needed" after a significant unadjusted association, the correct upgrade is almost always multivariable logistic regression (for binary outcomes) or Cox regression (for time-to-event). Chi-square is the screening test; regression is the definitive analysis.

Key Differentials — Same-Category Statistical Tests for Categorical Data

Tests that look similar to chi-square but serve different purposes:

Z-test for two proportions:
— Mathematically equivalent to 2×2 chi-square
— Yields a z-score; z² = χ² (when no continuity correction is applied)
— Use either; results identical

Fisher exact test:
— For small expected counts
— Computes exact p-value via hypergeometric distribution
— More conservative than chi-square in small samples

McNemar test:
— Paired binary data
— Tests whether discordant pair proportions differ
— Common for diagnostic test comparison and pre/post studies

Cochran Q test:
— Extension of McNemar to ≥3 repeated binary measures on same subjects
— Example: same patient evaluated by 3 different imaging modalities (positive/negative)

Cochran-Mantel-Haenszel test:
— Stratified 2×2 analysis; pooled OR adjusting for one confounder
— Used when sparse data prevent full logistic regression

Cochran-Armitage trend test:
— Ordered exposure categories; tests for linear trend in proportions

Likelihood ratio chi-square (G-test):
— Alternative formulation; same df and critical values
— Used in log-linear models for higher-dimensional contingency tables

Log-rank test:
— Not a chi-square per se, but produces a χ² statistic
— Compares survival curves between groups
— Different from chi-square: handles censored time-to-event data

Key distinction: All these tests yield a χ² statistic and a p-value, but they answer different questions. Step 3 stems exploit this — read the data structure (paired? stratified? ordered? censored?) to pick the right test, even when several answer choices "feel like" chi-square.
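The equivalence between the two-proportion z-test and the 2×2 chi-square can be confirmed numerically. This sketch reuses the cellulitis counts (80/100 vs 60/100 cured); the function names are hypothetical, the z-test uses the pooled standard error, and neither statistic includes a continuity correction.

```python
import math


def two_proportion_z(events_1, n_1, events_2, n_2):
    """z statistic for two independent proportions (pooled standard error)."""
    p1, p2 = events_1 / n_1, events_2 / n_2
    pooled = (events_1 + events_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (p1 - p2) / se


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square via the 2x2 shortcut N(ad - bc)^2 / (margins)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


z = two_proportion_z(80, 100, 60, 100)
chi2 = pearson_chi2_2x2(80, 20, 60, 40)
print(round(z, 3), round(z ** 2, 2), round(chi2, 2))  # 3.086 9.52 9.52
```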

Key Differentials — Tests for Non-Categorical Data

When the outcome is NOT categorical, chi-square is wrong. Recognize these alternatives:

Continuous outcomes, 2 groups:
— Normally distributed → Student's t-test (independent samples)
— Paired/repeated on same subject → paired t-test
— Non-normal or ordinal → Mann-Whitney U (independent) or Wilcoxon signed-rank (paired)

Continuous outcomes, ≥3 groups:
— Normally distributed → one-way ANOVA
— Non-normal → Kruskal-Wallis test
— Repeated measures → repeated-measures ANOVA or Friedman test

Two continuous variables, relationship:
— Linear association → Pearson correlation (parametric) or Spearman (rank-based)
— Predictive model → linear regression

Time-to-event:
— Kaplan-Meier curves for visualization
— Log-rank test for comparison
— Cox proportional hazards regression for adjusted hazard ratios

Diagnostic test performance:
— Sensitivity, specificity, PPV, NPV from 2×2 tables (descriptive, not chi-square)
— ROC curve / AUC for discrimination
— Likelihood ratios for clinical decision-making

Common Step 3 trap:
— Stem reports "mean HbA1c was 7.2 vs 7.8" — this is continuous, not categorical → t-test, not chi-square
— Stem reports "60% vs 75% achieved goal HbA1c <7.0" — this is binary → chi-square
— Same trial can be analyzed both ways depending on how outcome is defined

Board pearl: Decide test by outcome data type first, then number of groups, then independence. Means → t-test/ANOVA. Proportions → chi-square. Time-to-event → log-rank. Correlation between continuous variables → Pearson/Spearman. This four-bucket scheme handles ~90% of Step 3 biostats test-selection stems.

Secondary Application — Quality Improvement and Population Health

Chi-square is the workhorse statistical test for QI and population health projects at the Step 3 level. Recognize these recurring vignettes:

Quality improvement examples:
— Hand hygiene compliance before/after intervention → McNemar if same observers/units across time; chi-square if independent samples
— Readmission rates across hospital units → R×C chi-square
— Door-to-balloon time within 90 min: yes/no across two years → 2×2 chi-square

Population health and epidemiology:
— Vaccination rates across clinics → chi-square
— Disease prevalence by demographic group → chi-square
— Outbreak investigation: exposed vs unexposed, ill vs not ill → 2×2 chi-square (often expanded with attack rates)
— Screening uptake by insurance status → chi-square

Health disparities research:
— Categorical outcome (received guideline-concordant care, yes/no) across race/ethnicity → chi-square for unadjusted; logistic regression for adjusted

Pay-for-performance metrics:
— Pre/post payment-model change in proportion meeting quality benchmark → chi-square or McNemar depending on whether same providers

Implementation science:
— Adherence to a new protocol across hospital wards → chi-square
— Use run charts and statistical process control (SPC) for longitudinal trends rather than repeated chi-squares

Step 3 management: For a QI project measuring a binary process or outcome metric (compliance, readmission, complication, immunization uptake) across two or more independent groups, the default analysis is chi-square. If measuring the same group over time (paired pre/post), use McNemar. If trending across many time points, use SPC charts or interrupted time-series analysis rather than chi-square.

Follow-Up and Reporting — How Chi-Square Results Should Be Communicated

Minimum reporting elements for a chi-square result in a clinical paper:
— Test name (and any variant: Fisher exact, McNemar, etc.)
— χ² value
— Degrees of freedom
— Exact p-value (not just "p<0.05")
— Effect size: OR, RR, ARR, or risk difference with 95% CI
— Sample sizes in each group

Example proper reporting:
— "Cure rates differed between groups (80/100 [80%] vs 60/100 [60%]; χ²=9.52, df=1, p=0.002; risk difference 20% [95% CI 8–32%]; NNT 5)"

Reporting failures to recognize on Step 3:
— Only "p<0.05" without χ² or CI → insufficient
— Percentages without denominators → cannot reproduce
— No effect size → reader cannot judge clinical importance

CONSORT and STROBE guidelines:
— CONSORT for RCT reporting requires both significance testing and effect estimates with CIs
— STROBE for observational studies is similar

Clinical decision-making downstream:
— Translate chi-square + effect size into shared decision-making language for patients
— "Out of every 100 patients treated, 20 more are cured than with placebo" beats "p = 0.002"
— Number needed to treat (NNT) and absolute risk reduction (ARR) are the patient-facing translations

Counseling tip:
— Patients value absolute risk reduction over relative risk reduction; literature shows informed-consent quality improves with absolute terms

CCS pearl: When asked to counsel a patient or family about a study result, do not quote the chi-square p-value. Instead, translate to absolute risk reduction and NNT, frame the time horizon, and acknowledge uncertainty (the confidence interval). This is the Step 3 communication standard for evidence-based shared decision-making.
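Translating counts into patient-facing absolute terms is simple arithmetic, sketched here in Python. The function name `absolute_terms` is hypothetical; exact fractions are used so that the NNT is not distorted by floating-point rounding, and the NNT is rounded up by convention. The second call uses invented counts to show how a 50% relative reduction can correspond to a small absolute change.

```python
import math
from fractions import Fraction


def absolute_terms(events_1, n_1, events_2, n_2):
    """Absolute difference per 100 patients and number needed to treat.

    Assumes the two proportions differ (otherwise NNT is undefined).
    """
    diff = abs(Fraction(events_1, n_1) - Fraction(events_2, n_2))
    per_100 = float(diff * 100)
    nnt = math.ceil(1 / diff)
    return per_100, nnt


# Cellulitis example: 80/100 cured on drug vs 60/100 on placebo.
print(absolute_terms(80, 100, 60, 100))    # (20.0, 5)
# Hypothetical: event rate falls from 2% to 1% (a 50% relative reduction).
print(absolute_terms(10, 1000, 20, 1000))  # (1.0, 100)
```

The first result matches the "20 more cured out of every 100 treated" phrasing and the NNT of 5 reported above.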

Ethical, Legal, and Patient Safety Considerations

Statistical literacy is a patient safety issue. Misinterpretation of chi-square results can drive inappropriate treatment decisions, especially in transitions of care.

Informed consent and statistical communication:
— Patients must understand magnitude and uncertainty of benefits/risks, not just statistical significance
— Quoting a p-value without ARR or NNT may violate the spirit of informed consent
— Studies show patients overestimate benefit when only relative risk reduction is presented — present absolute numbers

Multiple testing and ethical research conduct:
— Running many chi-squares and reporting only the significant findings is p-hacking — a research integrity violation
— Pre-registration of analyses (e.g., on ClinicalTrials.gov) is now standard for RCTs
— Post-hoc subgroup analyses must be labeled as hypothesis-generating

Transitions of care and statistical evidence:
— Discharge medications and follow-up plans should be grounded in evidence with adequate effect size, not borderline significance from small studies
— A new drug with χ² p=0.04 but ARR of 0.5% may not warrant prescribing in a frail elderly patient at discharge

Health disparities:
— Chi-square stratified by race/ethnicity reveals disparities but must be reported responsibly — avoid attributing differences to biology when social determinants are operative

Mandatory reporting and surveillance:
— Outbreak investigations using chi-square (attack rate by exposure) often trigger public health reporting to local/state health departments
— Clinicians must recognize when a research-style 2×2 is actually a reportable cluster

Conflicts of interest in reporting:
— Industry-sponsored trials may emphasize relative over absolute risk; readers must demand both

Step 3 management: When counseling a patient on entering a clinical trial or accepting a new therapy, present (1) absolute risk reduction, (2) number needed to treat, (3) number needed to harm, and (4) uncertainty (95% CI). Reliance on p-values alone is ethically inadequate for informed consent and shared decision-making — a recurring Step 3 communication theme.

High-Yield Associations and Rapid-Fire Clinical Facts

Board pearl: If you only memorize one cutoff for Step 3 biostatistics, make it χ² > 3.84 with df=1 → p < 0.05. Combined with "counts → chi-square, means → t-test, time-to-event → log-rank," you can crack the majority of biostatistics test-selection items.

Chi-square = two categorical variables, independent samples, expected counts ≥ 5

Fisher exact = same setup, but small samples (expected < 5)

McNemar = paired binary data

Cochran-Mantel-Haenszel = stratified 2×2, adjusts for one confounder

Cochran-Armitage = trend across ordered exposure categories

Cochran Q = ≥3 paired/repeated binary measures

Goodness-of-fit χ² = observed vs theoretical distribution (Hardy-Weinberg, Mendelian)

df for R×C = (rows − 1)(columns − 1)

Critical χ² at α=0.05: df=1 → 3.84; df=2 → 5.99; df=3 → 7.81

χ² for 2×2 ≡ z² for two-proportion z-test

Chi-square gives p-value only; pair with OR/RR/ARR/NNT for effect size

Logistic regression = the multivariable upgrade for binary outcomes

Cox regression = the multivariable upgrade for time-to-event

A "significant" R×C χ² identifies that groups differ somewhere, not which pair

Subgroup analyses with chi-square require multiplicity adjustment

Simpson's paradox: a stratified analysis can reverse the direction of the aggregated result

Yates continuity correction: optional, more conservative; rarely tested

Percentages alone are insufficient — chi-square needs counts

Inputs to chi-square: observed counts, expected counts under independence

Output: χ² statistic and p-value, not effect size

Confidence intervals for risk difference or OR provide both significance (excluding null) and precision

Statistical significance ≠ clinical significance — always evaluate NNT/ARR

Power: large N detects small effects; small N may miss true effects (type II)

Type I error (α): false positive; conventionally 0.05

Type II error (β): false negative; power = 1−β, conventionally ≥ 0.80

Board Question Stem Patterns

Pattern 1 — Test selection from 2×2 table:
— Stem provides cure rates in two groups as counts
— Asks: "Which is the most appropriate statistical test?"
— Answer: chi-square (or Fisher exact if expected counts < 5)
— Distractors: t-test, ANOVA, Pearson correlation, log-rank

Pattern 2 — Interpretation of given χ² and p:
— Stem reports χ²=12.3, df=1, p=0.0005
— Asks: "What does this mean?"
— Answer: groups differ significantly in the categorical outcome; reject null
— Distractors: "drug works in 99.95% of patients" (wrong interpretation of p)

Pattern 3 — Paired data trap:
— Stem describes "before and after the intervention in the same 50 patients"
— Asks for test choice
— Answer: McNemar, not chi-square

Pattern 4 — Small-sample trap:
— Stem shows 2×2 with cells like 1, 9, 2, 8
— Answer: Fisher exact, not chi-square

Pattern 5 — Multiple groups omnibus:
— Three drug arms, binary outcome, one p-value reported
— Asks what additional analysis is needed
— Answer: post-hoc pairwise tests with multiplicity correction (Bonferroni)

Pattern 6 — Adjustment for confounders:
— Observational study shows unadjusted χ² association
— Asks for next step
— Answer: multivariable logistic regression for adjusted OR

Pattern 7 — Patient communication:
— Stem gives a significant χ² and asks how to counsel patient
— Answer: present absolute risk reduction and NNT, not p-value alone

Pattern 8 — Goodness-of-fit:
— Genetics scenario comparing observed to Mendelian ratios
— Answer: chi-square goodness-of-fit

Pattern 9 — Stratified analysis / Simpson's paradox:
— Aggregate result differs from stratified result
— Answer: Cochran-Mantel-Haenszel or logistic regression with confounder

Step 3 management: For every biostatistics stem, identify in order: (1) outcome type (categorical/continuous/time-to-event), (2) number of groups, (3) independence/pairing, (4) sample adequacy, (5) need for adjustment. This five-step triage solves the majority of biostats items.

One-Line Recap

The chi-square test compares observed versus expected counts to determine whether two categorical variables are associated, and its correct use depends on independent observations, adequate expected cell counts (≥5), and pairing the p-value with an effect size for clinically meaningful interpretation.

Core indication: two categorical variables, independent observations, counts (not means) in a contingency table; df = (r−1)(c−1); χ² > 3.84 at df=1 means p < 0.05

Critical alternatives: Fisher exact when expected counts < 5; McNemar for paired binary data; Mantel-Haenszel for stratified analysis; Cochran-Armitage for ordered trend; logistic regression when adjusting for confounders

Effect size always: chi-square yields significance only — always report and counsel with absolute risk reduction, NNT, OR/RR with 95% CI to translate statistics into clinical decisions

Step 3 application: workhorse test for QI projects, outbreak investigations, RCT primary outcomes, and health-disparities research; recognize the data structure (paired, stratified, ordered, small) to pick the correct variant; never confuse statistical with clinical significance, and never quote a p-value to a patient — translate it into absolute terms for informed, shared decision-making at every transition of care