Biostatistics & Population Health

Cluster randomized trial design and analysis

Clinical Overview and When to Suspect Cluster Randomization

Definition: A cluster randomized trial (CRT) randomizes intact groups (clusters), such as clinics, hospitals, schools, villages, dialysis units, or nursing homes, rather than individual patients, to receive intervention vs control.

Why use a CRT instead of an individual RCT:

— Intervention operates at group level: hand-hygiene protocols, ICU checklists, school-based vaccination campaigns, clinic-level EHR alerts, community water fluoridation.

— Avoid contamination: if you randomize individuals within one clinic to "shared decision-making visit" vs usual care, the same physician will leak the intervention to controls.

— Logistical/ethical feasibility: easier to train one whole unit; some interventions (policy changes) can't be delivered to individuals.

— Study population-level outcomes: herd immunity, transmission dynamics.

Common Step 3 / public-health examples:

— Cluster-randomizing nursing homes to a fall-prevention bundle.

— Randomizing primary care practices to a depression collaborative-care model.

— Randomizing ICUs to a sepsis early-warning algorithm.

— Randomizing villages to bed-net distribution.

Suspect a CRT on the exam when the stem says: "researchers randomized 12 clinics…", "hospitals were assigned…", "schools were allocated…", or describes an intervention delivered by a clinician/system to all patients seen.

Core trade-off: CRTs gain feasibility and reduce contamination but lose statistical efficiency because outcomes within a cluster are correlated: patients in the same clinic share providers, culture, and case-mix.

Board pearl: If the unit of randomization ≠ the unit of analysis (e.g., randomize clinics but analyze individual patients as if independent), the trial commits the "unit of analysis error": standard errors are falsely small, p-values falsely significant, and the study is invalid as reported. Always check that the analysis accounts for clustering (mixed models, GEE, or cluster-level summary).

Presentation Patterns and Key History (Recognizing a CRT in a Stem)

Stem signatures that scream CRT:

— "24 primary care practices were randomized, 12 to a pharmacist-led hypertension protocol and 12 to usual care; 3,400 patients were followed for 1 year."

— "Investigators randomized 40 ICUs to a sepsis bundle vs standard care."

— "Villages in rural Kenya were randomly assigned to mass azithromycin vs placebo."

— Note the two sample sizes: number of clusters (k) and number of individuals (n).

Variants you must recognize:

— Parallel CRT: the standard design; half the clusters get intervention, half get control, run simultaneously.

— Cluster crossover: each cluster gets both intervention and control in different periods (washout needed).

— Stepped-wedge CRT: all clusters eventually receive the intervention; rollout is staggered in random order. Common for QI interventions where withholding indefinitely is unethical (e.g., universal MRSA screening).

— Individually randomized group-treatment trial (IRGT): individuals randomized, but treatment delivered in groups (e.g., group CBT); still requires clustering adjustment.

History clues in methods section:

— Mentions intraclass correlation coefficient (ICC) in the sample-size calculation.

— Mentions design effect or variance inflation factor.

— Uses mixed-effects / multilevel / hierarchical models or generalized estimating equations (GEE).

— Reports cluster-level baseline characteristics (clinic size, urban/rural, baseline outcome rate), because baseline imbalance between clusters is the dominant threat.

Key distinction: A multicenter individual RCT randomizes patients within each site (site is a stratification variable, not the unit of randomization). A CRT randomizes the site itself. The exam often hides this, so read what follows the verb "randomized" carefully: "randomized 30 clinics" (CRT) vs "randomized 3,000 patients across 30 clinics" (multicenter individual RCT).

Structural Features and "Hemodynamics" of a CRT

Three numbers define a CRT:

— k = number of clusters per arm (the main driver of power).

— m = average cluster size (patients per cluster).

— ICC (ρ) = intraclass correlation coefficient, 0–1, measuring how similar outcomes are within a cluster relative to between clusters.

Intraclass correlation coefficient (ICC):

— ICC = (between-cluster variance) / (between-cluster variance + within-cluster variance).

— Typical primary-care outcome ICCs are small (0.01–0.05), but small does not mean negligible.

— Higher ICC → patients within a cluster behave more alike → less independent information per patient → need more clusters.

Design effect (DE) / variance inflation factor:

— DE = 1 + (m − 1) × ICC

— Effective sample size = (total N) / DE.

— Example: ICC 0.02, m = 50 → DE = 1 + 49(0.02) = 1.98 → you need ~2× the patients of an individual RCT for equivalent power.

— Large clusters with even modest ICC dramatically erode power.

Power-boosting levers (in priority order):

— Add more clusters (k): best return on investment.

— Adding patients within existing clusters (m) gives diminishing returns beyond a point.

— Stratify or match clusters at randomization on baseline outcome rate, size, geography.

— Collect a baseline measure of the outcome → analyze with ANCOVA at cluster level.

Step 3 management (of the trial): When a CRT is underpowered, adding more sites trumps recruiting more patients per site. A trial with 4 clusters per arm and 500 patients each is far weaker than one with 20 clusters per arm and 100 patients each, even though total N is identical. Recognize this trade-off when a question asks how to improve a proposed CRT.
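
The design-effect arithmetic in this section can be sketched in a few lines of Python. This is a minimal illustration assuming equal cluster sizes; the function names are hypothetical, and the ICC of 0.02 used for the 4 × 500 vs 20 × 100 comparison is an assumed value, not one stated for that example.

```python
def design_effect(m, icc):
    """DE = 1 + (m - 1) * ICC, assuming all clusters have size m."""
    return 1 + (m - 1) * icc


def effective_n_per_arm(k_per_arm, m, icc):
    """Effective per-arm sample size = (k * m) / DE."""
    return k_per_arm * m / design_effect(m, icc)


# Example from the text: ICC 0.02, m = 50
print(round(design_effect(50, 0.02), 2))

# Same 2,000 patients per arm, arranged two ways (ICC = 0.02 is assumed)
for k, m in [(4, 500), (20, 100)]:
    print(k, "clusters x", m, "patients ->",
          round(effective_n_per_arm(k, m, 0.02)), "effective patients per arm")
```

Under these assumptions the 20-cluster layout retains several times more effective patients than the 4-cluster layout, which is the reason adding clusters is listed as the first lever.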

Diagnostic Workup — Identifying Threats to Validity

Selection bias / recruitment bias (post-randomization identification):

— In many CRTs, clusters are randomized before individual patients are identified or consented. If recruiters know their cluster's allocation, they may enroll different types of patients in intervention vs control clusters → differential selection.

— Mitigation: identify and enroll all eligible patients before cluster randomization, or use blinded recruiters, or use routinely collected data (registries, EHR) for all eligible patients.

Baseline imbalance:

— With few clusters (e.g., 6–10 per arm), chance imbalance in cluster-level characteristics is common, much more so than in patient-level RCTs.

— Mitigation: stratified or matched randomization, covariate-constrained randomization, or analytic adjustment.

Contamination across clusters:

— Patients move between clusters; providers may work at multiple sites.

— Mitigation: geographic separation, buffer zones, single-affiliation providers.

Differential attrition:

— Patients in control clusters may drop out if they perceive inequity.

Loss of blinding:

— Often impossible to blind providers delivering the intervention. Blind outcome assessors and use objective endpoints (mortality, hospitalization, lab values) when possible.

CONSORT extension for CRTs mandates reporting:

— Number of clusters and patients at each stage (CONSORT flow diagram, cluster version).

— ICC and how clustering was handled in analysis.

— Method of cluster allocation and identification of participants.

Board pearl: The classic CRT pitfall on Step 3/epi questions is post-randomization recruitment bias: if a primary care clinic knows it's in the "intervention arm," its nurses may preferentially enroll healthier (or sicker) patients. This is essentially a CRT-specific form of selection bias that patient-level RCTs do not have, because patient-level RCTs randomize after consent.

Advanced / Confirmatory Methodology — Stepped-Wedge and Crossover Designs

Stepped-wedge cluster randomized trial (SW-CRT):

— All clusters start in the control state; at randomly assigned steps (time points), clusters cross over to intervention one-by-one (or in groups) until all are receiving it.

— Each cluster contributes data in both control and intervention periods → within-cluster comparison + between-cluster.

— Indicated when: intervention is believed effective and withholding is ethically uncomfortable; logistical rollout is staggered anyway; small number of clusters.

— Threat: secular trends. Outcomes may improve over time independent of intervention (e.g., national QI campaigns). Analysis must include time as a fixed effect in the mixed model.

Cluster crossover trial:

— Each cluster receives both intervention and control in different periods with washout between. Useful for short-acting interventions (e.g., shift-level protocols in EDs).

Covariate-constrained (restricted) randomization:

— Generate many possible allocations; keep only those that balance pre-specified cluster-level covariates (size, baseline rate, urban/rural); randomly draw the final allocation from the constrained set.

— Particularly useful when k is small (<20 per arm) and simple randomization can't be trusted to balance.

Pragmatic features common in CRTs:

— Enroll broad populations (real-world eligibility).

— Use routinely collected outcomes (claims, EHR, registries).

— Align with PRECIS-2 pragmatic framework.

Key distinction: In a parallel CRT, the comparison is between intervention and control clusters at the same time. In a stepped-wedge CRT, the comparison is largely within clusters across time (before vs after their crossover), which is why time-period effects must always be modeled; otherwise a secular improvement is mistaken for an intervention effect.
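
To make the stepped-wedge layout concrete, here is a minimal sketch that randomizes the crossover order and builds the exposure table (0 = control, 1 = intervention). The function name, the cluster labels, and the choice of one all-control baseline period followed by one period per step are illustrative assumptions, not a prescribed design.

```python
import random


def stepped_wedge_schedule(clusters, n_steps, seed=None):
    """Randomly order clusters and assign them to crossover steps.

    Returns {cluster: [exposure per period]} over n_steps + 1 periods.
    Period 0 is all-control; a cluster assigned to step s is exposed
    from period s onward, so every cluster is exposed by the last period.
    """
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)  # the random element: order of crossover
    schedule = {}
    for i, cluster in enumerate(order):
        step = 1 + i % n_steps  # spread clusters evenly across steps
        schedule[cluster] = [1 if period >= step else 0
                             for period in range(n_steps + 1)]
    return schedule


# Six hypothetical clinics crossing over in three steps
for clinic, row in sorted(stepped_wedge_schedule("ABCDEF", 3, seed=1).items()):
    print(clinic, row)
```

Each row switches from 0 to 1 exactly once and never back, which is the unidirectional crossover that separates a stepped-wedge design from a cluster crossover. Because exposure is tied to calendar period, the analysis still needs a time term.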

Sample Size and First-Line Analytic Logic

Sample size formula (parallel CRT, continuous outcome):

— Compute the individual-RCT sample size first, then multiply by the design effect (DE) = 1 + (m − 1)ρ.

— Then divide the inflated per-arm sample size by the cluster size m and round up to get k (number of clusters per arm).

Worked example:

— Individual RCT needs 800 patients (400/arm). m = 40, ICC = 0.03 → DE = 1 + 39(0.03) = 2.17 → adjusted N ≈ 1,736 (868/arm) → 868 / 40 = 21.7 → 22 clusters per arm.

Three analytic strategies (pick based on cluster count):

— Cluster-level analysis (aggregate each cluster to a single mean/proportion, then t-test or weighted regression): most robust when k is small (<15–20 per arm).

— Generalized estimating equations (GEE): population-average effects; requires ~≥40 clusters total for reliable SEs (small-sample corrections such as Kauermann-Carroll or Fay-Graubard are needed otherwise).

— Mixed-effects (multilevel) models with random intercept for cluster: cluster-specific (conditional) effects; flexible, handles missing data better; preferred when k is moderate-to-large.

What to report (CONSORT-CRT):

— Estimated and observed ICC.

— Effect estimate with CI adjusted for clustering.

— Both intention-to-treat (at the cluster level) and per-protocol where relevant.

Step 3 management: If a question shows a CRT analyzed with an ordinary t-test or chi-square on individual patients, the correct critique is that standard errors are underestimated → p-values are spuriously small and CIs spuriously narrow → results are not interpretable. The fix: re-analyze with a method that accounts for the intraclass correlation (mixed model, GEE, or cluster-level summary).
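
The worked sample-size example can be checked with a short script. This is a sketch assuming equal cluster sizes and two equal arms; the function name is hypothetical.

```python
import math


def clusters_per_arm(n_per_arm_individual, m, icc):
    """Inflate an individual-RCT per-arm sample size by the design effect,
    then convert to a whole number of clusters of size m."""
    de = 1 + (m - 1) * icc
    n_per_arm_adjusted = n_per_arm_individual * de
    k = math.ceil(n_per_arm_adjusted / m)  # round up: partial clusters don't exist
    return de, n_per_arm_adjusted, k


de, n_adj, k = clusters_per_arm(400, 40, 0.03)
print(f"DE = {de:.2f}, adjusted N per arm = {n_adj:.0f}, clusters per arm = {k}")
```

With the numbers from the worked example (400 per arm, m = 40, ICC = 0.03), this reproduces DE = 2.17, 868 patients per arm (1,736 total), and 22 clusters per arm.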

"Pharmacotherapy" — Choosing the Right Analytic Model

Continuous outcome (e.g., systolic BP):

— Linear mixed model: random intercept for cluster; fixed effects for treatment, baseline value, stratification variables.

— Cluster-level analysis: compute mean change per clinic, then unpaired t-test across clinics (df = total clusters − 2, i.e., 2k − 2 with k clusters per arm).

Binary outcome (e.g., hospitalization, mortality):

— Logistic mixed model (random intercept) → cluster-specific OR.

— GEE with logit link and exchangeable working correlation → population-average OR (often the more clinically interpretable estimate for public health).

— Cluster-level analysis: compute event rate per cluster → analyze rates.

Count/rate outcome (e.g., infections per 1,000 patient-days):

— Poisson or negative binomial mixed model with cluster random effect and offset for person-time.

Time-to-event:

— Shared frailty Cox model, in which the frailty term is the cluster random effect.

Small-sample corrections (when total clusters < 30–40):

— Use the t-distribution with (total clusters − 2) degrees of freedom for cluster-level tests.

— Apply Kenward-Roger correction in linear mixed models.

— Use bias-corrected sandwich estimators for GEE.

Stepped-wedge specific:

— Always include categorical time (step) as a fixed effect to adjust for secular trends.

— Consider time-by-treatment interaction if effect may grow with implementation maturity.

Board pearl: The number that determines analytic options is k (clusters per arm), not n (patients). With k = 6 per arm, you essentially have 6 vs 6 observations at the cluster level. Most stems that imply "thousands of patients = big study" are wrong if there are only a handful of clusters. Power, CI width, and analytic choice all hinge on k.
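
As a rough illustration of the cluster-level summary approach for a continuous outcome, the sketch below collapses each clinic to one number and runs a pooled-variance two-sample t-test on those numbers. The clinic values are made up for illustration and the function name is hypothetical; the degrees of freedom are the total number of clusters minus 2.

```python
import math
from statistics import mean, variance


def cluster_level_t(intervention_means, control_means):
    """Pooled-variance t statistic computed on one summary value per cluster.

    Returns (difference in means, t statistic, degrees of freedom).
    """
    k1, k0 = len(intervention_means), len(control_means)
    diff = mean(intervention_means) - mean(control_means)
    pooled_var = ((k1 - 1) * variance(intervention_means)
                  + (k0 - 1) * variance(control_means)) / (k1 + k0 - 2)
    se = math.sqrt(pooled_var * (1 / k1 + 1 / k0))
    return diff, diff / se, k1 + k0 - 2


# Hypothetical mean change in systolic BP (mmHg) per clinic, 6 clinics per arm
intervention = [-8, -6, -10, -7, -9, -8]
control = [-3, -5, -2, -4, -6, -4]
diff, t_stat, df = cluster_level_t(intervention, control)
print(f"difference = {diff:.1f} mmHg, t = {t_stat:.2f}, df = {df}")
```

However many patients sit behind each clinic mean, the test has only 10 degrees of freedom here, so the statistic is compared against the t-distribution for 10 df (two-sided 5% critical value about 2.23) rather than the normal distribution.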

Implementation — Randomization, Allocation, and Enrollment Sequence

Recommended sequence (to prevent selection bias):

— Step 1: Define eligibility at both cluster and patient levels.

— Step 2: Identify and recruit clusters; obtain cluster (organizational) consent, sometimes called "gatekeeper" consent (medical director, IRB at each site).

— Step 3: When possible, identify and consent individual participants BEFORE cluster randomization, or commit to enrolling all eligible patients via routine data.

— Step 4: Randomize clusters (simple, stratified, matched, or covariate-constrained).

— Step 5: Deliver intervention; collect outcomes; analyze accounting for clustering.

Allocation concealment at the cluster level:

— Generate allocation centrally; reveal only after cluster enrollment is finalized.

— Use stratified randomization by region or cluster size to minimize chance imbalance.

Consent considerations (CRT-specific):

— Cluster-level consent alone may be acceptable for low-risk, system-level interventions (e.g., changing default EHR order sets), with waiver of individual consent by IRB.

— Individual consent still required for collection of identifiable data, surveys, or biospecimens.

— Disclose at the right time: premature unblinding of cluster allocation to recruiters causes the identification/recruitment bias described earlier.

Pragmatic data collection:

— Use registries, claims, EHR, vital statistics; this minimizes differential ascertainment.

— Pre-specify the primary outcome and analytic model in a published protocol/SAP.

CCS pearl: Think of a CRT like a multi-hospital protocol rollout: you wouldn't let each ICU decide mid-study whether to enroll its "sickest" or "healthiest" patients, just as you wouldn't let one nurse pick which patients get the new sepsis bundle. Lock in the enrollment pathway before allocation is revealed, then let the system run.
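
For the covariate-constrained option in the randomization step, a minimal sketch is shown below: enumerate every equal split of the clusters, keep only the splits whose arms differ by no more than a chosen tolerance on one baseline covariate, and draw the final allocation at random from that set. The clinic labels, baseline rates, tolerance, and function names are all hypothetical; real trials usually balance several covariates at once.

```python
import itertools
import random
from statistics import mean


def constrained_allocations(covariate, max_diff):
    """Return every equal-split allocation whose arms differ by at most
    max_diff in the mean of a cluster-level covariate."""
    clusters = sorted(covariate)
    half = len(clusters) // 2
    acceptable = []
    for arm in itertools.combinations(clusters, half):
        control = [c for c in clusters if c not in arm]
        diff = (mean(covariate[c] for c in arm)
                - mean(covariate[c] for c in control))
        if abs(diff) <= max_diff:
            acceptable.append(arm)
    return acceptable


def draw_allocation(covariate, max_diff, seed=None):
    """Randomly pick the intervention arm from the acceptable allocations."""
    acceptable = constrained_allocations(covariate, max_diff)
    if not acceptable:
        raise ValueError("constraint too tight: no acceptable allocation")
    return random.Random(seed).choice(acceptable)


# Hypothetical baseline event rates for 8 clinics
baseline = {"A": 0.10, "B": 0.12, "C": 0.15, "D": 0.18,
            "E": 0.20, "F": 0.22, "G": 0.25, "H": 0.30}
print(draw_allocation(baseline, max_diff=0.02, seed=7))
```

One design caution: if the tolerance is so tight that only a handful of allocations survive, the draw is barely random any more, so the size of the acceptable set is worth checking before allocation is generated centrally.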

Special Populations — Few-Cluster and Unequal-Cluster Scenarios

Few clusters (k < 10 per arm):

— Standard mixed models and GEE produce anti-conservative (too narrow) CIs.

— Preferred analyses:

— Cluster-level summary t-test with (total clusters − 2) df.

— Permutation tests based on the cluster allocation distribution, which are exact and robust.

— Small-sample-corrected GEE (e.g., Fay-Graubard, Mancl-DeRouen).

— Sample size: aim for ≥4 clusters per arm minimum; below that, treat as a quasi-experiment.

Highly variable cluster sizes (m ranges widely):

— Design effect formula adjusts to: DE = 1 + [(CV² + 1) × m̄ − 1] × ρ, where CV is coefficient of variation of cluster size.

— Unequal sizes further reduce effective sample size.

— Mitigation: cap maximum cluster contribution, or weight cluster-level analyses by size with caution.

"Renal/hepatic equivalent" (vulnerable clusters):

— Safety-net hospitals, rural clinics, low-resource sites may have higher baseline event rates and larger between-cluster variance (higher ICC) → power suffers most here.

— Stratify randomization by safety-net status; report subgroup ICCs.

Missing data:

— Cluster-level missingness (a whole site drops out) is far more damaging than patient-level missingness: one lost cluster ≈ losing one observation in the cluster-level analysis.

Key distinction: In an individual RCT, losing 5% of patients is a minor issue. In a CRT with 8 clusters per arm, losing 1 cluster removes 12.5% of that arm's clusters and can unbalance baseline characteristics catastrophically. Pre-specify strategies for cluster retention (site PI engagement, simple data flows) the way you'd pre-specify ITT for individual dropouts.
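
The permutation test mentioned for few-cluster trials can be sketched as follows: recompute the difference in cluster-level means under every possible reassignment of clusters to arms, and report the share of reassignments at least as extreme as the one observed. The data are made-up clinic-level means and the function name is hypothetical; the sketch assumes unrestricted randomization, so a stratified or constrained trial would permute only within its own allowed allocations.

```python
import itertools
from statistics import mean


def permutation_p_value(intervention_means, control_means):
    """Exact two-sided permutation p-value for a difference in
    cluster-level means, enumerating all possible arm assignments."""
    pooled = list(intervention_means) + list(control_means)
    n, k1 = len(pooled), len(intervention_means)
    total = sum(pooled)
    observed = abs(mean(intervention_means) - mean(control_means))
    extreme = count = 0
    for idx in itertools.combinations(range(n), k1):
        s1 = sum(pooled[i] for i in idx)
        diff = abs(s1 / k1 - (total - s1) / (n - k1))
        count += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / count


# Hypothetical mean change in systolic BP (mmHg) per clinic, 6 clinics per arm
p = permutation_p_value([-8, -6, -10, -7, -9, -8], [-3, -5, -2, -4, -6, -4])
print(f"exact permutation p = {p:.4f}")
```

With 6 clusters per arm there are 924 possible allocations, so full enumeration is cheap. The smallest p-value such a test can return is limited by that count, which is another way of seeing why very small k caps what a CRT can show.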

Special Populations — Pragmatic, Vaccine, and Global-Health CRTs

Vaccine and infectious-disease CRTs:

— CRTs capture both direct (individual protection) and indirect (herd) effects; individual RCTs cannot.

— Classic example: village-randomized azithromycin for childhood mortality (MORDOR trial); community-randomized HPV vaccination studies.

— Design must consider transmission dynamics, geographic buffers, and fade-out of indirect effects over time.

Pragmatic implementation/QI trials:

— Hospital-randomized checklist trials, sepsis bundles, antibiotic stewardship: usually CRTs because intervention is unit-level.

— PRECIS-2 wheel: eligibility, recruitment, setting, organization, flexibility (delivery/adherence), follow-up, primary outcome, primary analysis, all rated for pragmatism.

Pediatric and school-based CRTs:

— Classrooms or schools are the cluster (e.g., obesity-prevention curricula). Children within a classroom share teachers, peer influence → high ICC.

Health-system policy CRTs:

— Randomize practices, ACOs, or counties to value-based payment models, screening reminders, or care-coordination programs.

— Important for Step 3 systems-based questions: some CMS demonstration projects use cluster randomization.

Ethical considerations specific to global health:

— Community engagement; consent of community leaders does not replace individual consent where required; equitable distribution post-trial.

Board pearl: When the exam asks why a vaccine trial used community randomization rather than individual randomization, the answer is usually to capture herd-immunity effects and avoid contamination (vaccinated and unvaccinated children in the same school protect each other), giving a more accurate estimate of population-level effectiveness rather than individual-level (direct) efficacy alone.

Complications — Common Analytic and Interpretive Errors

Unit of analysis error:

— Randomize clusters, analyze patients as independent → falsely narrow CIs, inflated type I error. The single most-tested CRT flaw.

Ignoring secular trends in stepped-wedge designs:

— Without a time term, an underlying improvement (e.g., national initiative) is attributed to the intervention.

Baseline imbalance amplified by small k:

— With 6 clusters per arm, a single outlier site (e.g., academic medical center mixed with community clinics) can drive results.

Recruitment/identification bias:

— Post-randomization differential enrollment.

Cluster-level confounding mistaken for treatment effect:

— If urban clinics happened to be randomized to intervention and rural to control, urban vs rural differences masquerade as effect.

Reporting failures:

— Not reporting ICC → readers can't appraise power.

— Reporting only individual-level p-values without clustering adjustment.

— Missing CONSORT-CRT cluster flow diagram.

External validity:

— Highly engaged participating clusters may not represent average practice → effects shrink on real-world rollout (the "volunteer site" effect).

Type II error in underpowered CRTs:

— Many CRTs publish "negative" findings that are actually inconclusive due to too few clusters.

Step 3 management: When critiquing a CRT, run a mental checklist: (1) Was clustering acknowledged in sample size? (2) Was clustering handled in analysis (mixed model/GEE/cluster summary)? (3) Was the ICC reported? (4) Was identification of participants done before or after allocation, and by whom? (5) For stepped-wedge: was time adjusted for? Missing any of these = high risk of bias.
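
To see how much a unit-of-analysis error can flatter a result, the sketch below takes a p-value from an analysis that treated patients as independent and approximates what it would become once the variance is inflated by the design effect. This is a back-of-the-envelope illustration only: it assumes equal cluster sizes and a large-sample normal test statistic, the function name is hypothetical, and the inputs (naive p = 0.02, m = 50, ICC = 0.02) are illustrative values.

```python
import math
from statistics import NormalDist


def cluster_adjusted_p(naive_p, m, icc):
    """Approximate two-sided p-value after dividing the naive z statistic
    by sqrt(DE), where DE = 1 + (m - 1) * ICC."""
    de = 1 + (m - 1) * icc
    z_naive = NormalDist().inv_cdf(1 - naive_p / 2)
    z_adjusted = z_naive / math.sqrt(de)  # SE grows by sqrt(DE)
    return 2 * (1 - NormalDist().cdf(z_adjusted))


print(round(cluster_adjusted_p(0.02, m=50, icc=0.02), 3))
```

Under these assumptions a "significant" naive p of 0.02 moves to roughly 0.10, even with an ICC that looks small. A proper re-analysis would use a mixed model, GEE, or cluster-level summary rather than this shortcut.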

When to Escalate — Choosing CRT vs Alternative Designs

Use a CRT when:

— Intervention is inherently delivered at group level (policy, EHR, environment).

— Contamination between treated and untreated individuals in the same setting is unavoidable.

— Outcome of interest is at the population level (transmission, herd effect, system performance).

Do NOT use a CRT (use individual RCT) when:

— Intervention can be cleanly delivered to individuals (a new drug given by pill).

— No meaningful contamination risk.

— Power and efficiency are paramount; individual RCTs are far more efficient per patient.

Consider a stepped-wedge CRT when:

— Logistical rollout is staggered anyway.

— Equipoise is partial: strong belief in benefit makes simultaneous withholding hard to justify.

— Small number of available clusters → within-cluster comparison gains efficiency.

Consider a cluster crossover when:

— Intervention is short-acting and reversible (e.g., shift-based ED protocols).

— Washout is feasible.

Consider quasi-experimental (interrupted time series, difference-in-differences) when:

— Randomization is impossible (natural experiments, policy rollouts).

— Causal inference will be weaker but feasible.

Escalate to a biostatistician early: CRT design choices are not patchable after the fact.

CCS pearl: Think of design choice like triage: individual RCT is the "floor" (efficient, default), CRT is the "step-down unit" (needed when contamination or group-level delivery forces it), stepped-wedge is the "ICU rollout" (used when staggered implementation + equipoise constraints dominate), and quasi-experimental is the "comfort care" of causal inference, the last resort when nothing can be randomized.

Key Differentials — Other Cluster-Aware Designs

Multicenter individual RCT:

— Patients randomized within sites; site is a stratification variable, not the unit of randomization.

— Analyzed with site as a fixed or random effect. Between-site outcome differences still exist, but because both arms are represented within every site, they inflate the variance of the treatment comparison far less than in a CRT.

— Most efficient design when contamination isn't a concern.

Individually randomized group-treatment trial (IRGT):

— Patients individually randomized but treated in groups (group therapy, surgical team learning curves).

— Within arm, patients clustered by therapy group/surgeon → must adjust for post-randomization clustering.

Cluster crossover trial:

— Each cluster receives both conditions in random order.

Stepped-wedge CRT:

— Staggered unidirectional crossover from control to intervention.

N-of-1 trial:

— Single patient, multiple crossovers: the opposite end of the granularity spectrum.

Adaptive platform trials with cluster components:

— REMAP-CAP and similar: usually individual-randomized, but pragmatic and multi-site.

Key distinction: "Multicenter" ≠ "cluster randomized." A 50-site trial that randomizes 5,000 individual patients across those sites is a multicenter individual RCT; sites are blocks/strata, not randomization units. A 50-site trial that randomizes the sites themselves (25 to intervention, 25 to control) is a CRT. What was randomized tells you everything. On Step 3, watch for stems like "clinics were randomly assigned" (CRT) vs "patients at 30 clinics were randomly assigned" (multicenter RCT): these are almost identical sentences with completely different implications for analysis and interpretation.

Key Differentials — Non-Randomized Group-Level Designs

Interrupted time series (ITS):

— Single group with repeated outcome measurements before and after an intervention.

— Strong design when randomization is impossible (e.g., statewide policy change).

— Threats: concurrent events, regression to the mean, secular trends.

— Analyzed with segmented regression, which estimates level change and slope change at intervention.

Difference-in-differences (DiD):

— Two (or more) groups, one exposed to intervention, both followed over time.

— Key assumption: parallel trends before intervention.

— Common in health-policy research (e.g., Medicaid expansion effects).

Stepped-wedge vs ITS:

— Stepped-wedge randomizes the order of crossover → causal inference.

— ITS does not randomize → causal inference relies on assumptions.

Pre-post (before-after) with no control:

— Weakest design: confounded by everything that changed over time. Common in low-quality QI literature.

Quasi-experimental cluster designs:

— Non-randomized comparison of clusters (e.g., hospitals that adopted vs didn't adopt a protocol). Selection bias dominates because adopters differ systematically.

Cross-sectional cluster surveys:

— Useful for prevalence but not for intervention evaluation.

Board pearl: A stepped-wedge CRT is randomized and analyzed with adjustment for time; an interrupted time series is not randomized and is analyzed as segmented regression. If the stem says "the order in which clinics adopted the intervention was randomly assigned," it's a stepped-wedge CRT. If the stem says "we observed outcomes before and after the hospital adopted the protocol," it's an ITS. The randomization verb is the decisive clue.
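
The difference-in-differences logic reduces to simple arithmetic, sketched here with made-up event rates. The function name and the numbers are hypothetical, and the estimate is only meaningful if the parallel-trends assumption holds.

```python
def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """DiD estimate: change in the exposed group minus change in the
    comparison group over the same period."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical readmission rates (%) before and after a policy change
naive_pre_post = 14.0 - 20.0  # uncontrolled before-after change in the exposed group
did = difference_in_differences(20.0, 14.0, 21.0, 18.0)
print(naive_pre_post, did)
```

In this made-up example the uncontrolled pre-post comparison credits the policy with a 6-point drop, while the DiD estimate is 3 points because the comparison group also improved by 3 points over the same period.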

"Secondary Prevention" — Pre-Specification, Protocols, and Reporting

Protocol publication:

— Pre-register on ClinicalTrials.gov with the cluster as the randomization unit clearly indicated.

— Publish full protocol and Statistical Analysis Plan (SAP) before unblinding.

CONSORT extension for cluster trials (CONSORT 2012 extension):

— Mandatory items include: rationale for cluster design, how clustering was addressed in sample size and analysis, ICC reported, two-level flow diagram (clusters and individuals), recruitment and identification sequence, who gave consent (cluster vs individual).

SPIRIT extension for CRT protocols: parallel pre-trial reporting standard.

Pre-specify:

— Primary outcome and analytic model (with clustering adjustment method).

— Subgroup analyses at the cluster level vs individual level.

— Handling of missing data and dropped clusters.

— ICC assumption used in power calculation (and plan to report observed ICC).

Data sharing and reproducibility:

— Share de-identified cluster-level and individual-level data when possible; depositing analysis code allows replication of clustered analyses.

Long-term follow-up:

— Sustainability of cluster-level interventions often declines after trial end, so plan implementation-science follow-on studies.

Step 3 management: When the exam shows a published CRT and asks "what is most important to confirm before applying these results in practice?", the high-yield answer is usually one of: (1) clustering was accounted for in analysis, (2) ICC and design effect are reported, (3) the participating clusters resemble your real-world setting, or (4) the result remained significant after adjustment for cluster-level baseline imbalance.

Follow-Up, Monitoring, and Quality Metrics for CRTs

Process monitoring during trial:

— Track cluster-level fidelity: did each intervention site actually implement the protocol? Heterogeneous implementation dilutes effect size.

— Monitor contamination: are control clusters adopting elements of the intervention? Survey providers and check EHR for protocol-related orders.

— Track cluster retention: site dropout is the CRT analog of "loss to follow-up."

Data Safety Monitoring Board (DSMB):

— Pre-specify interim analyses with cluster-adjusted test statistics. Stopping boundaries (O'Brien-Fleming, Pocock) apply but require cluster-corrected information fraction.

— Watch for cluster-level harms (e.g., one nursing home with unusual adverse-event rate).

Reporting observed ICC:

— Compare observed vs assumed ICC; if observed is much higher, true power was lower than planned.

Implementation outcomes (post-trial):

— Reach, Effectiveness, Adoption, Implementation, Maintenance (RE-AIM framework), especially for pragmatic CRTs.

Equity monitoring:

— Are clusters serving disadvantaged populations represented? Does the intervention effect differ across cluster subgroups?

CCS pearl: Treat CRT monitoring like rounding on a ward: each cluster is a "patient" you need to keep alive in the study. A site that stops returning data is the equivalent of a patient lost to follow-up at the bedside, except losing one cluster can cost you 10–20% of your effective sample size. Build redundancy: site champions, simple data flows, regular check-ins, contingency plans for staff turnover.
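
For the observed-vs-assumed ICC comparison, one common way to estimate the observed ICC is the one-way ANOVA estimator, sketched below for a continuous outcome. This is a swapped-in illustration, not a method the notes prescribe; the function name and toy data are hypothetical, and mixed-model variance components are a frequent alternative.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC from a list of clusters,
    each given as a list of individual outcomes.

    ICC = (MSB - MSW) / (MSB + (m0 - 1) * MSW), where m0 is the
    size-adjusted average cluster size. Estimates can fall below 0
    and are often truncated to 0 in practice.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n
    cluster_means = [sum(c) / len(c) for c in clusters]
    ss_between = sum(size * (mu - grand_mean) ** 2
                     for size, mu in zip(sizes, cluster_means))
    ss_within = sum((y - mu) ** 2
                    for c, mu in zip(clusters, cluster_means) for y in c)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    m0 = (n - sum(size ** 2 for size in sizes) / n) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (m0 - 1) * ms_within)


# Toy data: three clusters whose members are identical within cluster
print(anova_icc([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
```

Plugging the observed estimate into DE = 1 + (m − 1) × ICC and comparing it with the planning value shows directly how much of the intended power was actually available.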

Ethical, Legal, and Patient Safety Considerations

Cluster ("gatekeeper") consent vs individual consent:

— Cluster-level (organizational) consent (from medical director, hospital IRB, school board) authorizes site participation but does not substitute for individual informed consent when individuals face more than minimal risk or when identifiable data are collected.

— Ottawa Statement on the Ethical Design and Conduct of CRTs is the standard reference: distinguishes research participants (patients exposed to the intervention) from research subjects (those providing identifiable data) and outlines when waivers of individual consent are appropriate.

Waiver of individual consent is appropriate when:

— Intervention is delivered at the system level (e.g., default EHR order, hand-hygiene poster).

— Risk is minimal and no more than usual care.

— Practicability of obtaining individual consent is low and waiver does not adversely affect rights/welfare.

Equipoise at the cluster level:

— Genuine uncertainty about effectiveness must exist; if intervention is strongly believed beneficial, stepped-wedge may be more ethical than parallel CRT.

Mandatory reporting and safety:

— Adverse events identified through cluster-level interventions (e.g., a new sepsis protocol leading to acute kidney injury) must be reported through the DSMB and to IRBs at each site; multi-IRB coordination is a real Step 3 systems issue.

Transition-of-care risk:

— When the trial ends, intervention clusters often stop the intervention abruptly. Pre-specify a sustainability or de-implementation plan so patients aren't stranded mid-protocol (e.g., abrupt withdrawal of a care-coordination nurse).

Equity:

— Post-trial access: if intervention works, plan dissemination to control clusters.

Board pearl: A school-based vaccination CRT may use parental opt-out consent with IRB waiver, but a CRT collecting children's biospecimens requires explicit individual/parental consent, regardless of cluster-level approval. The level of consent required tracks the level of individual risk and identifiability, not the level of randomization.

High-Yield Associations and Rapid-Fire Clinical Facts

Key distinction: "Multicenter" = many sites doing individual randomization. "Cluster randomized" = sites themselves are randomized. "Stepped-wedge" = randomized order of crossover from control to intervention. "Interrupted time series" = no randomization at all, segmented regression around an event. Four designs, four very different inferential strengths — recognize them by the verbs in the stem.

Design effect (DE) = 1 + (m − 1) × ICC. Memorize.

Effective sample size = total N / DE.

ICC = within-cluster correlation; typical primary-care outcomes 0.01–0.05; ICU outcomes 0.05–0.20; classroom behaviors 0.10–0.30.

Number of clusters (k) drives power more than cluster size (m). Beyond m ≈ 1/ICC, diminishing returns.

CONSORT 2012 cluster extension = mandatory reporting framework.

Ottawa Statement = ethical framework for CRTs (gatekeeper consent, waiver criteria).

Stepped-wedge = all clusters eventually get intervention; must adjust for calendar time.

Recruitment bias = post-randomization differential enrollment; CRT-specific.

Mixed-effects model with random intercept for cluster = default analytic approach.

GEE = population-average effects; needs ≥40 clusters or small-sample correction.

Cluster-level analysis (t-test on cluster means) = robust when k is small.

Covariate-constrained randomization = solution for chance imbalance with few clusters.

PRECIS-2 wheel = pragmatism rating tool, often paired with CRT methodology.

RE-AIM = implementation framework for translating CRT results to practice.

Unit of analysis error = single most-tested CRT critique.

MORDOR, Stop-CRC, IDEA, EPOCH = well-known published CRTs.

SW-CRT secular trend = always model time.

Equipoise at cluster level required.

Cluster dropout ≈ catastrophic loss in small-k trials.
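The design-effect and effective-sample-size facts above can be checked numerically. The following is a minimal Python sketch, not a library API: the function names are made up for illustration, and the inputs (60 clusters of 50 patients, ICC = 0.02) are assumed example values.

```python
def design_effect(m: float, icc: float) -> float:
    """Variance inflation from clustering: DE = 1 + (m - 1) * ICC."""
    if m < 1 or not 0.0 <= icc <= 1.0:
        raise ValueError("need m >= 1 and 0 <= ICC <= 1")
    return 1.0 + (m - 1.0) * icc


def effective_sample_size(total_n: float, m: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return total_n / design_effect(m, icc)


if __name__ == "__main__":
    # Assumed example: 60 clusters x 50 patients = 3,000 patients, ICC = 0.02.
    de = design_effect(m=50, icc=0.02)
    n_eff = effective_sample_size(total_n=3000, m=50, icc=0.02)
    print(f"DE = {de:.2f}, effective N = {n_eff:.0f}")
    # prints: DE = 1.98, effective N = 1515
```

With m = 1 the design effect is exactly 1, which recovers the individually randomized case.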

Board Question Stem Patterns

Pattern 1 — "Spot the design":

— Stem: "Investigators randomized 24 primary care clinics to a depression collaborative-care model vs usual care…"

— Answer: cluster randomized trial.

— Trap: confusing with multicenter individual RCT.

Pattern 2 — "Critique the analysis":

— Stem: CRT analyzed with chi-square treating all 3,000 patients as independent → p = 0.02.

— Answer: unit of analysis error; standard errors too small; must adjust for intraclass correlation (mixed model, GEE, or cluster-level summary).

Pattern 3 — "Calculate design effect":

— Given m = 50 and ICC = 0.02, DE = 1 + 49(0.02) = 1.98; effective N is roughly halved (N / 1.98).

Pattern 4 — "Why use a CRT here?":

— Stem: hand-hygiene campaign or school vaccination program.

— Answer: intervention is delivered at the group level and/or contamination would bias an individual RCT; also captures herd/system effects.

Pattern 5 — "Threat to validity":

— Stem: clinics knew their assignment before enrolling patients, and intervention clinics enrolled younger, healthier patients.

— Answer: post-randomization recruitment/identification bias, a selection bias characteristic of CRTs.

Pattern 6 — "Stepped-wedge interpretation":

— Stem: outcomes improved in all clusters over time; intervention effect appears modest.

— Answer: must adjust for secular (calendar-time) trends; in a stepped-wedge design later periods have more clusters on intervention, so an unadjusted analysis credits the background improvement to the intervention and overstates its effect.

Pattern 7 — "Consent question":

— Stem: hospital-wide default EHR order set for VTE prophylaxis randomized across hospitals — should individual patients consent?

— Answer: usually IRB grants waiver of individual consent (minimal risk, system-level), but organizational/gatekeeper consent required.

Step 3 management: On any methods question, first identify the unit of randomization in one sentence; then ask whether the unit of analysis matches. Mismatch = the answer is almost always "account for clustering." This single heuristic handles a large fraction of Step 3 biostatistics CRT questions.
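The unit of analysis error in Pattern 2 can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not a recommended analysis pipeline: it generates trials with no true treatment effect (10 clusters per arm, 50 patients each, ICC = 0.10, all assumed values), then compares a naive test that pools patients against a t-test on cluster means. The helper names are hypothetical, and the critical value 2.101 is hard-coded for 18 degrees of freedom, so it is valid only for 10 clusters per arm.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)


def simulate_arm(k, m, icc, rng):
    """Return k clusters of m outcomes; total variance 1, no treatment effect."""
    between_sd = math.sqrt(icc)        # shared cluster-level random intercept
    within_sd = math.sqrt(1.0 - icc)   # individual-level noise
    clusters = []
    for _ in range(k):
        intercept = rng.gauss(0.0, between_sd)
        clusters.append([intercept + rng.gauss(0.0, within_sd) for _ in range(m)])
    return clusters


def naive_rejects(arm_a, arm_b, z_crit=1.96):
    """Unit of analysis error: pool all patients and ignore clustering."""
    a = [y for cluster in arm_a for y in cluster]
    b = [y for cluster in arm_b for y in cluster]
    se = math.sqrt(sample_var(a) / len(a) + sample_var(b) / len(b))
    return abs(mean(a) - mean(b)) / se > z_crit


def cluster_level_rejects(arm_a, arm_b, t_crit):
    """Pooled two-sample t-test on cluster means (one value per cluster)."""
    a = [mean(cluster) for cluster in arm_a]
    b = [mean(cluster) for cluster in arm_b]
    k_a, k_b = len(a), len(b)
    pooled = ((k_a - 1) * sample_var(a) + (k_b - 1) * sample_var(b)) / (k_a + k_b - 2)
    se = math.sqrt(pooled * (1.0 / k_a + 1.0 / k_b))
    return abs(mean(a) - mean(b)) / se > t_crit


def false_positive_rates(n_trials=400, k=10, m=50, icc=0.10, seed=1, t_crit=2.101):
    """Share of null trials each analysis declares 'significant' at 5%.

    t_crit = 2.101 is the two-sided 5% cutoff for 2k - 2 = 18 df (k = 10 only).
    """
    rng = random.Random(seed)
    naive = cluster = 0
    for _ in range(n_trials):
        arm_a = simulate_arm(k, m, icc, rng)
        arm_b = simulate_arm(k, m, icc, rng)
        naive += naive_rejects(arm_a, arm_b)
        cluster += cluster_level_rejects(arm_a, arm_b, t_crit)
    return naive / n_trials, cluster / n_trials


if __name__ == "__main__":
    naive_rate, cluster_rate = false_positive_rates()
    print(f"naive: {naive_rate:.2f}  cluster-level: {cluster_rate:.2f}")
```

Under these assumptions DE = 1 + 49(0.10) = 5.9, so the naive standard error is too small by a factor of about √5.9 ≈ 2.4. The naive false-positive rate should land far above the nominal 5%, while the cluster-level test should stay close to it.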

One-Line Recap

A cluster randomized trial randomizes intact groups (clinics, hospitals, schools, communities) rather than individuals, trading statistical efficiency for the ability to study group-level interventions without contamination — and its validity hinges on accounting for intracluster correlation in both sample-size planning and analysis.

Board pearl: When in doubt on any CRT exam question, the correct critique is almost always "the analysis did not appropriately account for clustering" — recognize the design, compute or estimate the design effect, and demand a clustering-adjusted estimate before believing the p-value.

Design recognition: whatever the stem says was randomized is the unit. "Clinics were randomized" = CRT; "patients at clinics were randomized" = multicenter individual RCT.

Core math: Design effect = 1 + (m − 1) × ICC; effective N = total N / DE; add clusters (k), not patients per cluster (m), to boost power.

Analytic must: mixed-effects model, GEE, or cluster-level summary — never treat patients as independent. The unit of analysis must match (or properly adjust for) the unit of randomization.

Variants: parallel CRT (default), stepped-wedge (randomized rollout order, all eventually treated, adjust for time), cluster crossover (short-acting, reversible interventions).

CRT-specific biases: post-randomization recruitment bias, baseline cluster imbalance (worst when k is small), contamination across clusters, cluster dropout, secular trends in stepped-wedge designs.

Reporting standard: CONSORT 2012 cluster extension — two-level flow diagram, ICC reported, clustering method described.

Ethics anchor: Ottawa Statement — gatekeeper/organizational consent plus IRB waiver of individual consent acceptable for low-risk, system-level interventions; individual consent still required for identifiable data or above-minimal-risk interventions.
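For the sample-size planning side of the recap, a common first-pass approach is to inflate the per-arm n from an individually randomized calculation by the design effect and divide by cluster size. The sketch below assumes equal cluster sizes and ignores small-k corrections. The function name and the inputs (400 per arm, ICC = 0.05) are hypothetical.

```python
import math


def clusters_per_arm(n_individual: int, m: int, icc: float) -> int:
    """Clusters needed per arm: inflate individual-RCT n by DE, then divide by m."""
    design_effect = 1.0 + (m - 1) * icc
    # Round first so floating-point noise does not add a spurious extra cluster.
    return math.ceil(round(n_individual * design_effect / m, 9))


if __name__ == "__main__":
    n_individual = 400  # hypothetical per-arm n from an individual-RCT calculation
    icc = 0.05
    for m in (10, 20, 50, 200, 1000):
        print(f"m = {m:>4}: k = {clusters_per_arm(n_individual, m, icc)} clusters per arm")
    # As m grows, DE / m approaches ICC, so k cannot drop below
    # n_individual * ICC = 20 clusters per arm, however large the clusters are.
```

Going from m = 10 to m = 50 roughly halves the clusters needed (58 to 28), but going from m = 200 to m = 1000 barely helps (22 to 21). This is why adding clusters, not patients per cluster, is what buys power.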