Biostatistics & Population Health
Epidemiologic measures: incidence, prevalence, mortality
— Incidence = new cases over a defined time period ÷ population at risk (dynamic, forward-looking)
— Prevalence = total existing cases at a point (or over an interval) ÷ total population (static snapshot)
— Mortality = deaths ÷ population over time (a specific kind of incidence, where the "event" is death)

— "10,000 healthy adults followed for 5 years; 200 developed colon cancer"
— Calculate: cumulative incidence = 200/10,000 = 2% over 5 years
— Or incidence rate = 200 / (person-years of follow-up), often expressed per 1,000 person-years
— "On January 1, 2024, a survey of 5,000 adults in County X found 400 with diabetes"
— Point prevalence = 400/5,000 = 8%
— Period prevalence adds anyone who had the disease at any time during the period
— "Of 500 patients diagnosed with pancreatic cancer, 450 died within 1 year"
— Case fatality rate = 450/500 = 90% (denominator = diseased, not total population)
— Mortality rate = deaths ÷ total population at risk (different denominator!)
— Case fatality = how deadly the disease is once you have it (denominator: cases)
— Mortality rate = how much the disease burdens the whole population (denominator: everyone)
— Ebola has high case fatality (~50%) but low US mortality rate (rare). Coronary disease has modest case fatality but enormous US mortality rate (common).
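The three vignettes above are the same division with different denominators. A minimal Python sketch, assuming an illustrative helper named `proportion` (not part of any library), reproduces the figures:

```python
def proportion(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, rejecting an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator

# Cohort: 10,000 healthy adults followed 5 years, 200 new colon cancers
cumulative_incidence = proportion(200, 10_000)   # new cases / population at risk

# Survey: 5,000 adults, 400 with diabetes on the survey date
point_prevalence = proportion(400, 5_000)        # existing cases / total population

# 500 pancreatic cancer diagnoses, 450 deaths within 1 year
case_fatality = proportion(450, 500)             # deaths / diagnosed cases

print(f"{cumulative_incidence:.1%} {point_prevalence:.1%} {case_fatality:.1%}")
```

The only thing that changes between the three calls is who counts in the denominator: people at risk, everyone surveyed, or people already diagnosed.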

```
                   Disease+   Disease–
Exposed/Test+         A          B
Unexposed/Test–       C          D
```
— Incidence in exposed = A / (A+B)
— Incidence in unexposed = C / (C+D)
— Relative risk (RR) = [A/(A+B)] ÷ [C/(C+D)] — cohort studies
— Odds ratio (OR) = (A×D)/(B×C) — case-control studies, approximates RR when disease is rare
— Attributable risk (AR) = incidence(exposed) – incidence(unexposed)
— Prevalence = (A+C) / (A+B+C+D) at a point in time
— Incidence = "heart rate" — rate of new events
— Prevalence = "blood volume" — total burden currently circulating
— Mortality = "exit rate" — how fast the pool drains via death
— Cure/recovery = the other exit from prevalence
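The 2×2 formulas above can be sketched in a few lines of Python. The class name `TwoByTwo` and the cell counts are hypothetical, chosen only to show how RR, OR, and AR come from the same four cells A to D:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwoByTwo:
    a: int  # exposed, disease+
    b: int  # exposed, disease-
    c: int  # unexposed, disease+
    d: int  # unexposed, disease-

    def risk_exposed(self) -> float:
        return self.a / (self.a + self.b)

    def risk_unexposed(self) -> float:
        return self.c / (self.c + self.d)

    def relative_risk(self) -> float:
        return self.risk_exposed() / self.risk_unexposed()

    def odds_ratio(self) -> float:
        return (self.a * self.d) / (self.b * self.c)

    def attributable_risk(self) -> float:
        return self.risk_exposed() - self.risk_unexposed()

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop disease
table = TwoByTwo(a=30, b=70, c=10, d=90)
print(f"RR {table.relative_risk():.2f}, OR {table.odds_ratio():.2f}, "
      f"AR {table.attributable_risk():.2f}")
```

With these made-up counts the disease is common (20% overall), so the OR comes out larger than the RR, which is the rare-disease caveat in action.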

— Cumulative incidence (incidence proportion, "risk"):
— Formula: new cases ÷ population at risk at start of period
— Unitless proportion (0–1 or %)
— Assumes everyone followed the full period — bad when there's heavy loss to follow-up or competing risks
— Example: 50 new MIs among 1,000 men followed 10 years → 5% cumulative incidence
— Incidence rate (incidence density):
— Formula: new cases ÷ total person-time at risk
— Units: cases per person-year (or per 1,000 person-years)
— Handles variable follow-up, dropouts, late entries
— Example: 50 MIs in 8,500 person-years → 5.88 per 1,000 person-years
— Stop the clock when a person develops the outcome, dies, or is lost to follow-up
— Don't double-count: a person can contribute time to the "at risk" denominator only while they remain at risk and disease-free
— Attack rate = incidence during an outbreak (often used for foodborne illness, infectious disease investigations); denominator is the exposed group
— Secondary attack rate = new cases among contacts of a primary case ÷ susceptible contacts; gauges transmissibility
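The person-time arithmetic from the incidence-rate example in this block can be checked with a short Python sketch. The function name `incidence_rate` and the five follow-up records are illustrative assumptions, not data from any study:

```python
def incidence_rate(events: int, person_years: float, per: int = 1_000) -> float:
    """Events per `per` person-years of at-risk follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return events / person_years * per

# Figures from the example: 50 MIs over 8,500 person-years
rate = incidence_rate(50, 8_500)
print(f"{rate:.2f} per 1,000 person-years")

# Hypothetical follow-up records: (years at risk, developed outcome?)
# Each person's clock stops at the event, death, or loss to follow-up.
follow_up = [(10.0, False), (4.5, True), (7.0, False), (2.0, True), (6.5, False)]
events = sum(1 for _, had_event in follow_up if had_event)
person_years = sum(years for years, _ in follow_up)
small_cohort_rate = incidence_rate(events, person_years)
```

Summing each person's own at-risk time is what lets the rate handle dropouts and late entries without pretending everyone was followed for the full period.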

— Point prevalence: existing cases at one moment ÷ population at that moment
— Best for chronic, stable conditions (HTN, DM, schizophrenia)
— Period prevalence: cases existing at any point during an interval ÷ average population
— Useful for episodic conditions (migraine, asthma exacerbations, major depressive episodes over the past year)
— Lifetime prevalence: proportion who have ever had the condition (depression, substance use disorders) — never decreases over a person's lifetime
— Steady-state approximation: P / (1–P) ≈ Incidence rate × Average duration
— When P is small (<10%): P ≈ I × D
— Implication: anything that prolongs disease duration (better treatment that doesn't cure — e.g., HIV ART, heart failure GDMT) raises prevalence even as incidence is stable or falling
— Improved survival from cancer → ↑ prevalence even if ↓ mortality and stable incidence
— Earlier diagnosis (screening) → ↑ prevalence via lengthened apparent duration
— A cure → ↓ prevalence (people exit the pool via recovery)
— A more lethal disease variant → ↓ prevalence (faster exit via death)
— PPV = (sens × prev) / [(sens × prev) + (1–spec)(1–prev)]
— High prevalence → high PPV; low prevalence → low PPV, even when sensitivity and specificity are high
— This is why screening rare diseases yields many false positives even with "good" tests
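Two of the relationships in this block lend themselves to a quick numerical check: the steady-state link between incidence, duration, and prevalence, and the dependence of PPV on prevalence. The sketch below is in Python; the function names and all input values (incidence, duration, a 95%/95% test) are hypothetical:

```python
def steady_state_prevalence(incidence_rate: float, mean_duration: float) -> float:
    """Solve P / (1 - P) = I * D for P."""
    odds = incidence_rate * mean_duration
    return odds / (1 + odds)

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from the formula above."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical disease: 5 new cases per 1,000 person-years, mean duration 10 years
p = steady_state_prevalence(0.005, 10)   # close to the I x D shortcut of 0.05

# Hypothetical test with 95% sensitivity and 95% specificity
for prev in (0.001, 0.01, 0.10, 0.50):
    print(f"prevalence {prev:>6.1%} -> PPV {ppv(0.95, 0.95, prev):.1%}")
```

Running the loop shows PPV climbing steeply as prevalence rises while the test itself never changes, which is the point behind the false-positive problem in rare-disease screening.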

— "How common is this disease right now?" → Prevalence
— "What's the risk of getting it?" → Cumulative incidence
— "What's the rate of new cases per person-time?" → Incidence rate
— "How deadly is it once you have it?" → Case fatality
— "How much does this disease kill the population?" → Mortality rate
— "Did the exposure cause disease?" → RR or OR
— "How much disease would disappear if we removed the exposure?" → Attributable risk / Population attributable risk
— Crude rate = total events / total population (no adjustment)
— Age-adjusted (standardized) rate = what the rate would be if the population had a standard age distribution
— Used to compare populations with different demographics (e.g., Florida vs Alaska cancer rates)
— Direct standardization: apply observed age-specific rates to a standard population
— Indirect standardization: apply standard age-specific rates to your population → yields Standardized Mortality Ratio (SMR) = observed deaths ÷ expected deaths
— SMR = 1.0 → same mortality as reference
— SMR > 1.0 → excess mortality (e.g., occupational cohort vs general population)
— SMR < 1.0 → "healthy worker effect" common in employed cohorts
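Direct and indirect standardization differ only in which rates are applied to which population. A minimal Python sketch with three hypothetical age strata follows; every rate and count is invented for illustration:

```python
def direct_adjusted_rate(age_specific_rates, standard_population):
    """Weight observed stratum rates by a standard population's age structure."""
    total = sum(standard_population)
    weighted = sum(r * n for r, n in zip(age_specific_rates, standard_population))
    return weighted / total

def smr(observed_deaths, reference_rates, study_population):
    """Indirect standardization: observed deaths / deaths expected at reference rates."""
    expected = sum(r * n for r, n in zip(reference_rates, study_population))
    return observed_deaths / expected

# Hypothetical strata (young, middle, old); rates are deaths per person-year
rates_a = [0.001, 0.005, 0.040]          # observed in population A
standard = [50_000, 30_000, 20_000]      # standard population counts
adjusted = direct_adjusted_rate(rates_a, standard)

reference_rates = [0.001, 0.004, 0.030]  # general-population rates
cohort = [2_000, 5_000, 1_000]           # person-years in the study cohort
ratio = smr(60, reference_rates, cohort) # 60 observed vs 52 expected deaths
```

In this made-up cohort the SMR is above 1.0, which would be read as excess mortality relative to the reference population.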

— RR = incidence(exposed) / incidence(unexposed)
— RR = 1: no association; >1: harmful exposure; <1: protective
— Confidence interval crossing 1 = not statistically significant
— OR = (A×D)/(B×C)
— Approximates RR when disease is rare (<10%)
— Exaggerates RR (pushes the estimate farther from 1) when disease is common: a Step 3 trap
— Hazard ratio (HR): ratio of instantaneous event rates (hazards) over time; interpreted much like RR
— ARR = incidence(control) – incidence(treated)
— Drives NNT = 1/ARR
— RRR = (incidence_control – incidence_treated) / incidence_control = 1 – RR
— Drug ads love RRR because it sounds bigger ("50% reduction!") even when ARR is tiny (2% → 1%)
— AR = incidence(exposed) – incidence(unexposed)
— AR% = AR / incidence(exposed) = (RR–1)/RR
— Tells you the proportion of disease in the exposed attributable to exposure
— PAR = incidence(total population) – incidence(unexposed)
— Expressed as PAR% = PAR / incidence(total population), it gives the fraction of all disease in the population that is due to the exposure; this drives smoking cessation and BP control campaigns
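The effect measures in this block can be computed side by side. The Python sketch below uses the 2% → 1% drug example from the text plus a hypothetical exposure (30% vs 10% risk, 25% of the population exposed); the function names are illustrative:

```python
def treatment_effect(risk_control: float, risk_treated: float) -> dict:
    """RR, ARR, RRR, and NNT for a treatment that lowers risk."""
    arr = risk_control - risk_treated
    return {
        "RR": risk_treated / risk_control,
        "ARR": arr,
        "RRR": arr / risk_control,
        "NNT": 1 / arr,   # conventionally rounded up to a whole patient
    }

def attributable_measures(risk_exposed: float, risk_unexposed: float,
                          exposure_prevalence: float) -> dict:
    """AR, AR%, PAR, and PAR% for a harmful exposure."""
    ar = risk_exposed - risk_unexposed
    risk_total = (exposure_prevalence * risk_exposed
                  + (1 - exposure_prevalence) * risk_unexposed)
    par = risk_total - risk_unexposed
    return {"AR": ar, "AR%": ar / risk_exposed,
            "PAR": par, "PAR%": par / risk_total}

drug = treatment_effect(0.02, 0.01)                   # "50% reduction", ARR of 1 point
exposure = attributable_measures(0.30, 0.10, 0.25)    # hypothetical exposure
```

For the drug, the relative reduction is 50% while the absolute reduction is one percentage point, so about 100 patients are treated per event prevented. For the exposure, AR% matches (RR–1)/RR with RR = 3.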

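— Cause-specific mortality rate: e.g., US cardiovascular mortality ~ 220 per 100,000/year
— Case fatality: disease-severity metric, not population-burden metric
— Proportionate mortality (deaths from one cause ÷ all deaths): does NOT measure risk; useful for descriptive epidemiology only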
— Beware: a cause can have high proportionate mortality just because other causes are rare
— Years of potential life lost (YPLL): highlights diseases killing young people (suicide, MVCs, overdose) even when total death counts are modest
— Disability-adjusted life years (DALYs): Global Burden of Disease framework
— Kaplan-Meier curve: step function showing cumulative survival over time
— Median survival = time at which 50% of cohort has died (read off K-M curve)
— 5-year survival = % alive at 5 years (NOT cure rate; not the same as case fatality)
— Log-rank test: compares two survival curves
— Cox proportional hazards model: generates HR while adjusting for covariates
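To make the Kaplan-Meier step function concrete, here is a minimal product-limit sketch in plain Python. It is a teaching illustration under simple assumptions (right censoring only, no confidence intervals), and the eight follow-up records are hypothetical:

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct death time.

    times:  follow-up time per subject
    events: True if death was observed, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk   # step down only at death times
            curve.append((t, survival))
        at_risk -= leaving                     # deaths and censored both leave
    return curve

def median_survival(curve):
    """First time at which estimated survival is 50% or lower (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical cohort: months of follow-up and whether death was observed
times = [2, 3, 3, 5, 6, 8, 9, 12]
events = [True, True, False, True, False, True, True, False]
curve = kaplan_meier(times, events)
```

Censored subjects shrink the at-risk denominator without producing a step, which is why 1 minus the 5-year survival estimate is not simply the proportion observed to die.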

— Higher baseline incidence of nearly all chronic diseases (cancer, HF, dementia, CKD)
— Competing risks dominate: an 85-year-old "at risk" for colon cancer death is more likely to die first of cardiovascular disease, distorting cumulative incidence calculations
— Prevalence balloons because chronic disease duration is long and accumulates with age
— Standard Kaplan-Meier overestimates cumulative incidence when competing causes of death are common
— Cumulative incidence function (CIF) or Fine-Gray subdistribution hazard model handles competing risks properly
— Step 3 won't ask the math, but will ask which patients need adjusted analyses — answer: elderly, multimorbid, oncology cohorts
— Always report age-specific rates when describing geriatric disease patterns
— Age-adjusted rates can hide the steep age gradient
— Elderly research participants are healthier than age peers (selection bias)
— Inflates apparent benefits of interventions in observational studies; less of an issue in RCTs
— Often excluded from RCTs → external validity (generalizability) is limited
— When extrapolating RCT incidence/mortality data to CKD or cirrhotic patients, expect higher event rates and more adverse drug effects than the trial showed

— Maternal mortality ratio: maternal deaths during pregnancy or within 42 days of its end, per 100,000 live births (US: ~22 per 100,000, high for a developed country)
— Pregnancy-related mortality: extends to 1 year postpartum
— Infant mortality rate (IMR): deaths <1 year of age per 1,000 live births (US: ~5.4)
— Neonatal mortality: deaths <28 days per 1,000 live births
— Perinatal mortality: stillbirths (≥20 or ≥28 weeks, depending on the definition used) + early neonatal deaths (<7 days), per 1,000 total births (live births + stillbirths)
— Stillbirth (fetal death) rate: fetal deaths ≥20 weeks per 1,000 total births
— Maternal mortality uses live births (not pregnancies) — a convention, not a perfect denominator
— Perinatal mortality and the stillbirth rate count stillbirths in both numerator AND denominator, unlike infant mortality, which uses live births only
— Under-5 mortality rate: deaths before age 5 ÷ 1,000 live births — key global health metric
— Childhood incidence of vaccine-preventable disease is a sentinel surveillance metric
— Health disparities are described as differences in incidence, prevalence, or mortality between groups
— Black maternal mortality in the US is ~3× white maternal mortality — a Step 3 ethics/health systems favorite

— Berkson bias: hospital-based case-control studies can show spurious exposure-disease associations because hospitalization itself is influenced by both
— Healthy worker effect: occupational cohorts show falsely low mortality (SMR <1) because sick people leave the workforce
— Non-response bias: survey-based prevalence skewed if responders differ from non-responders
— Recall bias (case-control studies): cases remember exposures better than controls — inflates OR
— Surveillance/detection bias: more testing → higher apparent incidence (classic example: thyroid cancer "epidemic" driven by neck ultrasound)
— Misclassification: non-differential biases toward null; differential biases either direction

— Endemic: baseline level of disease consistently present in a population
— Epidemic/outbreak: incidence exceeds expected baseline in a defined population/time
— Pandemic: epidemic spanning multiple countries/continents
— Cluster: aggregation of cases in time/space; may or may not exceed baseline
— Confirm the diagnosis and verify the outbreak
— Define a case and case-find
— Describe by person, place, time (epi curve)
— Generate hypotheses → test with case-control or cohort design
— Implement control measures, communicate, follow up
— Notifiable to local/state health departments → CDC (NNDSS)
— Examples: TB, syphilis, gonorrhea, HIV, measles, pertussis, hepatitis A/B/C, Lyme, COVID-19, Zika, foodborne pathogens
— Failure to report can carry licensure consequences
— Average secondary cases generated per infected person in a fully susceptible population
— R₀ > 1 → outbreak grows; R₀ < 1 → outbreak dies out
— Herd immunity threshold ≈ 1 – (1/R₀)
— Measles R₀ ≈ 12–18 → threshold ≈ 92–94%, hence the ~95% vaccination target for herd immunity
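The threshold formula is easy to evaluate for the measles range quoted above. This Python sketch assumes the simple model behind the formula (homogeneous mixing, fully protective immunity); the function name is illustrative:

```python
def herd_immunity_threshold(r0: float) -> float:
    """Fraction of the population that must be immune: 1 - 1/R0."""
    if r0 <= 1:
        return 0.0  # with R0 <= 1 the outbreak dies out without added immunity
    return 1 - 1 / r0

for r0 in (12, 18):
    print(f"R0 = {r0}: threshold {herd_immunity_threshold(r0):.1%}")
```

Because real vaccines are not perfectly effective, the coverage needed in practice sits somewhat above the computed threshold.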

— Cumulative incidence = proportion (unitless, fixed period)
— Incidence rate = events per person-time (has time in denominator)
— Use rate when follow-up varies; use cumulative incidence when everyone is followed the same period
— Both measure new cases, but attack rate is typically used in outbreaks over a defined short period and reported as a proportion
— Attack rate is essentially cumulative incidence for outbreak settings
— Crude: ignores age structure (misleading for comparisons)
— Age-specific: rate within an age stratum (e.g., 65–74 yrs)
— Age-adjusted: standardized to a reference population for fair comparison
— Mortality rate denominator = whole population
— Case fatality denominator = diseased people
— Proportionate mortality denominator = all deaths
— Same numerator (deaths from disease X), three different stories
— Lifetime risk = cumulative incidence over a lifespan (e.g., 1 in 8 women develop breast cancer in their lifetime)
— Both are observed/expected ratios from indirect standardization
— SIR uses incident cases; SMR uses deaths
— Conceptually similar; hazard is the instantaneous rate at time t, while incidence rate is averaged over an interval

— Point vs period prevalence: point = snapshot; period = anyone affected during the interval
— Prevalence vs incidence proportion: prevalence includes old cases; incidence counts only new cases during the period
— Prevalence vs cumulative incidence: a chronic disease's prevalence can exceed its cumulative incidence over the same period if old cases dominate the prevalent pool
— RR vs OR: RR for cohort/RCT; OR for case-control. When disease is common, OR exaggerates RR (overestimate of effect size)
— OR vs prevalence OR: cross-sectional studies yield prevalence ORs — these reflect existing disease, conflating incidence and duration
— RR vs HR: HR comes from time-to-event analysis; RR from fixed-period comparison
— AR vs RR: AR is absolute (subtraction), RR is relative (division)
— Students read an OR of 4 as "4× the risk", which is incorrect if disease is common
— Correct statement: "4× the odds"; risk ratio is smaller when disease prevalence is high
— 1 – 5-year survival ≠ 5-year mortality unless follow-up is complete (no censoring)
— Better to report median survival and mortality rate separately
— PPV/NPV vary with prevalence
— Sensitivity/specificity do not
— Likelihood ratios are prevalence-independent ways to update pretest probability: LR+ = sens/(1–spec); LR– = (1–sens)/spec
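Likelihood ratios update a pretest probability through odds. A minimal Python sketch, assuming a hypothetical test (90% sensitive, 80% specific) and a 20% pretest probability:

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-)."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity

def post_test_probability(pretest: float, lr: float) -> float:
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)      # hypothetical test
after_positive = post_test_probability(0.20, lr_pos)
after_negative = post_test_probability(0.20, lr_neg)
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.3f}")
print(f"post-test: {after_positive:.1%} if positive, {after_negative:.1%} if negative")
```

The LRs stay fixed for a given test; only the pretest probability (driven by prevalence in the tested population) changes the post-test result.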

— High incidence → primary prevention (vaccines, behavioral interventions reducing exposure)
— High prevalence → secondary prevention (screening to detect early, disease management to reduce complications)
— High case fatality → tertiary prevention (improving treatment, hospice/palliative care, rehab)
— Grade A: high certainty of substantial net benefit; recommend (e.g., colorectal cancer screening at 50–75; ages 45–49 carry a Grade B)
— Grade B: high certainty of moderate net benefit, OR moderate certainty of moderate-to-substantial net benefit; recommend (e.g., AAA screening in men 65–75 who ever smoked)
— Grade C: small net benefit; offer selectively based on individual circumstances
— Grade D: no net benefit, or harms outweigh benefits; do NOT offer (e.g., PSA screening at ≥70)
— Grade I: insufficient evidence
— Choose interventions for high PAR%, not just high RR
— Example: modest BP control RR ≈ 0.7 for stroke, but BP affects most adults → enormous PAR
— Like NNT but for screening — how many people must be screened to prevent one death
— Mammography NNS for breast cancer mortality ≈ 1,000–2,000 over 10 years (age-dependent)
— Quality measures (HEDIS): proportions of patients with diabetes with HbA1c <8%, BP control rates, screening completion rates; all use prevalence-based denominators
— Value-based care payments often tied to risk-adjusted outcome rates (mortality, readmission) to avoid penalizing safety-net hospitals serving sicker populations

— Vital statistics (NCHS): birth and death certificates → mortality, IMR, life expectancy
— NHANES: National Health and Nutrition Examination Survey — prevalence of conditions (HTN, DM, obesity) with exam/lab data
— BRFSS: Behavioral Risk Factor Surveillance System — telephone survey of risk behaviors and self-reported prevalence
— SEER: Surveillance, Epidemiology, and End Results — cancer incidence, prevalence, survival
— NNDSS: National Notifiable Disease Surveillance System — reportable infectious diseases
— MMWR: Morbidity and Mortality Weekly Report — CDC's surveillance bulletin
— Death certificates: complete but with cause-of-death misclassification
— Surveys: self-report bias, non-response bias
— Registries: high-quality but disease-specific
— Rising incidence + stable mortality → improved detection (e.g., thyroid microcarcinomas)
— Stable incidence + falling mortality → better treatment (e.g., HIV after ART)
— Rising incidence + rising mortality → true increase in disease burden
— Falling incidence + falling mortality → successful primary prevention (e.g., gastric cancer in US)
— Period life expectancy: current cross-sectional age-specific mortality applied to a hypothetical cohort
— Sensitive to deaths at young ages (1 infant death "costs" ~80 life-years)
— US life expectancy fell during the opioid epidemic and COVID-19 — both surveillance findings tied to specific causes
— Communicate absolute risks, not just relative risks
— Use natural frequencies ("3 out of 1,000") rather than probabilities (0.3%) — better comprehension

— Reportable diseases override individual confidentiality under public health law — patients cannot opt out
— HIPAA explicitly permits disclosure to public health authorities (45 CFR 164.512)
— Tarasoff-type duties (threats to identified victims) extend to specific infectious disease contexts in some states (e.g., partner notification for HIV in jurisdictions with named-partner requirements)
— Epidemiologic research using identifiable data requires IRB review and usually informed consent
— De-identified data are not protected health information under HIPAA and may be analyzed without individual consent; identifiable surveillance data reach health authorities through the public health exception
— Genetic epidemiology raises unique consent issues — incidental findings, family implications, GINA protections against employment/insurance discrimination
— Reporting incidence/mortality by race, ethnicity, sex, and SES is a Step 3 patient safety priority
— Failure to disaggregate masks disparities (e.g., aggregate "Asian American" data obscures Filipino-American HTN burden)
— Hospital discharge to communities with poor primary care access → measurable readmission risk
— Step 3 vignettes test whether you arrange follow-up appropriate to local resources, not just the textbook ideal
— Public health authorities can compel isolation for certain communicable diseases (e.g., active untreated TB) under state law
— Must use least restrictive means; due process applies
— Patient safety: a physician who knowingly allows an infectious patient to be discharged to a congregate setting without notification creates both clinical and legal liability

— Prevalence ≈ Incidence × Duration (rare disease, steady state)
— RR = [A/(A+B)] / [C/(C+D)]
— OR = (A×D) / (B×C)
— NNT = 1 / ARR
— AR = I_exposed – I_unexposed
— AR% = (RR–1)/RR
— PAR% drives population intervention priority
— Herd immunity threshold = 1 – (1/R₀)
— Sensitivity = TP/(TP+FN); Specificity = TN/(TN+FP)
— PPV = TP/(TP+FP); depends on prevalence
— LR+ = sens/(1–spec); LR– = (1–sens)/spec
— US life expectancy ~ 77 years (post-COVID)
— US infant mortality ~ 5.4 / 1,000 live births
— US maternal mortality ~ 22 / 100,000 live births (high for OECD)
— Leading cause of death US: heart disease, then cancer, then unintentional injuries
— Leading cause of cancer death: lung (both sexes)
— Leading cause of death age 1–44: unintentional injuries
— Colorectal: 45–75, multiple modalities
— Mammography: 40–74, biennial (2024 update)
— Cervical: Pap 21–29 q3y; 30–65 Pap q3y, hrHPV alone q5y, or Pap+HPV co-test q5y
— Lung (LDCT): annual, 50–80, ≥20 pack-years, current smoker or quit within 15 years
— AAA: men 65–75 who ever smoked, one-time ultrasound
— Rare disease → case-control
— Rare exposure → cohort
— Causation proof → RCT
— Genetic/familial → twin or family studies
— Generating hypothesis → cross-sectional or ecologic

— Stem describes cohort followed over time with new cases
— Action: cumulative incidence = new cases / population at risk at start
— Watch for person-time wording → switch to incidence rate
— Cross-sectional snapshot wording: "currently have," "at the time of the survey"
— Action: existing cases / total population at that moment
— Burden of chronic disease in a community → prevalence
— Risk of developing disease → incidence
— How deadly once you have it → case fatality
— How exposure causes disease → RR or OR
— Public health priority → PAR or AR
— Rising survival without falling mortality → lead-time/length-time bias or overdiagnosis
— Rising prevalence with stable incidence → improved survival (longer duration)
— Falling mortality with stable incidence → improved treatment
— Higher crude mortality in State A vs B → age structure difference, use age-adjusted rates
— Calculate attack rates by exposure
— Compare attack rates → identify exposure
— Notify health department (always)
— Calculate sens/spec from 2×2
— PPV/NPV require prevalence — Bayesian thinking
— "Why does the same test perform differently?" → prevalence differs
— Different OR vs RR magnitudes → check disease frequency
— Different incidence rates → check person-time methodology
— Reportable disease → notify health department
— Active TB refuses treatment → court-ordered isolation permitted
— Patient privacy vs public health → public health usually wins for notifiable diseases

Incidence counts new cases over time, prevalence counts existing cases at a moment, mortality counts deaths in a population — and choosing the right denominator and time frame is the entire game of clinical epidemiology.
— Incidence: new cases ÷ population at risk over time
— Prevalence: existing cases ÷ total population at a point
— Mortality: deaths ÷ total population over time (a special incidence)
— Case fatality: deaths ÷ diseased population (a special mortality)
— RCT/cohort → RR, ARR, NNT, incidence
— Case-control → OR (approximates RR if disease rare)
— Cross-sectional → prevalence, prevalence OR
— Time-to-event → HR, Kaplan-Meier, log-rank
— Lead-time and length-time bias inflate survival without changing mortality
— Confounding distorts associations (fix with randomization or regression)
— Selection bias must be prevented at design
— Ecologic fallacy: don't infer individuals from groups

