Biostatistics & Population Health
Sensitivity and specificity: clinical interpretation
• Sensitivity (Sn) = P(test positive | disease present) = TP / (TP + FN)
— Property of the test in diseased patients; answers "how good is this test at catching disease?"
— High-Sn tests have few false negatives → a negative result rules out ("SnNout")
• Specificity (Sp) = P(test negative | disease absent) = TN / (TN + FP)
— Property of the test in non-diseased patients
— High-Sp tests have few false positives → a positive result rules in ("SpPin")
• When this matters on Step 3:
— Choosing a screening vs confirmatory test (HIV 4th-gen Ag/Ab → confirmatory differentiation assay)
— Interpreting an unexpected result in a low- or high-prevalence setting
— Counseling patients about false reassurance or false alarms
— Quality improvement, USPSTF grade rationale, public-health screening programs
• Key distinction: Sn and Sp are intrinsic to the test and (classically) do not change with disease prevalence. What changes with prevalence is PPV/NPV. Step 3 stems love to swap these.
• Clinical scenario triggers to suspect a stats question:
— A 2×2 table appears in the stem
— Words like "screening," "cutoff value," "ROC curve," "rule in/rule out"
— A patient asks "Doctor, what does this positive result mean for me?" → that is almost always a PPV question, not a sensitivity question
• Board pearl: If the question gives you the test result first and asks "what is the probability the patient has disease?", you need predictive values, not Sn/Sp. If it gives you disease status first and asks about test behavior, you need Sn/Sp. The direction of the conditional probability is the entire trick.
• Step 3 also tests pre-test probability reasoning: order tests where Sn/Sp meaningfully shifts post-test probability, not reflexively.

— The 2×2 table stem: A study of 1,000 patients yields TP/FP/FN/TN counts; calculate Sn, Sp, PPV, or NPV.
— The cutoff-shift stem: "Investigators lower the BNP cutoff from 100 to 50 pg/mL." → Sn ↑, Sp ↓, FN ↓, FP ↑.
— The prevalence-shift stem: Same test deployed in a low-prevalence (general population) vs high-prevalence (cardiology clinic) setting; ask about PPV/NPV change.
— The clinical decision stem: Which test should be ordered first — a sensitive screen or a specific confirmatory?
— Favor a sensitive screening test first when:
— Asymptomatic patient, routine visit, USPSTF-recommended interval
— Disease where missing cases is catastrophic (HIV, syphilis, TB contact, neonatal PKU)
— Downstream confirmatory test is available and tolerable
— Favor a specific confirmatory test when:
— Positive screen already in hand
— Treatment is toxic, invasive, or stigmatizing (chemotherapy, lifelong antiretrovirals, prophylactic mastectomy)
— Labeling harm is high (Lyme serology in low-prevalence area)

— Absence of calf swelling in suspected DVT (Homans is poor; calf asymmetry >3 cm is more useful)
— Absence of fever, tachycardia, leukocytosis lowers likelihood of bacteremia but does not exclude it
— Normal mental status in suspected meningitis lowers but doesn't exclude (Sn of classic triad ~44%)
— Kernig and Brudzinski signs: low Sn (~5%) but high Sp (~95%) for meningitis → presence is meaningful, absence is not
— Murphy sign for acute cholecystitis: moderate Sn, high Sp
— Janeway lesions, Osler nodes, Roth spots in endocarditis: low Sn but near-pathognomonic
— S3 gallop for heart failure: Sn ~13%, Sp >90% — finding it nearly clinches volume overload
— Example: S3 gallop LR+ ≈ 11 for HF; a single finding shifts post-test probability dramatically.

```
Disease+ Disease−
Test+ TP FP
Test− FN TN
```
— Sensitivity = TP / (TP + FN) — column-based, disease+ column
— Specificity = TN / (TN + FP) — column-based, disease− column
— PPV = TP / (TP + FP) — row-based, test+ row
— NPV = TN / (TN + FN) — row-based, test− row
— Prevalence = (TP + FN) / total
— Accuracy = (TP + TN) / total — rarely the answer they want
— Worked example at 10% prevalence (n = 1,000; Sn 90%, Sp 90%): TP = 90, FN = 10, FP = 90, TN = 810
— PPV = 90/(90+90) = 50% (half of positives are false alarms)
— NPV = 810/(810+10) = 98.8%
— Same test at 1% prevalence (n = 1,000; Sn 90%, Sp 90%): TP = 9, FN = 1, FP = 99, TN = 891
— PPV = 9/108 = 8.3% (the test is the same, but positives are mostly false)
— NPV = 891/892 = 99.9%
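The formulas and the two sets of counts above can be checked with a short Python sketch. The function name `two_by_two` and the dictionary keys are illustrative choices, not part of the source; the counts are the ones given in the worked examples.

```python
def two_by_two(tp, fn, fp, tn):
    """Return the standard test metrics from the four cells of a 2x2 table."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),   # disease+ column
        "specificity": tn / (tn + fp),   # disease- column
        "ppv": tp / (tp + fp),           # test+ row
        "npv": tn / (tn + fn),           # test- row
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
    }

# Counts from the two worked examples in the text
higher_prev = two_by_two(tp=90, fn=10, fp=90, tn=810)
lower_prev = two_by_two(tp=9, fn=1, fp=99, tn=891)

for label, metrics in (("first example", higher_prev), ("second example", lower_prev)):
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

Sensitivity and specificity come out identical for both sets of counts, while PPV drops from 0.5 to about 0.083; only prevalence differs between the two tables.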

— Area under curve (AUC): 0.5 = useless (coin flip); 1.0 = perfect discrimination
— AUC 0.7–0.8 acceptable; 0.8–0.9 excellent; >0.9 outstanding
— A curve hugging the upper-left corner = best test
— Lower cutoff (e.g., troponin 0.01 instead of 0.04) → more positives → ↑Sn, ↓Sp, ↑FP, ↓FN
— Higher cutoff → ↑Sp, ↓Sn, ↓FP, ↑FN
— Choose low cutoff for screening (don't miss disease); high cutoff for confirmation (don't overtreat)
— LR+ = Sn / (1 − Sp) — multiply pre-test odds by this when test is positive
— LR− = (1 − Sn) / Sp — multiply pre-test odds by this when test is negative
— LR+ >10 or LR− <0.1 → large, often conclusive shift
— LR+ 5–10 or LR− 0.1–0.2 → moderate shift
— LR ~1 → test is useless
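The likelihood-ratio arithmetic above (convert probability to odds, multiply by the LR, convert back) can be sketched as follows. The function names are illustrative, and the Sn/Sp of 90% and pre-test probability of 10% are chosen only so the result can be compared with the 2×2 counts TP = 90, FN = 10, FP = 90, TN = 810.

```python
def likelihood_ratios(sn, sp):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    return sn / (1 - sp), (1 - sn) / sp

def post_test_probability(pretest, lr):
    """Probability -> odds, apply the likelihood ratio, odds -> probability."""
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.90, 0.90)
print("LR+ =", round(lr_pos, 2), "LR- =", round(lr_neg, 2))
print("after a positive test:", round(post_test_probability(0.10, lr_pos), 3))
print("after a negative test:", round(post_test_probability(0.10, lr_neg), 3))
```

With a 10% pre-test probability, a positive result raises the probability to about 50% and a negative result lowers it to about 1.2%, which agrees with the PPV (50%) and 1 − NPV (1.2%) computed from that table.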

— Population: asymptomatic or low pre-test probability
— Goal: don't miss disease (minimize FN)
— Examples: HIV 4th-gen Ag/Ab combo (Sn ~99.9%), mammography, fecal immunochemical test (FIT), PPD/IGRA, rapid strep antigen (paired with culture in kids if negative)
— A negative result is reassuring; a positive result requires confirmation
— Population: already screened positive or high pre-test probability
— Goal: don't falsely label (minimize FP)
— Examples: HIV-1/2 differentiation immunoassay (replaces Western blot since 2014), treponemal-specific test after positive RPR (or reverse-sequence), colonoscopy after positive FIT, diagnostic mammogram/ultrasound followed by tissue biopsy after a positive screening mammogram
— Two tests in series, both must be positive → ↑Sp, ↓Sn (HIV algorithm)
— Two tests in parallel, either positive counts → ↑Sn, ↓Sp (trauma rule-outs)
— Below test threshold → don't test (harms > benefits, low PPV)
— Between test and treatment thresholds → test
— Above treatment threshold → treat empirically (e.g., classic angina + STEMI EKG → don't wait for troponin to revascularize)
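The series/parallel rule above can be made concrete with a small sketch. It assumes the two tests err independently given true disease status, which real test pairs often violate, so treat the numbers as an upper bound on the benefit. The function names and the Sn/Sp values are hypothetical.

```python
def combine_series(sn1, sp1, sn2, sp2):
    """Call positive only if BOTH tests are positive (independence assumed)."""
    sn = sn1 * sn2                      # must be caught by both tests
    sp = 1 - (1 - sp1) * (1 - sp2)      # false positive only if both err
    return sn, sp

def combine_parallel(sn1, sp1, sn2, sp2):
    """Call positive if EITHER test is positive (independence assumed)."""
    sn = 1 - (1 - sn1) * (1 - sn2)      # missed only if both tests miss
    sp = sp1 * sp2                      # must be negative on both tests
    return sn, sp

# Hypothetical test characteristics for illustration only
series_sn, series_sp = combine_series(0.90, 0.80, 0.80, 0.90)
parallel_sn, parallel_sp = combine_parallel(0.90, 0.80, 0.80, 0.90)
print("series:  ", round(series_sn, 2), round(series_sp, 2))
print("parallel:", round(parallel_sn, 2), round(parallel_sp, 2))
```

In this example series testing gives Sn 0.72 and Sp 0.98, while parallel testing gives Sn 0.98 and Sp 0.72, the same trade-off stated in the bullets.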

— Positive blood culture × 2 with typical organism + clinical syndrome → empiric endocarditis therapy before echo
— High-probability 4Ts score + positive PF4 ELISA → stop heparin, start argatroban; do not wait for the functional assay (serotonin release assay) when pre-test probability is high
— STEMI on ECG → activate cath lab; do not wait for troponin (treatment threshold exceeded)
— Positive rapid strep in an adult with a low Centor score: still treat (Sp ~95%); if the rapid test is negative in an adult, do not treat on clinical impression alone
— Positive ANA in asymptomatic patient — Sn ~95% for SLE but Sp poor; don't start hydroxychloroquine based on titer alone
— Positive Lyme ELISA in non-endemic area — requires Western blot confirmation before doxycycline course
— When pre-test probability exceeds the treatment threshold, testing adds little — treat (e.g., empiric antibiotics in septic shock pending cultures)
— When pre-test probability is below the test threshold, neither test nor treat (e.g., D-dimer in a patient with PERC-negative chest pain)
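The threshold model in the last two bullets reduces to a three-way comparison. The sketch below is a minimal illustration; the function name and the threshold values (5% and 80%) are hypothetical placeholders, since real thresholds depend on the disease, the test, and the harms of treatment.

```python
def threshold_decision(pretest, test_threshold, treat_threshold):
    """Map a pre-test probability onto the test/treat threshold model."""
    if not 0 <= test_threshold <= treat_threshold <= 1:
        raise ValueError("thresholds must satisfy 0 <= test <= treat <= 1")
    if pretest < test_threshold:
        return "neither test nor treat"
    if pretest > treat_threshold:
        return "treat empirically"
    return "test"

# Hypothetical thresholds for illustration only
for p in (0.02, 0.30, 0.95):
    print(p, "->", threshold_decision(p, test_threshold=0.05, treat_threshold=0.80))
```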

— Step 1: 4th-gen HIV-1/2 Ag/Ab combo immunoassay (high Sn) — screens
— Step 2 (if reactive): HIV-1/2 antibody differentiation immunoassay (high Sp) — confirms and types
— Step 3 (if differentiation negative/indeterminate): HIV-1 RNA NAT — detects acute infection (window period)
— Traditional: nontreponemal (RPR/VDRL) → treponemal confirmation
— Reverse-sequence: treponemal EIA → RPR; if discordant, second treponemal test (TP-PA)
— Wells score → low: PERC or D-dimer (high Sn) → if negative, stop
— Moderate/high: CT pulmonary angiography (high Sp confirmatory)
— Age-adjusted D-dimer in patients >50 yo: cutoff = age × 10 ng/mL → ↑Sp without losing Sn
— ECG (high Sp for STEMI, lower Sn for NSTEMI) → high-sensitivity troponin serial → stress/CT angio for intermediate risk
— IGRA or PPD (screen, high Sn) → CXR + sputum AFB × 3 + NAAT (confirm active disease, higher Sp)
— Screening mammogram → diagnostic mammogram + US → core needle biopsy (gold standard, near 100% Sp)

— D-dimer Sp falls sharply with age (baseline elevation from age, comorbidity, inflammation). Solution: age-adjusted cutoff (age × 10 ng/mL for patients >50) restores Sp without sacrificing Sn.
— BNP/NT-proBNP rise with age and falling GFR → ↓Sp for HF; use age- and renal-adjusted cutoffs (NT-proBNP >450 if <50 yo, >900 if 50–75, >1800 if >75).
— Troponin baseline elevated in CKD → ↓Sp for ACS; rely on delta (change over 1–3 hours) rather than absolute value.
— Pneumonia presentation atypical (afebrile, AMS only) → physical exam Sn drops; lower threshold to image.
— Creatinine-based eGFR loses Sn for early CKD (in sarcopenia, low muscle mass lowers serum creatinine, so eGFR overestimates true GFR); use cystatin C or measured creatinine clearance when ambiguous.
— Contrast-enhanced CT harms in CKD → choose V/Q scan for PE if eGFR <30, accepting indeterminate results more often.
— Gadolinium restricted in eGFR <30 (NSF risk with linear agents); use macrocyclic agents cautiously or alternative imaging.
— AFP for HCC has poor Sn (~60%) and Sp (elevated in cirrhosis, pregnancy) — pair with ultrasound q6 months per AASLD guidelines; positive imaging triggers MRI/CT with LI-RADS.
— INR loses utility as a coagulation marker in cirrhosis (rebalanced hemostasis) — does not predict bleeding risk.
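The age-adjusted cutoffs quoted above are simple piecewise rules, sketched here in Python. Two things are assumptions rather than statements from the text: the conventional D-dimer cutoff of 500 ng/mL used for patients aged 50 or younger, and pg/mL as the NT-proBNP unit. The function names are illustrative.

```python
def d_dimer_cutoff(age_years, standard_cutoff=500.0):
    """Age-adjusted D-dimer cutoff in ng/mL: age x 10 for patients over 50.

    standard_cutoff (500 ng/mL) is an assumed conventional value for
    patients aged 50 or younger; it is not stated in the text.
    """
    return age_years * 10.0 if age_years > 50 else standard_cutoff

def nt_probnp_cutoff(age_years):
    """Age-stratified NT-proBNP cutoff from the text (unit assumed pg/mL)."""
    if age_years < 50:
        return 450
    if age_years <= 75:
        return 900
    return 1800

for age in (40, 62, 80):
    print(age, d_dimer_cutoff(age), nt_probnp_cutoff(age))
```

A result is then read against the returned cutoff, for example a D-dimer of 700 ng/mL in an 80-year-old falls below the adjusted cutoff of 800 ng/mL.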

— D-dimer physiologically elevated → poor Sp for VTE; use CT pulmonary angiography with abdominal shielding or V/Q (lower maternal breast dose; fetal dose is low with either) rather than relying on D-dimer.
— Pregnancy serum screening:
— First-trimester combined screen (PAPP-A, β-hCG, nuchal translucency): Sn ~85% for Down syndrome
— Quad screen (AFP, β-hCG, estriol, inhibin A): Sn ~80%
— Cell-free DNA (cfDNA/NIPT): Sn >99%, Sp >99% for trisomy 21 — but in low-prevalence populations (low-risk women), PPV is only ~50–80%. Positive cfDNA still requires diagnostic amniocentesis/CVS (karyotype = gold standard).
— GBS culture at 36–37 weeks: Sn moderate; positive → intrapartum penicillin prophylaxis.
— Newborn metabolic screen uses very-high-Sn tests (PKU, congenital hypothyroidism, sickle cell, CF) — accepts low PPV because miss is catastrophic. All positives confirmed with specific assays before treatment.
— Rapid strep in children: Sn ~85%, Sp ~95% → negative requires back-up throat culture (different from adults).
— Bilirubin nomograms rather than single cutoffs to interpret jaundice risk.
— Pediatric appendicitis scores (Alvarado, PAS) stratify before imaging — US first to spare radiation, CT if equivocal.
— Wilson-Jungner criteria: disease must be common enough, detectable in latent phase, treatable, with acceptable test characteristics
— USPSTF grade D = recommends against (net harm; often low PPV scenarios)
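The cfDNA point above (Sn and Sp both above 99%, yet PPV of only about 50–80% in low-risk women) follows directly from Bayes' rule. The sketch below assumes Sn = 99% and Sp = 99.9% and three illustrative prevalences; these specific values are assumptions chosen to reproduce the quoted range, not figures from the text.

```python
def ppv(sn, sp, prevalence):
    """Positive predictive value from Sn, Sp, and prevalence (Bayes' rule)."""
    true_pos = sn * prevalence
    false_pos = (1 - sp) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(sn, sp, prevalence):
    """Negative predictive value from Sn, Sp, and prevalence (Bayes' rule)."""
    true_neg = sp * (1 - prevalence)
    false_neg = (1 - sn) * prevalence
    return true_neg / (true_neg + false_neg)

# Assumed test characteristics and prevalences, for illustration only
for prevalence in (1 / 1000, 1 / 250, 1 / 50):
    print(
        f"prevalence {prevalence:.4f}: "
        f"PPV {ppv(0.99, 0.999, prevalence):.3f}, "
        f"NPV {npv(0.99, 0.999, prevalence):.5f}"
    )
```

Under these assumptions PPV is roughly 0.50 at a prevalence of 1 in 1,000 and roughly 0.80 at 1 in 250, while NPV stays above 0.999 throughout, which is why a positive screen still needs diagnostic confirmation.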

— Unnecessary procedures: biopsy bleeding, anesthesia risk, perforation
— Overdiagnosis: indolent disease treated aggressively (thyroid microcarcinoma, low-grade prostate cancer)
— Psychological harm: anxiety, depression, labeling
— Financial harm: out-of-pocket costs, insurance implications
— Cascade testing: one false positive triggers a chain of confirmatory studies with their own harms
— Delayed diagnosis: missed PE → death; missed cancer → stage progression
— False reassurance: patient and clinician dismiss subsequent symptoms
— Liability exposure: malpractice claims frequently allege failure to diagnose
— Treating off a screening test alone (e.g., empiric SLE therapy off positive ANA) → wrong-disease toxicity
— Skipping confirmatory test (HIV differentiation, breast biopsy) → labeling and treatment errors
— PSA screening in elderly men: high false-positive rate → biopsy complications (sepsis, bleeding) for cancers that would never cause harm
— Lung cancer screening LDCT in low-risk patients (outside USPSTF criteria): high false-positive nodule rate → CT follow-ups, biopsies, pneumothorax
— Whole-body MRI/total-body CT marketed to consumers: incidentalomas in 30–40% → cascade workups

— Positive screen, negative confirmatory: likely false-positive screen; counsel and follow standard surveillance
— Negative screen, high clinical suspicion: trust the suspicion and order a more sensitive or gold-standard test (e.g., CT angiography after negative D-dimer in high-pre-test-probability PE)
— Indeterminate/equivocal result: reflexive next-test or specialist input
— Pathology second opinion: discordant cytology vs core biopsy, atypical findings
— Infectious disease: indeterminate HIV differentiation, discordant hepatitis serologies
— Genetics: positive cfDNA → MFM/genetics counseling before invasive testing
— Hematology: discordant HIT antibody assays (positive ELISA, pending SRA in high pre-test probability) — empiric non-heparin anticoagulant while awaiting
— Patient with high pre-test probability of life-threatening disease and a negative screening test → admit for observation and definitive testing rather than discharge (e.g., chest pain with negative initial troponin but HEART score 4–6)
— Serial testing (troponin at 0/1/3h, repeat US for appendicitis at 6h) leverages temporal change to overcome single-test limitations

— Sn = "of diseased, how many test positive"
— PPV = "of test-positive, how many are diseased"
— Same numerator (TP), different denominators — Step 3 swap.
— Sp = "of non-diseased, how many test negative"
— NPV = "of test-negative, how many are truly disease-free"
— Prevalence = existing cases / population at a point in time (cross-sectional, drives PPV/NPV)
— Incidence = new cases / person-time (longitudinal, drives risk and cohort study results)
— Accuracy = (TP+TN)/total. Misleading when disease is rare — a test that always says "negative" has high accuracy in low prevalence but Sn = 0.
— Validity = accuracy (does it measure truth?); Sn and Sp are measures of validity
— Reliability = reproducibility (does it give the same answer twice?)
— A test can be reliable but invalid (consistently wrong).
— LRs operate on test results updating pre-test probability
— ORs operate on exposure–outcome association in case-control studies
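— α (type I error, a false-positive conclusion; analogous to 1 − Sp) is set by the significance threshold (typically 0.05)
— β (type II error, a false-negative conclusion; analogous to 1 − Sn) determines power (1 − β; typically 80%, i.e., β = 0.20)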
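The accuracy pitfall noted above (a test that always reads negative looks accurate when disease is rare) is easy to verify numerically. The sketch below uses a hypothetical population of 1,000 with 1% prevalence; the function name is illustrative.

```python
def accuracy_and_sensitivity(tp, fn, fp, tn):
    """Return (accuracy, sensitivity) from the four cells of a 2x2 table."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity

# A "test" that always reads negative: 10 diseased are all missed,
# 990 non-diseased are all correctly negative (hypothetical numbers).
accuracy, sensitivity = accuracy_and_sensitivity(tp=0, fn=10, fp=0, tn=990)
print("accuracy:", accuracy, "sensitivity:", sensitivity)
```

Accuracy is 0.99 even though sensitivity is 0, which is why accuracy is rarely the answer the stem wants.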

— Spectrum bias: Sn/Sp measured in a population with severe, classic disease appear better than performance in the early/mild disease seen in clinic.
— Example: A troponin assay validated on transmural MI patients overstates Sn for unstable angina.
— Verification (workup) bias: only patients with positive screening tests receive the gold-standard confirmation → inflates apparent Sn and lowers apparent Sp.
— Mitigation: study designs that apply the gold standard to all participants regardless of screening result.
— Lead-time bias: screening detects disease earlier; apparent survival from diagnosis lengthens without changing date of death → false impression of benefit.
— Length-time bias: screening preferentially detects slow-growing, indolent cases; aggressive cases arise between screens and present clinically → screened population appears to have better outcomes (overdiagnosis is the extreme case).
— Selection (volunteer) bias: volunteers for screening differ from the general population (healthier, more health-conscious: the "healthy volunteer effect").
— Incorporation bias: the test being evaluated is part of the gold-standard definition → falsely inflates Sn/Sp.
— Sicker patients get more tests → biases observational comparisons of test utility.
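The effect of earlier detection on survival statistics can be illustrated with simple arithmetic. In the sketch below all ages are hypothetical: the same patient dies at 70 whether the disease is found by screening at 63 or by symptoms at 67.

```python
def survival_from_diagnosis(age_at_diagnosis, age_at_death):
    """Years lived after diagnosis; the date of death is not affected."""
    return age_at_death - age_at_diagnosis

AGE_AT_DEATH = 70            # hypothetical, identical in both scenarios
clinical_dx_age = 67         # hypothetical: diagnosed when symptoms appear
screen_dx_age = 63           # hypothetical: diagnosed earlier by screening

clinical_survival = survival_from_diagnosis(clinical_dx_age, AGE_AT_DEATH)
screened_survival = survival_from_diagnosis(screen_dx_age, AGE_AT_DEATH)
lead_time = clinical_dx_age - screen_dx_age

print("survival after clinical diagnosis:", clinical_survival, "years")
print("survival after screen detection:", screened_survival, "years")
print("lead time:", lead_time, "years")
print("counted as 5-year survivor (clinical, screened):",
      clinical_survival >= 5, screened_survival >= 5)
```

The screened patient counts as a 5-year survivor and the clinically diagnosed patient does not, although both die at the same age. The whole difference is the 4-year lead time, so mortality rather than survival from diagnosis is the outcome to compare.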

— HCC in cirrhosis: US ± AFP every 6 months (Sn ~60–80%, accepts limitations); positive triggers MRI/CT with LI-RADS
— Colon polyp surveillance: interval colonoscopy based on polyp pathology — high-risk adenomas at 3 years, hyperplastic at 10
— Breast cancer survivors: annual mammography ± MRI in high-risk (BRCA, dense breasts); MRI added because of higher Sn in dense tissue
— HbA1c for diabetes monitoring (Sn moderate for hyperglycemia; affected by hemoglobinopathies, transfusion, anemia)
— TSH for thyroid disease follow-up (highly sensitive for primary thyroid disease; not for central disease)
— BNP/NT-proBNP to track HF — best used as trend rather than absolute
— Reconcile baseline values for future comparison
— Schedule interval-specific surveillance (e.g., post-MI lipid recheck at 4–12 weeks, repeat echo at 3 months for cardiomyopathy reversibility)
— Patient education on what new symptoms warrant re-testing vs reassurance

— Explain PPV in plain language: "Of 100 women your age with this positive screen, about X actually have cancer."
— Avoid: "Your test is positive" without context — patients hear "I have the disease."
— Use absolute numbers over relative risks; pictographs (icon arrays) improve comprehension.
— Negative routine screen → return at standard interval (mammography q1–2 years, Pap per ASCCP, colon per polyp risk)
— Positive screen → confirmatory test within an evidence-based timeframe (typically 2–6 weeks; sooner for cancer)
— Indeterminate result → repeat or reflex testing at a defined interval (e.g., ASC-US Pap → reflex HPV; repeat FNA or molecular testing for Bethesda III thyroid nodules)
— USPSTF Grade C recommendations (e.g., PSA 55–69, aspirin for primary prevention) require shared decisions acknowledging the borderline PPV/benefit balance
— Document the conversation, patient values, and chosen path
— Don't stop surveillance; schedule interval re-testing appropriate to disease tempo
— Counsel "what symptoms should bring you back early"
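The plain-language PPV framing recommended above ("Of 100 women your age with this positive screen, about X actually have cancer") can be generated from Sn, Sp, and prevalence using whole-number natural frequencies. This is a minimal sketch; the function name, the wording of the sentence, and the example inputs (Sn 90%, Sp 90%, prevalence 1%) are illustrative assumptions.

```python
def natural_frequency_statement(sn, sp, prevalence, n=1000):
    """Build a counseling sentence from whole-number (rounded) counts."""
    diseased = round(n * prevalence)
    healthy = n - diseased
    true_pos = round(diseased * sn)
    false_pos = round(healthy * (1 - sp))
    positives = true_pos + false_pos
    if positives == 0:
        return f"Of {n} people tested, none would be expected to test positive."
    return (
        f"Of {n} people tested, about {positives} will have a positive result; "
        f"about {true_pos} of those {positives} actually have the condition."
    )

statement = natural_frequency_statement(sn=0.90, sp=0.90, prevalence=0.01)
print(statement)
```

Counts are rounded to whole people on purpose, since absolute numbers are easier for patients to interpret than percentages or relative risks.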

— Genetic testing (BRCA, Lynch, expanded carrier screens): discuss PPV, implications for relatives, GINA protections for employment/insurance (but not life/disability/long-term care insurance — a Step 3 nuance)
— Direct-to-consumer testing (23andMe BRCA): variants reported are limited; negative DTC result does not exclude BRCA mutation — patients must understand
— HIV testing: opt-out consent acceptable in healthcare settings (CDC), but never coerce; pre-test counseling about implications still required for some states/populations
— Reportable conditions (HIV, syphilis, TB, gonorrhea, certain foodborne illnesses) require reporting after confirmed diagnosis, not screening positive alone
— Reporting suspected child/elder abuse: clinical suspicion alone triggers reporting; you don't need a "positive test"
— Pending test results at discharge: studies show 30–40% of inpatient discharges have pending labs; structured handoff (explicit list, responsible clinician, follow-up plan) is the standard of care
— Critical positive results (incidental lung nodule on CT, abnormal mammogram) require closed-loop communication — confirm receipt, document follow-up
— Lab mix-up or false-positive that led to procedure → full disclosure per AMA ethics; apology and corrective action improve trust and reduce litigation
— Quality-improvement reporting to lab/system to prevent recurrence
— Screening recommendations should not perpetuate disparities; recognize when test characteristics differ by population (e.g., race-based eGFR equations now revised; pulse oximetry less accurate in dark skin)


— Stem gives a 2×2 table or four numbers; asks for Sn, Sp, PPV, or NPV.
— Trap: swapping row and column denominators. Anchor: Sn/Sp use column totals (disease status); PPV/NPV use row totals (test result).
— "Investigators lower the cutoff from X to Y. How does this affect Sn/Sp/FP/FN?"
— Answer logic: lower cutoff → more test-positives → ↑Sn, ↓Sp, ↑FP, ↓FN.
— Same test in low- vs high-prevalence population; asks about PPV/NPV.
— Answer: PPV ↑ with prevalence; NPV ↓ with prevalence; Sn/Sp unchanged.
— Asymptomatic patient with positive screening test; asks next step.
— Answer: confirmatory test (higher Sp), not treatment.
— Same test result, two patients with different pre-test probabilities; asks about post-test probability or management.
— Answer: high pre-test probability + negative test ≠ ruled out; low pre-test probability + positive test ≠ ruled in (consider false positive).
— Screening trial shows "improved 5-year survival" without mortality benefit; asks for explanation.
— Answer: lead-time and/or length-time bias.
— Gives Sn and Sp; asks for LR+ or LR−.
— Answer: LR+ = Sn/(1−Sp); LR− = (1−Sn)/Sp.
— Patient asks "what does this positive result mean?"
— Answer: PPV explanation, not Sn — match the conditional probability to the patient's question.
— Test result pending at discharge; asks next step.
— Answer: closed-loop communication to PCP with documented follow-up.
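The cutoff-shift logic in the stems above can be demonstrated on a toy dataset. The marker values below are invented purely for illustration (they are not patient data or assay reference values), a result counts as positive when it is at or above the cutoff, and the function name is illustrative.

```python
def sn_sp_at_cutoff(diseased_values, healthy_values, cutoff):
    """Classify values >= cutoff as positive and tabulate the 2x2 cells."""
    tp = sum(value >= cutoff for value in diseased_values)
    fn = len(diseased_values) - tp
    fp = sum(value >= cutoff for value in healthy_values)
    tn = len(healthy_values) - fp
    return {"sn": tp / (tp + fn), "sp": tn / (tn + fp), "fp": fp, "fn": fn}

# Invented marker values for 10 diseased and 10 non-diseased patients
diseased = [40, 70, 120, 150, 300, 450, 600, 800, 900, 1200]
healthy = [10, 15, 20, 30, 45, 60, 80, 95, 110, 160]

at_100 = sn_sp_at_cutoff(diseased, healthy, cutoff=100)
at_50 = sn_sp_at_cutoff(diseased, healthy, cutoff=50)
print("cutoff 100:", at_100)
print("cutoff 50: ", at_50)
```

Lowering the cutoff from 100 to 50 raises Sn from 0.8 to 0.9 and drops Sp from 0.8 to 0.5, with false positives rising from 2 to 5 and false negatives falling from 2 to 1.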

Sensitivity and specificity are intrinsic properties of a diagnostic test that — combined with the patient's pre-test probability — drive Bayesian updates of post-test probability, while PPV and NPV translate those properties into the actionable clinical question the patient is actually asking.

