Issue 14

A Tale of Two Models: What the NHS-Galleri Trial and CISNET Can Teach Us About Building Simulation Models

By Scott Ramsey, MD, PhD

Senior Partner and Chief Medical Officer, Curta

Adjunct Professor at the University of Washington, School of Pharmacy, CHOICE Institute

Professor at the University of Washington, School of Medicine

Many of us are passionate about the roles that decision models can play in healthcare research and policy. At Curta and in our academic lives, we’ve used models to project clinical endpoints, power trials, and explore scenarios that cannot be directly observed. While these are worthy applications, many of us have had experiences where our effort seems to fall into a void, ultimately having little impact on real-world decisions. We’ve written previously about why this can happen (see CoC #13). But what about the counterfactual: when you know that your model will end up influencing some very big decisions? Modelers dream of having such an experience. But we all know that while getting our dream job can end up wonderfully, there are other outcomes that are the stuff of nightmares. Below are two examples of the best of times and the worst of times for modelers and their models.

GRAIL Microsimulation Model and NHS-Galleri Trial

Grail Corporation is a healthcare company focused on multi-cancer early detection (MCED) through blood-based screening. Its flagship product, the Galleri blood test, uses cell-free DNA methylation patterns to detect signals from over 50 cancer types. As Galleri went through preclinical and clinical testing, a team at the company was building a microsimulation model to project the clinical utility of the Galleri test. The model simulated individual cancer natural histories across 23 cancer classes using a stage-transition framework and sojourn time assumptions to estimate how many late-stage diagnoses screening would prevent. Results were calibrated to SEER data.
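The Dai model’s code is not public, but the mechanics of a stage-transition microsimulation are straightforward to sketch. The toy Python below walks simulated cancers through stages I–IV, letting clinical presentation compete with progression within each stage, then asks how often three rounds of annual screening move diagnosis to an earlier stage. Every parameter (the sojourn times, times to symptomatic presentation, and stage-specific sensitivities) is a made-up placeholder for illustration, not an input from the published model.

    import random

    # All numbers below are hypothetical placeholders, not inputs from the
    # published GRAIL model.
    MEAN_SOJOURN = {1: 2.0, 2: 1.5, 3: 1.0, 4: 0.8}     # mean yrs before progressing
    MEAN_TO_CLIN = {1: 8.0, 2: 4.0, 3: 1.5, 4: 0.5}     # mean yrs to symptomatic dx
    SENSITIVITY = {1: 0.20, 2: 0.40, 3: 0.75, 4: 0.90}  # P(positive screen | stage)

    def natural_history(rng):
        """Walk one cancer through stages I-IV; in each stage, clinical
        presentation competes with progression to the next stage.
        Returns the (stage, entry_time) path and the clinical dx time."""
        t, traj = 0.0, []
        for stage in (1, 2, 3, 4):
            traj.append((stage, t))
            to_clin = rng.expovariate(1.0 / MEAN_TO_CLIN[stage])
            if stage == 4:
                return traj, t + to_clin
            to_next = rng.expovariate(1.0 / MEAN_SOJOURN[stage])
            if to_clin < to_next:
                return traj, t + to_clin  # presents before progressing
            t += to_next

    def stage_at(traj, t):
        """Stage occupied t years after preclinical onset."""
        current = traj[0][0]
        for stage, entry in traj:
            if entry <= t:
                current = stage
        return current

    def simulate(n=100_000, interval=1.0, rounds=3, seed=1):
        rng = random.Random(seed)
        late_usual = late_screened = 0
        for _ in range(n):
            traj, clinical_dx = natural_history(rng)
            # Usual care: diagnosed at whatever stage clinical presentation occurs.
            late_usual += stage_at(traj, clinical_dx) >= 3
            # Screening: annual screens at a random phase relative to onset;
            # diagnosis moves to the first positive screen, if any.
            offset = rng.uniform(0.0, interval)
            dx = clinical_dx
            for k in range(rounds):
                t = offset + k * interval
                if t < clinical_dx and rng.random() < SENSITIVITY[stage_at(traj, t)]:
                    dx = t
                    break
            late_screened += stage_at(traj, dx) >= 3
        return late_usual / n, late_screened / n

    usual, screened = simulate()
    print(f"late-stage share at dx: usual {usual:.1%} vs screened {screened:.1%}")

Even at this toy scale, the output’s sensitivity to MEAN_SOJOURN is obvious: stretch the early-stage dwell times and screening looks dramatically better. That is exactly why the hypothetical dwell-time scenarios discussed below were the model’s central vulnerability.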

One of the early products of the model was an estimate of the potential cost-effectiveness of Galleri versus usual care.1 While this was of interest, Grail had much bigger plans for the model. The model projected that three rounds of annual screening would reduce stage III/IV cancer incidence by 9%–24% and cancer-specific mortality by 13%–16%. The stage shift the model predicted from screening was used to inform the sample size and power calculations for the NHS-Galleri trial, a randomized controlled trial of 142,000 participants aged 50–77 in England.2
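The trial’s actual power calculation is not reproduced here, but the link from a modeled stage shift to a trial’s size is simple to illustrate. The sketch below applies the standard normal-approximation sample-size formula for comparing two proportions; the 1% control-arm late-stage risk and the candidate reductions are illustrative placeholders, not the trial’s design inputs.

    from math import sqrt
    from statistics import NormalDist

    def n_per_arm(p_control, rel_reduction, alpha=0.05, power=0.90):
        """Normal-approximation sample size per arm for a two-proportion
        comparison of late-stage cancer incidence."""
        p1 = p_control
        p2 = p_control * (1.0 - rel_reduction)  # incidence under screening
        z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
        z_b = NormalDist().inv_cdf(power)
        p_bar = (p1 + p2) / 2.0
        return (z_a * sqrt(2.0 * p_bar * (1.0 - p_bar))
                + z_b * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))) ** 2 / (p1 - p2) ** 2

    # Illustrative control-arm late-stage risk of 1% over the trial window.
    for rr in (0.10, 0.15, 0.20, 0.25):
        print(f"{rr:.0%} modeled reduction -> ~{n_per_arm(0.01, rr):,.0f} per arm")

The steep nonlinearity is the point: halving the assumed stage shift roughly quadruples the required sample size, so where the design sat within the model’s 9%–24% range had direct consequences for whether a 142,000-person trial would be adequately powered.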

Unfortunately for Grail, the trial did not meet its primary endpoint of a statistically significant reduction in combined Stage III–IV cancers across a prespecified group of 12 deadly cancer types.3 A recent analysis (available online but not yet peer-reviewed4) suggests the model failed to adequately account for real-world diagnostic latency, which in cancer screening parlance is the time between a positive screening signal and confirmatory diagnosis. A Monte Carlo reconstruction of the trial results introduced the concept of “stage slip”: cancers detected by Galleri while biologically early-stage but not confirmed until they had progressed to late stage, owing to NHS diagnostic infrastructure delays averaging 90–120 days. This analysis estimated that roughly 84 cases “slipped” from early to late stage in the intervention arm, substantially diluting the observed stage shift and likely explaining the missed endpoint. Two other factors may have contributed to the model’s over-projection: a) dwell-time scenarios that were hypothetical rather than empirically constrained, with the trial design apparently informed by the more optimistic scenarios; and b) realized in-trial test sensitivity substantially below the values used as model inputs.
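The preprint’s calibrated reconstruction is considerably more elaborate, but the core stage-slip mechanism fits in a few lines. Assuming, purely for illustration, that the residual early-stage dwell time at the moment of a positive screen is exponential with a nine-month mean, and that confirmatory workup takes a uniform 90–120 days, a Monte Carlo estimate of the slip fraction looks like this:

    import random

    def stage_slip_fraction(n=100_000, mean_residual_yrs=0.75,
                            latency_days=(90.0, 120.0), seed=7):
        """Share of screen-detected early-stage cancers that progress to
        late stage before diagnostic workup concludes. The mean residual
        dwell and the latency window are illustrative guesses."""
        rng = random.Random(seed)
        slipped = 0
        for _ in range(n):
            # With an exponential dwell, the remaining early-stage time at a
            # randomly timed positive screen is also exponential.
            residual = rng.expovariate(1.0 / mean_residual_yrs)
            latency = rng.uniform(*latency_days) / 365.25
            slipped += latency > residual
        return slipped / n

    print(f"simulated stage slip: {stage_slip_fraction():.1%}")

Under these made-up parameters, roughly a third of early-stage detections would be recorded as late stage, which is the sense in which diagnostic latency can quietly erase a real stage shift.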

The model’s structure was probably reasonable, and the authors understood its central limitation: the dwell-time scenarios for most of the cancers were hypothetical rather than empirically derived. The task the model was asked to perform was tremendous — no prior screening trial existed for this technology, no natural history data existed for most cancer types, and no empirical sojourn time estimates were available. Aggregating 23 heterogeneous cancer types compounded the uncertainty. As noted above, the modelers had to guess how efficiently and effectively NHS patients with positive Galleri tests would come to diagnostic resolution.

“The task the model was asked to perform was tremendous — no prior screening trial existed for this technology, no natural history data existed for most cancer types, and no empirical sojourn time estimates were available.”

The fallout? Grail’s stock fell by nearly 50% after the announcement. Its path towards FDA regulatory approval as a cancer screening test has become much more complicated. The CEO stepped down.5 Pundits and experts excoriated Grail for the disconnect between its hype and the findings. If the model’s predictions led to a misspecified trial, this was a very expensive mistake.

The CISNET Colorectal Cancer Models and NordICC

A contrasting story comes from the CISNET colorectal cancer consortium. With funding from the National Cancer Institute, three independent modeling groups developed colorectal cancer (CRC) microsimulation models calibrated to US data (SEER incidence, adenoma prevalence from autopsy studies, and survival) with natural history parameters grounded in decades of empirical observation.

These models were externally validated against the UK Flexible Sigmoidoscopy Screening Trial (UKFSS).6 All three models accurately predicted the trial’s 10-year CRC mortality reduction using data that were never part of the calibration. The validation also revealed which model assumptions about adenoma dwell time were most consistent with the observed data, directly improving the models.

The models were then re-validated against 17-year UKFSS follow-up data.7 This second validation, now with substantially longer follow-up, confirmed that all three models predicted relative hazard ratios for CRC incidence and mortality that were reasonably close to observed estimates. Critically, between the 10-year and 17-year validations, one model (CRC-SPIN) had been recalibrated, incorporating UKFSS baseline screening-detection rates into its updated version (CRC-SPIN 2.0). The iterative cycle of validate–recalibrate–revalidate is precisely the mechanism through which models improve.

“Structural independence may confer advantages in external validity that empirical fine-tuning does not.”

The 17-year validation also revealed instructive discrepancies. All three models overpredicted absolute CRC incidence in the control group, likely reflecting the difference between their US calibration population and the UK trial population, as well as unmeasured screening contamination in the control arm. Location-specific hazard ratios diverged substantially across models, suggesting that site-specific natural history processes remained underspecified. And notably, SimCRC, the only model that had not incorporated any UKFSS data into its calibration, produced relative-effect predictions closer to the observed hazard ratios than the models that had. This is a striking finding: structural independence may confer advantages in external validity that empirical fine-tuning does not.

Then came the real test. In 2022, the NordICC trial, the first-ever RCT of colonoscopy screening in average-risk adults, reported results that appeared disappointing: only 18% incidence reduction and 10% mortality reduction.8 Media coverage questioned the value of colonoscopy. The trial investigators themselves noted results were lower than anticipated based on modeling studies.

But were the models actually wrong? Van den Berg and colleagues simulated the NordICC trial conditions using the same three CISNET-CRC models, critically accounting for the trial’s 42% screening uptake and 10-year follow-up.9 Model predictions of 11%–28% incidence reduction and 24%–32% mortality reduction overlapped with the NordICC confidence intervals. The models hadn’t failed: people had been comparing trial results at 42% uptake to model projections that assumed 100% uptake. The models weren’t wrong. The comparison was.
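The CISNET groups re-simulated the full trial, but the uptake point alone is back-of-the-envelope arithmetic. Under the naive assumption that invited non-attenders receive no benefit, the population-level (intention-to-screen) effect is simply the per-attender effect scaled by uptake; the numbers below are illustrative, not NordICC’s estimates.

    def itt_effect(uptake, effect_among_screened):
        """Naive intention-to-screen dilution: invited non-attenders get no
        benefit, so the population-level effect is scaled by uptake."""
        return uptake * effect_among_screened

    # Illustrative numbers only: a 40% incidence reduction among those
    # actually screened shows up as ~17% when only 42% of invitees attend.
    print(f"{itt_effect(0.42, 0.40):.0%}")

This is why reading an 18% intention-to-screen reduction as evidence that colonoscopy barely works is a category error: the trial estimates the effect of an invitation, while naive readings of the models assumed everyone was screened.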

GRAIL vs CISNET: A Fair Comparison?

Of course, there are limits to comparing the Grail and CISNET experiences. The CISNET-CRC models focused on one well-characterized cancer with a known adenoma-to-carcinoma progression sequence, versus Grail addressing 23 cancers, most lacking natural history data. CISNET also had the advantage of sequential external validation against a single RCT at two time points (10 and 17 years) plus an independent RCT (NordICC), versus no prior trial of the technology, and of three independent modeling groups providing structural uncertainty assessment, versus a single team. Comparing the two is a bit like grading a weather forecast for tomorrow against one for next year. Still, the comparison is instructive because it highlights the conditions under which models earn our trust.

Lessons for Model-Building

George Box famously observed that all models are wrong, but some are useful. These two examples point to what makes the difference.

  1. Multi-group involvement provides structural validation. When independent teams with different assumptions reach similar conclusions, the finding is more robust. A single model cannot assess its own structural uncertainty. The 17-year UKFSS validation reinforced this: SimCRC’s fully independent calibration produced relative-effect predictions at least as accurate as those of models that had incorporated trial data, demonstrating that structural diversity is itself a form of evidence.
  2. Iterative external validation builds credibility. The CISNET-CRC models were not validated once; they were validated iteratively. The 10-year UKFSS validation revealed that models with longer adenoma dwell times better fit the data, prompting recalibration. The 17-year validation confirmed that updated models tracked long-term outcomes and identified new discrepancies: location-specific divergence and absolute-rate overprediction that point to the next round of improvements. The NordICC analysis then demonstrated the models’ explanatory power under entirely novel trial conditions. Each validation made the models more trustworthy. The Dai model had no comparable opportunity.
  3. Empirical grounding constrains the possible. The CISNET-CRC models drew on decades of adenoma prevalence, incidence, and screening trial data. When empirical data are unavailable, as was unavoidably the case for MCED screening, projections should be presented as conditional on assumptions that may prove wrong.
  4. Most importantly, models should explain, not just predict. The CISNET-CRC team didn’t just show that their models matched NordICC; they showed why the results looked disappointing: low uptake and limited follow-up, not ineffective screening. The 17-year UKFSS validation similarly diagnosed why absolute rates diverged (US vs. UK calibration, control-group contamination) even as relative effects tracked well. In contrast, the Grail model appears to have overlooked expected variability in the follow-up of abnormal tests. That explanatory power is what makes models genuinely useful for policy. Models are at their best when they tell us what we don’t know, not when they tell us what we want to hear.

“When those elements are in place, models don’t just predict: they explain, they improve, and they earn the trust needed to change policy.”

NHS-Galleri is not an indictment of modeling. The Dai model attempted something extraordinarily ambitious with the best available data. But the contrast with CISNET-CRC reminds us that model credibility is earned through empirical grounding, iterative external validation, and multi-group scrutiny. When those elements are in place, models don’t just predict: they explain, they improve, and they earn the trust needed to change policy. When they are absent, even well-constructed models can mislead.

Final Word

The lesson is not that modeling fails, but that its success depends on how it is done. Curta’s approach is grounded in the same principles that distinguish the most credible and decision-relevant models: empirical discipline where data exist, transparency where they do not, and an iterative cycle of validation and refinement. By bringing together multiple perspectives, stress-testing assumptions against real-world conditions, and focusing on explanatory power rather than point predictions alone, Curta helps ensure that models do not operate in isolation or trade in false precision. Instead, they become tools that clarify uncertainty and ultimately support decisions that matter.

– David Veenstra, PharmD, PhD

Senior Partner, Curta
Professor, University of Washington, School of Pharmacy

– Scott Ramsey, MD, PhD

Senior Partner and Chief Medical Officer, Curta
Professor, Fred Hutch Cancer Center
Professor, University of Washington, School of Medicine

References

  1. Tafazzoli A, Ramsey SD, Shaul A, et al. The Potential Value-Based Price of a Multi-Cancer Early Detection Genomic Blood Test to Complement Current Single Cancer Screening in the USA. Pharmacoeconomics. 2022 Nov;40(11):1107-1117.
  2. Dai JY, Zhang J, Braun JV, et al. Clinical performance and utility: A microsimulation model to inform the design of screening trials for a multi-cancer early detection test. J Med Screen. 2024 Sep;31(3):140-149.
  3. Grail Corporation. Landmark NHS-Galleri Trial Demonstrates a Substantial Reduction in Stage IV Cancer Diagnoses, Increased Stage I and II Detection of Deadly Cancers, and Four-Fold Higher Cancer Detection Rate. https://grail.com/press-releases/landmark-nhs-galleri-trial-demonstrates-a-substantial-reduction-in-stage-iv-cancer-diagnoses-increased-stage-i-and-ii-detection-of-deadly-cancers-and-four-fold-higher-cancer-detection-rate/.
  4. Bellout H. Stage Slip from Diagnostic Latency in MCED Trials: A Calibrated Monte Carlo Reconstruction of the NHS-Galleri Results. medRxiv. 2026. https://www.medrxiv.org/content/10.64898/2026.03.01.26347360v1.full
  5. GRAIL presses on with Galleri test despite missed primary endpoint in pivotal study. The Cancer Letter. March 6, 2026. https://cancerletter.com/news-analysis/20260306_1/.
  6. Rutter CM, Knudsen AB, Marsh TL, et al. Validation of Models Used to Inform Colorectal Cancer Screening Guidelines: Accuracy and Implications. Med Decis Making. 2016 Jul;36(5):604-14.
  7. DeYoreo M, Lansdorp-Vogelaar I, Knudsen AB, et al. Validation of Colorectal Cancer Models on Long-term Outcomes from a Randomized Controlled Trial. Med Decis Making. 2020 Nov;40(8):1034-1040.
  8. Bretthauer M, Løberg M, Wieszczy P, et al; NordICC Study Group. Effect of Colonoscopy Screening on Risks of Colorectal Cancer and Related Death. N Engl J Med. 2022 Oct 27;387(17):1547-1556.
  9. van den Berg DMN, Nascimento de Lima P, Knudsen AB, et al; Cisnet-Colon Group. NordICC Trial Results in Line With Expected Colorectal Cancer Mortality Reduction After Colonoscopy: A Modeling Study. Gastroenterology. 2023 Oct;165(4):1077-1079.e2.