Great framework, Byron. One dimension I’m thinking about that deserves explicit inclusion: appropriate non-use.
MedEval evaluates how well AI performs when consulted, but doesn’t address whether the consultation should have happened at all. In practice, the biggest failure mode may not be a wrong diagnosis but a system that encourages overconsumption of care. Health anxiety scales differently when you give people a frictionless 24/7 medical chatbot.
Should MedEval include metrics for how well a system closes conversations, redirects to real-world care, confidently recommends doing nothing, or reduces its own utilization over time?
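For concreteness, a rough sketch of what such metrics might look like computed over conversation logs. The schema here (an `outcome` label, a `user_id`, the categories, the repeat-use threshold) is entirely hypothetical on my part, not anything MedEval defines:

```python
from collections import Counter

def non_use_metrics(conversations):
    """Summarize how often the system closes, redirects, or de-escalates care."""
    outcomes = Counter(c["outcome"] for c in conversations)
    per_user = Counter(c["user_id"] for c in conversations)
    n = len(conversations)
    return {
        "closure_rate": outcomes["closed"] / n,         # conversation ended cleanly, no open loop
        "referral_rate": outcomes["referred"] / n,      # redirected to real-world care
        "reassurance_rate": outcomes["reassured"] / n,  # confidently recommended doing nothing
        "heavy_user_share": sum(1 for v in per_user.values() if v > 3) / len(per_user),
    }
```

The last figure is only a crude proxy for whether the system reduces its own utilization; a real version would track per-user utilization trends over time rather than a single repeat-use cutoff.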
I absolutely agree with this. It’s a great way to assess whether guardrails are performing as designed and to make improvements.
Very interesting as always. This stuck out: "other factors become more nuanced to “measure” per se versus observe - especially complex, cascading failure modes and nuanced situational deviations." It's worth thinking about how the greatest impact could come from a tiny error that cascades into catastrophe.
Byron, this is a vital shift. Moving from static benchmarks to "dynamic improv" via MedEval is exactly the evolution clinical AI needs.
However, as a clinician-coder using these tools in high-risk pregnancy, I see a critical missing layer: Serial Assessment (Consistency). The most significant "turn-off" for physicians today is the stochastic inconsistency of LLMs. We’ve all seen an AI suggest an "off" management plan, only for a "refresh" of the same data to yield the correct trajectory. In clinical practice, if the AI is a moving target, it’s a trust-breaker.
Why MedEval needs a "Consistency" metric:
The Reliability Tax: If a physician has to prompt twice to get the "right" version of the AI’s logic, the tool has failed.
Uncertainty Signaling: High variance in outputs for the same "AI Patient" is a failure mode in itself—it signals that the model is "guessing" rather than reasoning from ground truth.
Care Worthiness: True clinical utility requires not just a "good" answer, but a reproducible one.
If we are moving from "static to dynamic," we must ensure we don't move from "reliable to erratic." Testing for output stability on identical data should be a core component of the MedEval framework.
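For what it's worth, here is a minimal sketch of what such a stability check could look like, assuming an abstract `model(case)` callable that returns a management plan as text. The interface and metric names are my own assumptions, not part of MedEval:

```python
from collections import Counter

def consistency_score(model, case, n_runs=10):
    """Re-run the same 'AI Patient' case and measure how stable the output is."""
    outputs = [model(case) for _ in range(n_runs)]
    counts = Counter(outputs)
    modal_plan, modal_count = counts.most_common(1)[0]
    return {
        "modal_plan": modal_plan,           # the most frequent recommendation
        "agreement": modal_count / n_runs,  # 1.0 means fully reproducible
        "distinct_plans": len(counts),      # >1 signals stochastic drift on identical data
    }
```

In practice you would compare plans semantically (normalization or embedding similarity) rather than by exact string match, so that two phrasings of the same plan don't register as disagreement.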
This is a great addition and I fully agree. It then opens the question of what happens when there is a “disagreement” - does that require escalation or other mechanisms?
Another observation on this point: physician practice today introduces wide variation in care (this has been documented for decades). Although AI systems can reduce that variability, they move “all at once.” This creates a larger surface area for error when a change is introduced to the system.
One could imagine a world of phased deployments - introduce an update, monitor performance and impacts, then roll out to the entire system. Think stepped-wedge, but at scale.
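A minimal sketch of what that gating logic could look like, where the wave assignment and the `passes_monitoring` hook are purely illustrative assumptions:

```python
def wedge_schedule(sites, waves=4):
    """Assign sites to sequential deployment waves (stepped-wedge style)."""
    return [sites[i::waves] for i in range(waves)]

def roll_out(sites, passes_monitoring):
    """Introduce the update wave by wave, gating each step on monitoring."""
    deployed = []
    for wave in wedge_schedule(sites):
        deployed.extend(wave)                # introduce the update at these sites
        if not passes_monitoring(deployed):  # monitor performance and impacts
            return deployed, "halted"        # hold the rollout if metrics regress
    return deployed, "complete"
```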
Appreciate the frame - this is a much-needed concept. Should vendors incorporate this and report out so that buyers can evaluate, or should buyers (health systems, providers, etc.) run this continually on the AI systems they deploy?
Both are true. Larger buyers will be more involved in site-specific monitoring. You can think of it like the new telemetry lab.