Discussion about this post

Arooshi

Great framework, Byron. One dimension I’m thinking about that deserves explicit inclusion: appropriate non-use.

MedEval evaluates how well AI performs when consulted, but doesn’t address whether the consultation should have happened at all. In practice, the biggest failure mode may not be a wrong diagnosis but a system that encourages overconsumption of care. Health anxiety scales differently when you give people a frictionless 24/7 medical chatbot.

Should MedEval include metrics for how well a system closes conversations, redirects to real-world care, confidently recommends doing nothing, or reduces its own utilization over time?

Lindsay Unmessy Stortz

Very interesting as always. This stuck out: "other factors become more nuanced to 'measure' per se versus observe - especially complex, cascading failure modes and nuanced situational deviations." It's striking to consider that the greatest impact could come from a tiny error that cascades into catastrophe.

