Biomedical AI Models Struggle with Conflicting Evidence, Study Finds
A recent study published on arXiv found that biomedical large language models (LLMs) lose accuracy and consistency when presented with conflicting or reordered evidence. Researchers evaluated six open-weight LLMs on the HealthContradict dataset, simulating scenarios with no context, correct-only context, incorrect-only context, and mixed context. The results showed prediction ‘flips’, in which models changed their answers depending on the order of the evidence, and accuracy drops of up to 40% in conflicting contexts.
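To make the setup concrete, the sketch below shows one way such an evaluation could be scored. It is an illustration under stated assumptions, not the study's code: the example fields (`question`, `correct_doc`, `incorrect_doc`, `answer`), the `model` callable, the exact-match scoring, and the choice to run the mixed condition in both document orders are all hypothetical.

```python
from typing import Callable, Dict, List

# Context conditions named in the article. Splitting "mixed" into two
# orderings is an assumption made here so that order-driven flips can be counted.
CONDITIONS = [
    "no_context",
    "correct_only",
    "incorrect_only",
    "mixed_correct_first",
    "mixed_incorrect_first",
]


def build_prompts(question: str, correct_doc: str, incorrect_doc: str) -> Dict[str, str]:
    """Build one prompt per condition, with documents placed before the question."""

    def with_context(*docs: str) -> str:
        return "\n\n".join(list(docs) + [question])

    return {
        "no_context": question,
        "correct_only": with_context(correct_doc),
        "incorrect_only": with_context(incorrect_doc),
        "mixed_correct_first": with_context(correct_doc, incorrect_doc),
        "mixed_incorrect_first": with_context(incorrect_doc, correct_doc),
    }


def evaluate(model: Callable[[str], str], examples: List[Dict[str, str]]) -> Dict[str, float]:
    """Return per-condition accuracy and the share of examples whose answer
    changes when the two documents in the mixed condition swap places."""
    if not examples:
        raise ValueError("examples must not be empty")

    hits = {name: 0 for name in CONDITIONS}
    flips = 0
    for ex in examples:
        prompts = build_prompts(ex["question"], ex["correct_doc"], ex["incorrect_doc"])
        answers = {name: model(prompt).strip().lower() for name, prompt in prompts.items()}
        gold = ex["answer"].strip().lower()
        for name in CONDITIONS:
            # Exact-match scoring is a simplification chosen for this sketch.
            hits[name] += int(answers[name] == gold)
        flips += int(answers["mixed_correct_first"] != answers["mixed_incorrect_first"])

    n = len(examples)
    report = {f"accuracy/{name}": hits[name] / n for name in CONDITIONS}
    report["flip_rate"] = flips / n
    return report
```

Here `model` can be any function that takes a prompt string and returns an answer string, such as a thin wrapper around an LLM call. A model that simply sides with whichever document appears first would score a flip rate of 1.0 under this scheme, which is the kind of order sensitivity the study describes.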
The research team introduced a ‘conflict-aware abstention score’ to quantify model uncertainty in such scenarios. The metric aims to improve reliability by flagging cases where contradictory evidence makes confident predictions untrustworthy. The study highlights an important gap in current evaluation practices, which often prioritize accuracy under ideal conditions over robustness to real-world ambiguity.
Retrieval-augmented LLMs, which combine external knowledge with model-generated responses, are increasingly used in healthcare applications. However, the findings underscore risks in relying on these systems for critical decisions when source materials contain contradictions. The authors advocate for improved training methods and evaluation frameworks focused on conflict resolution.