Evaluating Large Language Model‐Assisted Emergency Triage: A Comparison of Acuity Assessments by GPT‐4 and Medical Experts

By: Gal Ben Haim · Mor Saban · Yiftach Barash · David Cirulnik · Amit Shaham · Ben Zion Eisenman · Livnat Burshtein · Orly Mymon · Eyal Klang

ABSTRACT

Aim

To evaluate the accuracy of Emergency Severity Index (ESI) assignments made by GPT-4, a large language model (LLM), compared with those made by senior emergency department (ED) nurses and physicians.

Method

An observational study of 100 consecutive adult ED patients was conducted. ESI scores were assigned by GPT-4, by triage nurses, and by a senior clinician. Both the model and the human experts were provided with the same patient data.

Results

GPT-4 assigned a lower median ESI score (2.0) than the human evaluators (median 3.0; p < 0.001). Because lower ESI scores denote higher acuity, this suggests a potential overestimation of patient severity by the LLM. GPT-4 and the human evaluators also differed in their triage assessment approaches, including how patient age and vital signs were considered in the ESI assignments.

Conclusion

While GPT-4 offers a novel methodology for patient triage, its propensity to overestimate patient severity highlights the need for further development and calibration of LLM tools in clinical environments. The findings underscore both the potential and the limitations of LLMs in clinical decision-making and support cautious integration of LLMs in healthcare settings.

Reporting Method

This study adhered to relevant EQUATOR guidelines for reporting observational studies.
