To explore whether the large language models (LLMs) Generative Pre-trained Transformer (GPT)-3, GPT-3.5 and GPT-4 can autonomously manage a virtual fracture clinic (VFC), as a marker of their efficacy in the emergency department setting and in managing simple orthopaedic trauma.
Simulated UK VFC workflow.
Eleven clinical scenarios were generated, and GPT-4, GPT-3.5 and GPT-3 were each prompted to write clinic letters and management plans.
The Readable Tool was used to assess the clarity of letters. Six independent orthopaedic surgeons then evaluated the accuracy of letters and management plans.
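Readability tools of this kind typically report formula-based scores such as the Flesch Reading Ease and the Flesch-Kincaid grade level. The following is a minimal sketch of those two standard formulas, not the Readable Tool's own implementation. The function names and the choice to pass word, sentence and syllable counts in directly are assumptions made for illustration; how a given tool counts syllables may differ.

```python
# Illustrative sketch only: standard Flesch formulas applied to
# pre-computed counts. Function names are hypothetical.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate US school grade needed."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical letter: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 2))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both scores depend only on average sentence length and average syllables per word, so they measure surface complexity of a letter and say nothing about its clinical accuracy.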
Readability was compared using the Flesch-Kincaid grade level (GPT-4: 9.11 (SD 0.98); GPT-3.5: 8.77; GPT-3: 8.47) and the Flesch Reading Ease score (GPT-4: 56.3; GPT-3.5: 58.2; GPT-3: 59.3). Surgeon-rated accuracy comparisons indicated that GPT-4 exhibited the highest accuracy for management plans (9.08/10 (95% CI 8.25 to 9.9)). This represents a statistically significant progression in the capacity of an LLM to provide accurate management plans compared with GPT-3 at 6.84 (95% CI 5.41 to 8.27) and GPT-3.5 at 7.63 (95% CI 7.23 to 8.13) (p
LLMs can produce high-quality, readable clinical letters for common VFC presentations, and GPT-4 can generate management plans to aid clinicians in their administration. With clinician oversight, appropriately trained LLMs could meaningfully reduce routine administrative work. However, while the results of this study are promising, further evaluation of LLMs is required before they can be deemed safe for managing simple orthopaedic scenarios.