BMJ Open

How have generic large language models progressed in their ability to write clinic letters and provide accurate management plans in the virtual fracture clinic?

By: Smith, A., Brock, J., Jones, H., Solari, F., Anss, R., Kimberley, C., Joyner, C., Yasin, T., Basbous, O., Poacher, A. T. - December 16th 2025 at 05:04

Objective

To explore whether the large language models (LLMs) Generative Pre-trained Transformer (GPT)-3, GPT-3.5 and GPT-4 can autonomously manage a virtual fracture clinic (VFC), as a marker of their efficacy in an emergency department setting and with simple orthopaedic trauma.

Setting and participants

Simulated UK VFC workflow.

Design

Eleven clinical scenarios were generated, and GPT-4, GPT-3.5 and GPT-3 were each prompted to write clinic letters and management plans.

Main outcome measures

The Readable Tool was used to assess the clarity of letters. Six independent orthopaedic surgeons then evaluated the accuracy of letters and management plans.
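
Readability scores of the kind reported in this study (Flesch-Kincaid grade level and Flesch Reading Ease) are computed from the average sentence length and the average number of syllables per word. The following is a minimal sketch using the standard published formulas. The function names and the vowel-group syllable counter are illustrative assumptions, and the output should not be expected to match the Readable Tool, whose internal counting rules are not described here.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate (an assumption, not a dictionary lookup):
    count groups of consecutive vowels and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid grade level, Flesch Reading Ease) for text."""
    # Naive sentence and word splitting; real tools handle abbreviations etc.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)

    # Standard Flesch-Kincaid grade level formula.
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    # Standard Flesch Reading Ease formula.
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    return grade, ease
```

A higher grade level and a lower reading ease score both indicate text that is harder to read, so the two measures move in opposite directions for the same letter.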

Results

Readability was compared using the Flesch-Kincaid grade level (GPT-4: 9.11 (SD 0.98); GPT-3.5: 8.77; GPT-3: 8.47) and the Flesch Reading Ease score (GPT-4: 56.3; GPT-3.5: 58.2; GPT-3: 59.3). Surgeon-rated accuracy comparisons indicated that GPT-4 exhibited the highest accuracy for management plans (9.08/10 (95% CI 8.25 to 9.9)). This represents a statistically significant progression in the capacity of an LLM to provide accurate management plans compared with GPT-3 at 6.84 (95% CI 5.41 to 8.27) and GPT-3.5 at 7.63 (95% CI 7.23 to 8.13) (p

Conclusions

LLMs can produce high-quality, readable clinical letters for common VFC presentations, and GPT-4 can generate management plans to aid clinicians in their administration. With clinician oversight, appropriately trained LLMs could meaningfully reduce routine administrative work. However, while the results of this study are promising, further evaluation of LLMs is required before they can be deemed safe for managing simple orthopaedic scenarios.