by Viet Anh Nguyen, Van Hung Nguyen, Thi Quynh Trang Vuong, Quoc Thanh Truong, Thi Trang Nguyen
Large language models (LLMs) are increasingly explored as diagnostic copilots in digital pathology, but it remains unknown whether the newest reasoning-augmented architectures provide measurable benefits over earlier generations. We compared OpenAI’s o3 model, which uses an iterative planning loop, with the baseline GPT-4o on 459 oral and maxillofacial (OMF) cases drawn from standard textbooks. Each case consisted of two to five high-resolution haematoxylin-and-eosin micrographs, and both models were queried in zero-shot mode with an identical prompt requesting a single diagnosis and supporting microscopic features. Overall, o3 correctly classified 31.6% of cases, significantly surpassing GPT-4o at 18.7% (Δ = 12.9 percentage points, P
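The comparison protocol summarised above (same case images, identical zero-shot prompt, two models) can be sketched as follows. This is a minimal illustration assuming the OpenAI Python SDK; the model identifiers ("o3", "gpt-4o"), the prompt wording, and the file names are hypothetical stand-ins, not the authors' released code or exact prompt.

```python
# Minimal sketch of a zero-shot, identical-prompt comparison across two models.
# Assumptions: OpenAI Python SDK (openai>=1.0), OPENAI_API_KEY in the environment,
# and illustrative prompt text / model names / file paths.
import base64
from openai import OpenAI

client = OpenAI()

PROMPT = (  # hypothetical wording; the paper's exact prompt is not reproduced here
    "You are a pathologist. Based on the attached H&E micrographs, give a single "
    "most likely diagnosis and the supporting microscopic features."
)

def encode_image(path: str) -> str:
    """Base64-encode a micrograph for inline transmission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def query_model(model: str, image_paths: list[str]) -> str:
    """Send one case (two to five micrographs) to a model with the identical prompt."""
    content = [{"type": "text", "text": PROMPT}]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Same case, same prompt, both models -- the head-to-head design described above.
for model in ("o3", "gpt-4o"):
    print(model, query_model(model, ["case001_1.jpg", "case001_2.jpg"]))
```

Because the prompt and images are held constant, any difference in accuracy between the two runs can be attributed to the model rather than to the querying procedure.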