Conducting bias assessments in systematic reviews is a time-consuming process that involves subjective judgments. The use of artificial intelligence (AI) technologies to perform these assessments can potentially save time and enhance consistency. Nevertheless, the efficacy of AI technologies in conducting bias assessments remains inadequately explored.
This study aims to evaluate the efficacy of ChatGPT-4o in assessing bias using the revised Cochrane RoB2 tool, focusing on randomized controlled trials in nursing.
ChatGPT-4o was provided with the RoB2 assessment guide as a PDF document and instructed to perform bias assessments for the 80 open-access RCTs included in the study. Its domain-level judgments were then compared with those reported by the authors of the meta-analyses that had assessed the same RCTs, using weighted Cohen's kappa.
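The agreement statistic used here can be sketched in a few lines. The following is a minimal illustration of weighted Cohen's kappa for two raters on an ordered three-level scale, not the study's analysis code. The level labels, the function name `weighted_kappa`, and the choice of linear weights are assumptions made for the example; the weighting scheme actually used in the study is not stated here.

```python
# Hypothetical ordered judgement levels; the labels are an assumption for this sketch.
LEVELS = ["low", "some concerns", "high"]


def weighted_kappa(rater_a, rater_b, levels=LEVELS):
    """Linear-weighted Cohen's kappa for two equally long lists of ordinal ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")

    index = {level: i for i, level in enumerate(levels)}
    k = len(levels)
    n = len(rater_a)

    # Observed joint proportions: observed[i][j] is the share of items
    # rated level i by rater A and level j by rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Linear disagreement weights: 0 on the diagonal, 1 for the most distant levels.
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0:
        # Both raters used a single identical level; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the two raters disagreed more often than chance alone would predict.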
Weighted Cohen's kappa values were highest for bias in the measurement of the outcome (D4, 0.22) and bias arising from the randomization process (D1, 0.20), whereas the negative values for bias due to missing outcome data (D3, −0.12) and bias in the selection of the reported result (D5, −0.09) indicated poor agreement. Accuracy was highest in D5 (0.81) and lowest in D1 (0.60). F1 scores were highest for bias due to deviations from intended interventions (D2, 0.74) and lowest in D3 (0.00) and D5 (0.00). Specificity was high in D5 (0.93) and D3 (0.82), whereas sensitivity and precision were low in these domains.
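A domain can combine high accuracy with an F1 score of zero when positive judgments are rare and the two raters never flag the same trial. The sketch below illustrates this with invented toy data; it assumes the judgments have been collapsed to a binary label (flagged or not flagged), which is an assumption of the example and not a description of how the study dichotomized its ratings. The function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(reference, predicted):
    """Accuracy, sensitivity, specificity, precision and F1 for two boolean lists."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    fn = sum(1 for r, p in pairs if r and not p)

    def ratio(numerator, denominator):
        # Report 0.0 when a metric is undefined (empty denominator).
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy data (invented for illustration): 10 trials, one flagged by the
# reference raters and a different one flagged by the model.
reference = [True] + [False] * 9
predicted = [False, True] + [False] * 8
metrics = binary_metrics(reference, predicted)
```

In this toy case the two raters agree on eight of ten trials, so accuracy and specificity are high, yet there are no true positives, so sensitivity, precision, and F1 are all zero.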
Agreement between ChatGPT-4o and the meta-analysis authors on the same RCTs was generally low. This indicates that ChatGPT-4o requires substantial improvement before it can be used as a reliable tool for risk-of-bias assessment.
AI-based tools have the potential to expedite bias assessment in systematic reviews. However, this study demonstrates that ChatGPT-4o, in its current form, lacks sufficient consistency with human assessors, indicating that such tools should be integrated cautiously and used under continuous human oversight, particularly in evidence-based evaluations that inform clinical decision-making.