Few artificial intelligence (AI) clinical decision support systems (CDSSs) are ever evaluated in practice. Although some signal of clinical effectiveness may be needed to justify AI deployment and testing, such data are typically unavailable in early-stage research. This conundrum is especially relevant in the intensive care unit (ICU), where conditions like sepsis and acute respiratory distress syndrome (ARDS) require high-stakes decisions. Our group developed the AI ventilator assistant (AVA), a novel AI CDSS for patients with sepsis and ARDS receiving invasive mechanical ventilation. However, promising predictive performance estimates alone are not sufficient to establish AVA’s clinical safety and appropriateness prior to future evaluation and deployment. We therefore propose a Clinician Turing Test as a novel validation approach to determine whether clinicians can distinguish AVA-generated treatment recommendations from those enacted by real human clinicians. If AVA’s recommendations are consistently indistinguishable from those of real clinicians, thereby ‘passing’ this Turing test, this would provide a strong preclinical signal of safety and appropriateness.
This multisite, randomised, electronic, vignette-based Phase 1b study will use a Clinician Turing Test design. We aim to recruit 350 critical care clinicians, including physicians and advanced practice providers, from six US hospitals. Participants will review nine clinical vignettes of patients with sepsis and ARDS derived from the Molecular Epidemiology of Severe Sepsis in the ICU cohort, each paired with a profile of a suggested treatment plan. For each participant–vignette combination, the source of the treatment profile will be randomly assigned (AI-generated by AVA vs the actually enacted treatment from real human clinicians) in a 1:1 allocation. The primary endpoint is the participants’ accuracy in identifying whether a treatment profile was AI-generated or human-generated, assessed using equivalence testing through a mixed-effects logistic regression model with random effects for participants and vignettes. Secondarily, a fitted binary classifier will assess discrimination ability using the C-statistic. Secondary endpoints include clinicians’ perceptions of the safety and appropriateness of the treatment profiles, confidence in distinguishing AI-generated and human-generated recommendations, interest in AI CDSSs for sepsis and ventilator management, and the time to complete the survey. This novel Phase 1b design provides preliminary but essential information about an AI CDSS’s clinical appropriateness without the risk or cost of actual deployment, thereby informing decisions about future clinical implementation and evaluation in real clinical environments.
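The primary analysis described above can be sketched in Python. This is a minimal illustration, not the study's actual analysis code: it assumes statsmodels' variational-Bayes binomial mixed GLM as a stand-in for the protocol's mixed-effects logistic model, and the simulated responses, the reduced sample size, and the 40-60% equivalence margin around chance are all illustrative assumptions rather than study parameters.

```python
# Sketch of the primary endpoint: equivalence testing of clinicians'
# source-identification accuracy via a mixed-effects logistic model
# with random effects for participants and vignettes.
# NOTE: simulated data and an illustrative 40-60% equivalence margin;
# the protocol's actual margin and estimator may differ.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(42)
n_participants, n_vignettes = 40, 9  # scaled down from the planned 350 participants

# Simulate correct/incorrect identifications at chance (fixed intercept
# = 0 on the logit scale) with modest participant and vignette effects.
p_eff = rng.normal(0.0, 0.3, n_participants)
v_eff = rng.normal(0.0, 0.3, n_vignettes)
records = []
for p in range(n_participants):
    for v in range(n_vignettes):
        logit = 0.0 + p_eff[p] + v_eff[v]
        prob = 1.0 / (1.0 + np.exp(-logit))
        records.append({"participant": p, "vignette": v,
                        "correct": int(rng.random() < prob)})
df = pd.DataFrame(records)

# Mixed-effects logistic regression: random intercepts for participants
# and vignettes; the fixed intercept captures overall accuracy.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ 1",
    {"participant": "0 + C(participant)", "vignette": "0 + C(vignette)"},
    df,
)
result = model.fit_vb()

# Equivalence check: is the 95% interval for overall accuracy fully
# inside the hypothetical 40%-60% margin around chance (50%)?
lo = result.fe_mean[0] - 1.96 * result.fe_sd[0]
hi = result.fe_mean[0] + 1.96 * result.fe_sd[0]
acc_lo, acc_hi = 1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi))
equivalent_to_chance = (acc_lo > 0.40) and (acc_hi < 0.60)
print(f"accuracy 95% CI: ({acc_lo:.3f}, {acc_hi:.3f}), "
      f"equivalent to chance: {equivalent_to_chance}")
```

Because equivalence (rather than superiority) is the goal, the sketch declares AVA indistinguishable from human clinicians only when the entire confidence interval for identification accuracy falls within the pre-specified margin around the 50% chance level.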
This protocol was approved by the Institutional Review Board of the University of Pennsylvania (Protocol #858201). Results are expected in 2026 and will be submitted for publication in peer-reviewed journals and presented at scientific conferences.
by Shirley Ge, Hope Lappen, Luz Mercado, Kaylee Lamarche, Theodore J. Iwashyna, Catherine L. Hough, Virginia W. Chang, Adolfo Cuevas, Thomas S. Valley, Mari Armstrong-Hough
Background: Racial and ethnic disparities in the delivery and outcomes of critical care are well documented. However, interventions to mitigate these disparities are less well understood. We sought to review the current state of evidence for interventions to promote equity in critical care processes and patient outcomes.
Methods: Four bibliographic databases (MEDLINE/PubMed, Web of Science Core Collection, CINAHL, and Embase) and a list of core journals, conference abstracts, and clinical trial registries were queried with a pre-specified search strategy. We analyzed the content of interventions by categorizing each as single- or multi-component, extracting each intervention component during review, and grouping intervention components according to strategy to identify common approaches.
Results: The search strategy yielded 11,509 studies. After removal of 7,017 duplicates, 4,491 studies remained for title and abstract screening. After screening, 93 studies were included for full-text review. After full-text review by two independent reviewers, 11 studies met eligibility criteria. We identified ten distinct intervention components under five broad categories: education, communication, standardization, restructuring, and outreach. Most studies examined effectiveness using pre-post or other non-randomized designs.
Conclusions: Despite widespread recognition of disparities in critical care outcomes, few interventions have been evaluated to address disparities in the ICU. Many studies did not describe the rationale or targeted disparity mechanism for their intervention design. There is a need for randomized, controlled evaluations of interventions that target demonstrated mechanisms for disparities to promote equity in critical care.