D. J. Wu¹ and J. E. Bibault²; ¹Stanford University, Palo Alto, CA; ²Georges Pompidou European Hospital, Paris, 75015, France
Purpose/Objective(s): Recent advances in artificial intelligence, such as large language models (LLMs), offer a promising avenue for enhancing clinical documentation and monitoring patient-reported outcomes (PROs). This study compares four leading open-source and proprietary LLMs, Mixtral-8x7B, Llama-2, Qwen-1.5, and GPT-4, in generating summaries of patient-reported symptoms, evaluated with an adapted Physician Documentation Quality Index (PDQI).
Materials/Methods: A previously reported web-based application using 35 items from the PRO-CTCAE scale provided an interactive form through which breast cancer patients could report treatment-related symptoms. The four LLMs generated natural language summaries for four hypothetical patients from non-identifiable patient data. Twelve resident physician raters evaluated the summaries using an abbreviated PDQI questionnaire, scoring accuracy, usefulness, comprehensibility, and succinctness on a 5-point Likert scale. IRB approval was not required under the NIH 2018 Revised Common Rule requirements because the study used researcher-generated, non-identifiable data.

Results: Forty-seven physician ratings were collected. A repeated measures ANOVA showed significant differences in accuracy among the models (F(2.120, 97.52) = 15.30, p < 0.0001), with Mixtral-8x7B (M = 3.60), GPT-4 (M = 3.78), and Qwen-1.5 (M = 3.62) significantly surpassing Llama-2 (M = 2.77, p = 0.001) and no significant differences among the three. Mixtral-8x7B (M = 3.47) outperformed Llama-2 in usefulness (M = 2.87, p < 0.05) and outperformed both Llama-2 and Qwen-1.5 in succinctness (p < 0.05). No significant differences were found in comprehensibility. Reviewers noted 6 mistakes in Llama-2 summaries and 1 mistake each in the Mixtral-8x7B and Qwen-1.5 summaries.

Conclusion: This study demonstrates that the latest open-source LLMs, such as Mixtral-8x7B and Qwen-1.5, can match the performance of their closed-source counterpart, GPT-4, on physician-rated measures of documentation quality and outperform their predecessor, Llama-2. These results highlight the narrowing gap between open-source and proprietary LLMs in medical documentation and may inform the strategic selection of cost-effective, data-safe LLMs for future clinical research and practice, potentially democratizing advanced AI tools for a broader healthcare audience. Further validation in real-world clinical settings is needed to assess the impact of these models on the efficiency and efficacy of patient care.
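For context, the omnibus comparison reported above is a one-way repeated-measures ANOVA with a Greenhouse-Geisser correction (hence the fractional degrees of freedom), followed by pairwise post-hoc comparisons. The sketch below shows how such an analysis could be run in Python; the pingouin library, the file name, and the column names (rater, model, accuracy) are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a repeated-measures ANOVA on the accuracy ratings,
# with Greenhouse-Geisser correction and Bonferroni-adjusted post-hoc
# comparisons. The input layout is assumed, not taken from the study.
import pandas as pd
import pingouin as pg

# Long format: one row per rating, e.g.
#   rater, model, accuracy
#   R01,   GPT-4, 4
df = pd.read_csv("pdqi_ratings.csv")

# Omnibus test across the four models; correction=True reports the
# Greenhouse-Geisser corrected degrees of freedom and p-value
aov = pg.rm_anova(dv="accuracy", within="model", subject="rater",
                  data=df, correction=True)
print(aov)

# Pairwise comparisons between models (Bonferroni-adjusted)
posthoc = pg.pairwise_tests(dv="accuracy", within="model", subject="rater",
                            data=df, padjust="bonf")
print(posthoc)
```

The same pattern, with the dependent variable swapped to the usefulness, comprehensibility, or succinctness column, would cover the remaining PDQI domains reported above.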