Assessing unknown potential—quality and limitations of different large language models in the field of otorhinolaryngology

Abstract
Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.
Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).
Material and methods: Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. The answers given were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.
Results: LLM answers were rated inferior to those of consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/246) for Bard 2023.07.13, and 6% (71/1230) for consultants.
Conclusions and significance: Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on a larger scale.


Introduction
The impact of Artificial Intelligence (AI) and Large Language Models (LLMs) on society is anticipated to be as groundbreaking as the industrial revolution [1]. While the exact impact of AI is yet to become evident, the increasing influence of AI on many aspects of human society, in particular the delivery of healthcare, is already without question. Delivery of healthcare is currently facing huge challenges due to economic pressure and demographic change, with a lack of access to adequate medical care commonplace in industrial as well as low- and middle-income countries [2][3][4]. LLMs may have the potential to overcome some of the challenges of healthcare delivery, especially in the areas of diagnosis, management, and referral [5].
For a long time, Natural Language Processing (NLP) tasks like question answering, reading comprehension, and summarization were typically performed by sequential models processing tasks in a word-by-word approach. However, sequential computation is limited in parallelization and inaccurate for large input data due to a lack of prioritization. In 2017, Vaswani et al. introduced the transformer model, providing a solution to these architecture-dependent deficits [6,7]. Unlike sequential models, the transformer model provides a self-attention mechanism tracing global dependencies between words, enabling significantly more parallelization and thus accurate processing of large data input [7]. Various well-known NLP tools including ChatGPT 4, Bard 2023.07.13, and Claude 2 rely on the transformer model.
Taking these developments into consideration and given the ubiquitous and low-barrier access, patients and caregivers alike are likely to consult LLMs on medical queries, especially in scenarios with limited access to medical care. A recent study by our working group showed that the LLMs ChatGPT 3.5 and ChatGPT 4 were inferior to specialists in the specific medical field of Otorhinolaryngology (ORL) [8]. Nevertheless, the performance of universal LLMs like ChatGPT 4 is respectable and promising, especially when considering the early stage of their development and that these universal models are not specially trained for medical purposes. Considering the current speed of development and improvement of universal LLMs, subsequent individual 'specialization' for certain fields or tasks, as well as the economic commitment, the applicability of LLMs for healthcare services will certainly increase. Taking this into account, a specific evaluation of current LLMs is of high importance. ORL is a highly specialized field of medical care. However, typical pathologies in the field comprise a broad variety of diseases ranging from relatively 'harmless' conditions to life-threatening disease. Symptoms associated with severe disease and harmless conditions may often overlap, highlighting the importance of a proper initial assessment. Moreover, owing to the high burden of disease, otolaryngology cases are a common focus of symptom-based internet searches by patients.
To assess the potential and limitations of LLM performance in the field of ORL [8][9][10][11][12][13][14][15], we benchmarked the medical performance of different LLMs including ChatGPT 4, Bard 2023.07.13, and Claude 2 against six experienced consultants working in both clinical and outpatient care by evaluating both the semantic qualities as well as the medical content of responses to case-based questions.

Materials and methods
One thousand four hundred case-based questions were retrieved from the ORL literature and German state examinations for doctors. Cases that did not correspond to equivalent realistic clinical scenarios in the University Medical Center of Mainz were excluded from our study. The questions covered the categories ear (n = 14), nose (n = 8), head and neck (n = 13), and tumor (n = 6). After assessment of all questions, 41 common and realistic ORL case-based clinical questions (the same as used in our previous study) were posed to the LLMs ChatGPT 4 (OpenAI, San Francisco, USA), Bard 2023.07.13, now known as Gemini (Google, Mountain View, USA), and Claude 2 (Anthropic, San Francisco, USA) in October 2023 [8]. For each LLM the base model was used without any tuning, reflecting the most likely scenario of patients investigating their own symptoms. The same questions were answered by six ORL consultants, four working in University Medical Centers and two based in an outpatient practice. The consultants had at least 7.5 years of clinical experience in ORL.
The answers were then blindly rated by the consultants in the categories of coherence, comprehensibility, conciseness, and medical adequacy on a 6-point Likert scale (1 = very poor and 6 = excellent). As a modified Turing test, consultants also recorded whether they felt the answer was generated by a human or an LLM [16]. To evaluate possible hazards, consultants also assessed each answer for potential jeopardy to patient well-being. Since rating is prone to bias, all answers given were also compared to the validated answers provided in the study books [17,18]. Finally, the character count for each answer was recorded.
For all collected data, normality was tested with the D'Agostino and Pearson test. Since the data did not show a Gaussian distribution, comparisons between two groups were conducted using the Mann-Whitney U test and multi-group comparisons using the Kruskal-Wallis test.
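As an illustration of the two-group comparison named above, the following minimal sketch computes the Mann-Whitney U statistic by direct pairwise counting. It is not the authors' analysis, which was run in GraphPad Prism; the function name `mann_whitney_u` and the example ratings are hypothetical, and the sketch returns only the U statistics, not a p value.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts, over all pairs (a, b), how often a value from group_a
    exceeds a value from group_b; ties contribute 0.5 each.
    """
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two statistics always sum to the number of pairs.
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical 6-point Likert ratings, invented for illustration only.
ratings_group_1 = [6, 5, 6, 5, 6]
ratings_group_2 = [4, 5, 3, 4, 5]
u_1, u_2 = mann_whitney_u(ratings_group_1, ratings_group_2)
```

Pairwise counting is quadratic in the sample size but makes the handling of ties, which are frequent in Likert data, explicit.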
To evaluate correlations between the rated parameters and the character count, the nonparametric Spearman correlation test was performed. Data were collected in Microsoft Excel sheets (Microsoft, Redmond, WA, USA) and all statistical testing was conducted using Prism for Windows (version 9.5.1; GraphPad Software, La Jolla, CA, USA).
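The Spearman correlation between character count and a rating can be sketched as the Pearson correlation of the rank-transformed values. The code below is a minimal illustration under that standard definition, not the Prism procedure used in the study; the helper names `average_ranks` and `spearman_rho` and the example data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical character counts and conciseness ratings, for illustration only.
char_counts = [310, 450, 520, 780, 990]
conciseness = [6, 5, 5, 3, 2]
rho_example = spearman_rho(char_counts, conciseness)
```

With these invented values, longer answers receive lower conciseness ratings, so `rho_example` is negative.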
A comparison of the character count is shown in Figure 2. Correlations between the number of characters used and the specific qualities evaluated are described in Table 1. For answers by the ORL consultants, the ratings for Medical Adequacy, Comprehensibility, and Coherence showed a strong positive correlation with the number of characters used, while Conciseness showed a mild negative correlation. In contrast, for answers by ChatGPT 4, a negative correlation between the number of characters and Conciseness was identified, while for answers by Bard 2023.07.13 a higher character count correlated negatively with Medical Adequacy and Coherence.
Except for comprehensibility, where all LLMs were rated comparably, statistically significant differences between the three tested LLMs were found. Of all tested LLMs, Claude 2 was rated best in the categories of medical adequacy and conciseness and was only slightly surpassed by ChatGPT 4 for coherence. Bard 2023.07.13, on the other hand, received the lowest ratings in every category. These results in no way reflect the overall capabilities of the models but solely the ratings in our specific field of analysis [29].
In concordance with our previously published data, the consultants outperformed the LLMs in every rated category (Medical Adequacy, Comprehensibility, Coherence, and Conciseness) [8]. While the ratings strongly suggest the superiority of ORL specialists over the LLMs in answering case-based clinical questions, the high overall quality of answers must also be considered. The comprehensibility of answers received the highest overall ratings for all LLMs. In this category, differences between LLMs and ORL consultants, while still statistically significant, were least pronounced. Taking these findings into account and considering the high ratings for coherence, our results underline the very high quality of semantic output now being generated by LLMs. In contrast, the overall ratings for medical adequacy, arguably the most important quality evaluated, show a more obvious discrepancy between the ORL specialists and the LLMs. Notably, the ratings for all LLMs are still impressively high, with Claude 2 providing the most and Bard 2023.07.13 the least medically adequate answers. Intriguingly, ratings for the conciseness of answers showed the biggest discrepancies between the ORL consultants and the LLMs. This aspect is especially interesting in relation to the character count. The LLMs used significantly more characters per answer than the consultants, with Bard 2023.07.13 being the most verbose whilst achieving the lowest ratings in all evaluated categories. In contrast, Claude 2 used significantly fewer characters whilst receiving the highest ratings for conciseness and medical adequacy. In the analysis with the Spearman rank test, a negative correlation with the number of characters was detected only for the conciseness ratings of answers by ChatGPT 4 and for the coherence and medical adequacy ratings of answers by Bard 2023.07.13.
Although Claude 2 received the highest rating for medical adequacy among the LLMs, ChatGPT 4 performed best in covering the validated answers. While the consultant answers were consistent with the validated solution in 92.68% of cases, ChatGPT 4 achieved 85.37%, Claude 2 78.05%, and Bard 2023.07.13 58.54%. ChatGPT 4 performed notably better in our study (85.37%) than in other studies, such as Hoch et al., who reported 57% correct answers from ChatGPT 4 on ORL board certification preparation questions, and Chee et al., who found 75% correct answers on vertigo scenarios [10,11].
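The agreement percentages quoted above follow directly from the match counts reported in the abstract (228/246, 35/41, 32/41, and 24/41). The short sketch below reproduces that arithmetic; the variable and function names are illustrative only.

```python
# Matches with the validated solution, as (matching answers, total answers),
# taken from the counts reported in the abstract.
match_counts = {
    "Consultants": (228, 246),
    "ChatGPT 4": (35, 41),
    "Claude 2": (32, 41),
    "Bard 2023.07.13": (24, 41),
}


def percent(matching, total):
    """Return the share of matching answers as a percentage, rounded to 2 decimals."""
    return round(100 * matching / total, 2)


rates = {name: percent(m, t) for name, (m, t) in match_counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}%")
```

The consultant denominator of 246 corresponds to six consultants each answering the 41 questions, whereas each LLM answered the 41 questions once.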
Interestingly, all LLMs performed poorly in interpreting the Weber test, which can be considered a relatively simple 'transfer task'. In contrast, all consultants answered the two questions dealing with Weber test results correctly (12/12 = 100%). ChatGPT 4 was the only LLM that generated a correct answer, and only for one of the two questions regarding the Weber test (1/6 = 16.67% across all LLMs). This example illustrates the LLMs' potential to generate human-like responses without the ability to 'think' like an experienced human counterpart. One possible explanation for the poor performance in the Weber test is a lack of sufficient training data, which may result in 'hallucination'. Alternatively, deficits of LLMs in the detection of context may be responsible. Ultimately, due to the closed architecture and training data of the LLMs, their decision-making is a black box and a definite explanation for the wrong responses cannot be given.
In this regard, the capacity to prioritize certain symptoms in relation to the prevalence and likelihood of certain diseases is what currently sets the ORL consultants apart. While the ORL specialists usually provided the most likely diagnosis and added a focused and relevant differential, LLMs provided a much broader, less focused differential diagnosis, mostly without any prioritization relevant to the clinical case [15]. While this limitation can be addressed by using prompts asking for ranking and structured answers [13], it is unlikely that patients would use this approach. However, for professional use in clinical practice, such prompts should be considered to generate more precise output. In the future, LLM services may provide specific options for medical consultancy or accessible user training on specific ORL topics. In time, more specialized training and narrowing down to a thematic field could result in more accurate and concise responses.
Case-based questions and Likert rating systems have limitations. Case-based questions are advantageous because of their objective, validated format with corresponding answers and their coverage of a broad range of ORL cases. Their validated wording reduces the risk of miscommunication but does not emulate the questions more likely to be posed by real patients. Further studies should therefore also evaluate questions originating from patients to account for these factors. Moreover, a six-point Likert scale, while suited for this type of study, has statistical limitations that must be taken into consideration when interpreting the results. While rating is always prone to bias, matching the answers given to validated answers can be considered objective. Since this comparison showed results similar to the ratings, the rating system seems valid, although personal preferences may have influenced the ratings.
Accepting these limitations, this study still provides important new evidence for the diagnostic capability of this technology. While consultants are still superior to LLMs, the gap between consultants and the LLMs is small. In a world suffering from a shortage of medical specialists and medical care, these results are promising, especially in low-resource settings, where internet access is often available but qualified medical personnel are scarce. At present, some hazards to patients are still present in the responses from LLM-based chatbots, so they should not be a substitute for a consultation with a trained professional. A real-life consultation offers much more, such as non-verbal communication, physical examination, laboratory results, and imaging. These are crucial aspects of a consultation that LLMs simply cannot provide at present. Nevertheless, in the future, the combination of text and image analysis may enable LLMs to overcome some of these limitations. Although the upsides are obvious, critical aspects like the safety of personal data have to be carefully addressed [30]. Moreover, reliability (reproducibility) is also an important factor in the comparative evaluation of LLM queries and consultants alike [31]. Future studies should evaluate LLMs' capabilities with real cases featuring aspects like miscommunication, machine-patient interaction, and reproducibility.

Disclosure statement
SK is the founder and shareholder of MED.digital.

Figure 2. The number of characters per answer used by ORL consultants and the different LLMs (ChatGPT 4, Bard 2023.07.13, Claude 2). Data are shown as a scatter dot plot, with each point representing an absolute value. Horizontal lines represent the median. Normality was tested with the D'Agostino and Pearson test. Multi-group comparisons were performed using the Kruskal-Wallis test. **** p < .001.