Abstract
Objective
Large language models (LLMs) are used in various fields for their ability to produce human-like text. They are particularly useful in medical education, where they can support residents in developing clinical management skills and preparing for exams. This study aimed to evaluate and compare the performance of ChatGPT (GPT-4), Gemini, and Bing with each other and with otorhinolaryngology residents in answering in-service training exam questions, and to provide insights into the usefulness of these models in medical education and healthcare.
Methods
Eight otorhinolaryngology in-service training exams were used for comparison. A total of 316 questions, prepared from the Resident Training Textbook of the Turkish Society of Otorhinolaryngology Head and Neck Surgery, were presented to the three artificial intelligence models. The exam results were evaluated to determine the accuracy of the models and the residents.
Results
GPT-4 achieved the highest accuracy among the LLMs at 54.75% (GPT-4 vs. Gemini p=0.002, GPT-4 vs. Bing p<0.001), followed by Gemini at 40.50% and Bing at 37.00% (Gemini vs. Bing p=0.327). However, senior residents outperformed all LLMs and other residents with an accuracy rate of 75.5% (p<0.001). The LLMs could only compete with junior residents. GPT-4 and Gemini performed similarly to juniors, whose accuracy level was 46.90% (p=0.058 and p=0.120, respectively). However, juniors still outperformed Bing (p=0.019).
Conclusion
The LLMs currently have limitations in achieving the same medical accuracy as senior and mid-level residents. However, their relatively strong performance in specific subspecialties indicates potential usefulness in certain medical fields.
Introduction
The emergence of artificial intelligence (AI) has drawn considerable attention to large language models (LLMs), a class of natural language processing tools. These models are highly proficient in processing and generating text that resembles human language. They are created using advanced deep-learning techniques and comprehensive datasets drawn from the internet (1).
LLMs are sophisticated AI systems designed to understand, interpret, and generate human language in a meaningful and contextually relevant way. They can identify patterns, comprehend context, and link various pieces of information, abilities that make them capable of providing insightful responses and advice on a vast range of subjects (2).
These models, including OpenAI’s ChatGPT, Google’s Gemini, and Microsoft’s Bing, are trained using machine learning, where they learn to predict and generate text based on the patterns they observe in the training data. This allows them to perform language-related tasks like translation, summarization, and question-answering (3).
A variety of studies have explored the effectiveness of LLMs on professional examinations. Research in other professions has covered University of Minnesota Law School exams, the Bar Exam, the Wharton Master of Business Administration exam, and accounting exams, even without fine-tuning the pre-trained models (4-7). In the medical field, some studies have assessed the performance of ChatGPT on the United States Medical Licensing Examination (8, 9). Furthermore, several studies have compared the performance of different LLMs on examinations such as the Ophthalmic Knowledge Assessment Program and the Neurosurgery Oral Board exam (10, 11).
Some studies in the field of otorhinolaryngology have explored the effectiveness of ChatGPT. One study examined the success rate of ChatGPT and found that it could pass the Royal College of Physicians and Surgeons of Canada Otorhinolaryngology Board Exam (12). Another study explored the usefulness of ChatGPT in the board preparation process by examining its performance on quiz questions across various otolaryngologic subspecialties (13).
This study aimed to evaluate the performance of three LLMs, namely ChatGPT, Gemini, and Bing, and of resident surgeons in the otorhinolaryngology in-service training examination (ORLITE). The study also intended to compare the accuracy of each model across various otorhinolaryngology topics. We believe that exploring the potential effectiveness of AI in medical education can assist residents and medical students and potentially improve their exam performance.
Methods
Study Design
This cross-sectional observational study compared the responses of three LLMs (ChatGPT, Google Gemini, and Microsoft Bing) and otorhinolaryngology residents on the ORLITE. The study did not require ethics committee approval, as it relied solely on the question database of the university clinic, which was derived from publicly available online medical textbooks.
Development and Implementation of the ORLITE
ORLITE is an exam designed to assess the periodic competencies of residents specializing in otorhinolaryngology at a tertiary-level university hospital. The questions for the ORLITE are based on the Resident Training Textbook, which is available on the Turkish Society of Otorhinolaryngology-Head and Neck Surgery website (https://www.kbb.org.tr). The exam content is created collaboratively by five experienced, board-certified faculty members, ensuring a consensus-driven approach. The exam is conducted four times in a single academic year, and each session consists of 40 questions, including multiple-choice, multiple-select, free-response, and image-based questions. The questions cover general otorhinolaryngology, otology/neurotology, rhinology, head and neck surgery, and laryngology, and are prepared in Turkish. Each correct answer is awarded 2.5 points, with no negative marking for incorrect answers. Residents are given 40 minutes to complete the exam.
Selection of the Questions and Querying Process
Three hundred and twenty questions retrieved from eight ORLITE sessions administered over the two most recent academic years, 2021-2022 and 2022-2023, were reviewed. Four of these questions were excluded due to incomprehensible image-based content, leaving 316 multiple-choice, multiple-select, short-answer, and image-based questions, which were presented to each LLM separately. The querying process was conducted from February 15 to February 18, 2024, by an otorhinolaryngology specialist using the website of each model. Each question was asked individually, and the page was refreshed each time to prevent the LLM from establishing connections with previous questions and forming memory. Before each question, the models were prompted with the following message: “Hello, you are a physician currently undergoing training in otorhinolaryngology. You will be answering questions related to the resident training exam conducted at the otorhinolaryngology department of the university clinic. You are only required to indicate the correct option. Are you ready?” The generated responses were marked as correct or incorrect and recorded.
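For illustration only, the per-question querying workflow above could be scripted as in the following sketch. The study itself queried each model through its web interface, so the ask_fresh_session function and the question-file columns used here are hypothetical placeholders rather than part of the study protocol.

```python
# Hypothetical sketch of the per-question querying protocol described above.
# The study queried each model through its web interface; ask_fresh_session
# is a placeholder for whatever interface is available, and the CSV column
# names are assumptions made for illustration.
import csv

PRIMER = (
    "Hello, you are a physician currently undergoing training in "
    "otorhinolaryngology. You will be answering questions related to the "
    "resident training exam conducted at the otorhinolaryngology department "
    "of the university clinic. You are only required to indicate the "
    "correct option. Are you ready?"
)


def ask_fresh_session(primer: str, question: str) -> str:
    """Send the primer and a single question in a brand-new session
    (the scripted equivalent of refreshing the page) and return the reply."""
    raise NotImplementedError("Replace with the model-specific interface.")


def run_exam(question_file: str, result_file: str) -> None:
    """Present every question separately and record correct/incorrect."""
    with open(question_file, newline="", encoding="utf-8") as fin, \
         open(result_file, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        writer.writerow(["question", "model_answer", "is_correct"])
        for row in csv.DictReader(fin):
            # A new session per question so no memory of earlier items persists.
            answer = ask_fresh_session(PRIMER, row["question"])
            is_correct = row["correct_option"].strip().lower() in answer.lower()
            writer.writerow([row["question"], answer, str(is_correct)])
```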
Description of LLMs and the Resident Surgeons
During the study, various chatbots capable of producing human-like responses were tested. The chatbots used were the subscription-based, paid version of ChatGPT, an upgraded version of ChatGPT 3.5 developed by OpenAI and released in March 2023; Gemini, a product of Google DeepMind introduced in December 2023; and Bing Chat, which is reported to use the ChatGPT architecture and was made available to Edge users by Microsoft in February 2023. The 22 human participants were resident surgeons who trained in the otorhinolaryngology department of the university hospital during the years indicated. They were divided into three groups based on the five-year specialization training period: the first 1.5 years as junior (3rd resident), the following two years as mid-level (2nd resident), and the final 1.5 years as senior (1st resident). There were 7 junior, 10 mid-level, and 5 senior residents. Their exam performance was then recorded as point scores.
Statistical Analysis
Statistical analyses were performed to evaluate the overall success rate of each chatbot model and resident group, calculated as the percentage of correct answers. Independent samples t-tests were applied to compare the accuracy values of the chatbots and the residents. Descriptive statistics, namely the mean and standard deviation, were used to evaluate performance on each exam. A significance level of α=0.05 was set. Statistical data analysis was performed using IBM SPSS Statistics for Windows, Version 28.0 (IBM Corp., Armonk, NY, released 2021).
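As a minimal illustration of this analysis, per-exam accuracy values for two groups could be summarized and compared with an independent samples t-test as in the following Python sketch; the accuracy lists are placeholder values, not the study data.

```python
# Minimal sketch of the accuracy comparison described above.
# The per-exam accuracy lists below are placeholder values, not the study data.
from statistics import mean, stdev

from scipy import stats

# Accuracy (%) on each of the eight ORLITE sessions for two example groups.
chatgpt_acc = [55.0, 47.5, 62.5, 50.0, 60.0, 52.5, 57.5, 52.5]   # placeholder
junior_acc = [45.0, 50.0, 40.0, 55.0, 47.5, 42.5, 50.0, 45.0]    # placeholder

# Descriptive statistics (mean and standard deviation) per group.
print(f"ChatGPT: mean = {mean(chatgpt_acc):.2f}%, SD = {stdev(chatgpt_acc):.2f}%")
print(f"Juniors: mean = {mean(junior_acc):.2f}%, SD = {stdev(junior_acc):.2f}%")

# Independent samples t-test with alpha = 0.05.
t_stat, p_value = stats.ttest_ind(chatgpt_acc, junior_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```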
Results
Evaluation of the Performance of LLMs versus Residents in the ORLITE
In the ORLITE, ChatGPT outperformed Gemini and Bing with an accuracy of 54.75%, establishing itself as the leading model (p=0.002 and p<0.001, respectively). Gemini and Bing achieved similar accuracies of 40.50% and 37.00%, respectively, with no significant difference between them (p=0.327). These results suggest stronger comprehension and logical reasoning abilities in ChatGPT relative to the other models.
The results revealed that senior residents had the highest accuracy rate of 75.50%, outperforming all LLMs and the other residents (p<0.001). Mid-level residents, with an accuracy of 63.45%, outperformed Gemini (p<0.001), Bing (p<0.001), and ChatGPT (p=0.013), although ChatGPT came closest to their performance. The LLMs were competitive only with junior residents. Junior residents achieved a success rate of 46.90%, outperforming Bing (p=0.019) but scoring numerically below ChatGPT; the differences between junior residents and ChatGPT and between junior residents and Gemini were not statistically significant (p=0.058 and p=0.120, respectively). Gemini and Bing showed the lowest accuracy scores overall. Table 1 summarizes, and Figure 1 illustrates, the performance of the LLMs compared with the residents.
T-tests were used to analyze the accuracy differences among the LLMs and between the LLMs and the residents. Statistical significance was set at p<0.05.
Comparison of the Accuracy Rates of LLMs and Residents in ORLITE Per Examination
We analyzed the accuracies across the eight ORLITE exams and recorded the results. The standard deviations for each model and resident group were as follows: ChatGPT, 7.60%; Gemini, 7.30%; Bing, 7.00%; senior residents, 5.80%; mid-level residents, 3.80%; and junior residents, 7.90%. The results were more consistent among senior and mid-level residents. Figure 2 illustrates the performance of the LLMs and residents for each ORLITE exam.
Investigation of the Accuracies of LLMs in Subspecialties of Otorhinolaryngology
We evaluated the performance of the LLMs across various subspecialties within otorhinolaryngology, including general Ear Nose Throat (ENT), otology, rhinology, laryngology, and head and neck surgery. ChatGPT demonstrated the highest accuracy across most subspecialties, with notable performance in head and neck surgery and rhinology, achieving accuracy rates of 59.40% and 55.60%, respectively. Gemini showed consistent but more moderate success; like ChatGPT, its highest accuracies were in head and neck surgery and rhinology, at 42.20% and 42.90%, respectively. Although generally lower than ChatGPT and Gemini in all subspecialties, Bing was competitive in laryngology, where it nearly matched ChatGPT with an accuracy rate of 48.10%. Figure 3 shows the accuracy rates of each model in the different otorhinolaryngology fields.
Examples of ORLITE questions and the corresponding responses provided by ChatGPT, Gemini, and Bing are presented in Table 2.
Discussion
Three AI models were used in the study: OpenAI ChatGPT Plus (ChatGPT), Google Gemini, and Microsoft Bing. ChatGPT Plus is accessed through a subscription, while the others are freely accessible on their websites. It is worth noting that Bing uses the ChatGPT infrastructure, whereas Gemini works with a different model, PaLM 2 (14, 15). This study is the first of its kind to evaluate the performance of AI models on otorhinolaryngology exams and to compare them with physicians specializing in the field.
In the previous literature, studies comparing the performance of different LLMs across various medical fields have reported inconsistent results. For instance, one study on a neurosurgery oral board exam preparation question bank concluded that ChatGPT was more effective than Gemini in responding to advanced knowledge queries, and another study on answering frequently asked questions about lung cancer found that ChatGPT was more accurate than Google Gemini (11, 16). However, a study on the Royal College of Ophthalmologists fellowship exams presented a contrasting perspective, showing that Bing Chat outperformed the other AI systems, including the lowest-ranked ChatGPT (17).
Previous studies have examined the performance of ChatGPT in otorhinolaryngology using various methodologies. Kuşcu et al. (18) investigated the performance of ChatGPT in answering frequently asked questions about head and neck cancers and found a high success rate, with 86.4% of responses being comprehensive and correct. Radulesco et al. (19) investigated the ability of ChatGPT to diagnose rhinological clinical cases accurately; it achieved a 62.5% rate of correct or plausible responses, and the stability of responses was moderate to high.
The performance of an LLM is mainly influenced by the model's architecture, the amount and diversity of training data, the duration of training, and the allocation of computational resources. To improve a model's effectiveness, it is crucial to apply optimization techniques and fine-tune hyperparameters during training. In fields like medicine, where knowledge bases evolve rapidly, up-to-date and relevant training data are vital. Customization through additional training for specific tasks or sectors can help optimize performance. Additionally, the linguistic and cultural diversity of the training data affects a model's effectiveness across different languages and cultural contexts (20).
LLMs are known to perform differently across languages, which is often linked to the amount and quality of available training data. Since a considerable proportion of online content is in English, LLMs usually have better comprehension capabilities and access to a broader knowledge base in English. The GPT-4 technical report states that GPT-3.5 and PaLM achieve 70.1% and 69.3% accuracy, respectively, on massive multitask language understanding benchmarks, whereas GPT-4 reaches 85.5% in English (21). For Turkish, an accuracy rate of 80% has been reported, close to Italian at 84.1%, German at 83.7%, and Korean at 77%, whereas the rate drops to 72.2% for Nepali, 71.8% for Thai, and 62% for Telugu (22).
Furthermore, the complexity of a language’s structure, its grammatical rules, and cultural factors can influence the model’s performance (23). Languages with more complex grammatical features, such as gender, case, and tense, may pose greater challenges for LLMs. Nonetheless, technological advancements and the increasing use of multilingual models have significantly improved performance in other languages. This progress can make language models more universally applicable, providing better services to users in different languages.
As the world advances in every domain, educational models continuously evolve from traditional to more technology-based styles (23). Residents in various specialty areas and medical students can benefit from using LLMs to enhance their learning and clinical experience. LLMs are advanced repositories of medical knowledge that provide instant access to a wide range of medical literature and research, making it easier to learn and to make evidence-based decisions. They offer personalized education by tailoring responses to specific queries, allowing residents and students to explore complex medical scenarios. LLMs also aid in developing differential diagnoses by suggesting possible conditions based on presenting symptoms, which can support clinical reasoning and decision-making (24).
In addition to their primary functions, these models can be used to interpret medical data such as laboratory results and radiographic images. The models can help medical professionals make more informed decisions by providing contextual information and potential implications. LLMs can also be used for language translation in medical contexts. This is particularly useful in understanding medical texts in various languages, promoting a more global medical perspective. LLMs also offer the potential for simulation-based learning, where residents and students can engage in virtual patient scenarios to enhance their diagnostic and therapeutic skills in a safe environment (25).
There are a few limitations to this study worth noting. Firstly, we examined only three of the most commonly used LLMs: ChatGPT, Gemini, and Bing. While these models are widely available, many others could be included in more comprehensive studies to further knowledge in this field. Secondly, the reproducibility of the chatbots' responses may vary from query to query, which should be kept in mind when interpreting the results. Thirdly, the questions used in the study were based on a single textbook, prepared in accordance with the universal medical literature by the Turkish Society of Otorhinolaryngology Head and Neck Surgery; different question formats based on other database sources may produce different outcomes. Lastly, although the questions were designed to comply with the International Test Commission guidelines (26), minor discrepancies may occur since they were prepared by a joint commission of five faculty members.
Conclusion
Our study suggests that LLMs currently have limitations in achieving the same medical accuracy as senior resident surgeons. However, ChatGPT approached the performance of mid-level residents and excelled in specific subspecialties, indicating potential usefulness in certain medical fields. Meanwhile, Gemini and Bing show promise as valuable resources for education and the initial stages of clinical support, as their accuracy levels are similar to those of junior residents. Nevertheless, the performance of these models varies across subspecialties, highlighting the need to develop and apply tailored LLMs that meet the requirements of each field.
Main Points
• Technology and artificial intelligence are becoming increasingly popular and integrated into our lives. Artificial intelligence products known as large language models (LLMs), such as chatbots, generate human-like responses and demonstrate problem-solving skills. Their ability to answer questions from exams such as the USMLE, the Bar Exam, and MBA examinations has been investigated.
• Although the performance of LLMs has been examined in cardiology, ophthalmology, orthopedics, obstetrics, gynecology, and otorhinolaryngology, a direct comparison of LLMs with their human counterparts has not yet been investigated.
• In this study, the performance of LLMs on exams administered for resident training at a university clinic was compared among the models and with residents at three levels of seniority, using the Resident Training Textbook of the Turkish Society of Otorhinolaryngology Head and Neck Surgery as the reference.
• ChatGPT was more successful than the other LLMs overall and across all subspecialties of otorhinolaryngology. Bing approached ChatGPT's performance in laryngology. The senior residents were the most successful, while ChatGPT approached the performance of the mid-level residents. ChatGPT and Gemini achieved results similar to those of the junior residents.
• LLMs are far from matching senior residents' knowledge, skills, and experience under current conditions. However, they can serve as supportive tools in the early years of resident training.