Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis
BMC Oral Health volume 25, Article number: 573 (2025)
Abstract
Background
Artificial intelligence (AI) has rapidly advanced in healthcare and dental education, significantly impacting diagnostic processes, treatment planning, and academic training. The aim of this study is to evaluate the performance differences among different large language models (LLMs) by analyzing their accuracy rates in answering multiple-choice oral pathology questions.
Methods
This study evaluates the performance of eight LLMs (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, Deepseek) in answering multiple-choice oral pathology questions from the Turkish Dental Specialization Examination (DUS). A total of 100 questions from 2012 to 2021 were analyzed. Questions were classified as “case-based” or “knowledge-based”. The responses were classified as “correct” or “incorrect” based on official answer keys. To prevent learning biases, no follow-up questions or feedback were provided after the LLMs’ responses.
Results
Significant performance differences were observed among the models (p < 0.001). ChatGPT o1 achieved the highest accuracy (96 correct, 4 incorrect), followed by Claude (84 correct), Gemini 2 and Deepseek (82 correct each). Copilot had the lowest performance (61 correct). Case-based questions showed notable performance variations (p = 0.034), where ChatGPT o1 and Claude excelled. For knowledge-based questions, ChatGPT o1 and Deepseek demonstrated the highest accuracy (p < 0.001). Post-hoc analysis revealed that ChatGPT o1 performed significantly better than most other models across both case-based and knowledge-based questions (p < 0.0031).
Conclusion
LLMs demonstrated variable proficiency in oral pathology questions, with ChatGPT o1 showing the highest accuracy. LLMs show promise as a supplementary educational tool, though further validation is required.
Background
Artificial intelligence (AI) is a rapidly developing branch of technology that aims to create intelligent systems capable of performing tasks that typically require human intelligence [1]. Over the last twenty years, rapid advances in big data, computational power, and AI algorithms have begun to simplify everyday life [2]. AI technologies are advancing in the health sciences, particularly in medicine and dentistry, and are driving significant changes in these fields; their potential is becoming increasingly evident in diagnostic processes, treatment planning, and education [3]. These advances, together with the integration of AI into routine tasks, are rapidly transforming healthcare and education. This transformation is increasingly extending into medical and dental education, offering various advantages to both students and educators [4]. By processing extensive data quickly, AI models and networks can enable objective and comprehensive clinical and histopathological analyses that may help improve treatment methods and prognostic outcomes [5]. However, only a limited number of studies have examined the extent to which these AI models process theoretical knowledge accurately and how they compare with human performance in multiple-choice examinations [6,7,8,9].
Oral and maxillofacial pathology (OMP) includes various diseases and disorders of the oral cavity, jaws, and facial structures. The accurate diagnosis and treatment of these pathologies require a detailed examination of their etiology, clinical findings, and histopathological features [10]. OMP, one of the cornerstones of dental education, plays a critical role in developing students’ clinical decision-making skills [11]. In clinical practice, because many oral diseases are asymptomatic in their early stages, comprehensive theoretical knowledge and clinical experience are necessary to establish an accurate diagnosis [12]. Consequently, the knowledge and skills imparted in oral pathology courses are of great importance not only in the clinical practice of dentists but also in their roles within multidisciplinary healthcare teams.
In this study, eight different models (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, Deepseek) were used to compare the performance of various AI models in answering multiple-choice oral pathology questions. These models belong to the category of large language models (LLMs) and possess the ability to analyze and respond to complex questions through natural language processing (NLP) algorithms [13]. The aim of this study is to evaluate the performance differences among these models by analyzing the accuracy rates with which each model answers multiple-choice oral pathology questions.
Materials and methods
This study was conducted on February 5, 2025, in Türkiye. Dental Specialization Entrance Examination (DUS) questions were posed in their original language (Turkish) to various LLMs (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, Deepseek). The DUS is an examination organized by the Student Selection and Placement Center (ÖSYM) to select dentists for entry into specialty training at universities in Türkiye. The exam was held twice a year (spring and fall) during 2012–2014, once a year between 2015 and 2021, and has again been conducted twice a year from 2023 onwards. The DUS consists of a total of 120 questions: 40 basic science questions and 80 clinical science questions. Oral pathology questions appear in a fixed number in the basic sciences section each year, while in the clinical sciences section they appear in varying numbers under different specialties. Questions on topics such as Odontogenic and Developmental Jaw Cysts, Odontogenic Tumors, Mucosal and Tongue Diseases, Salivary Gland Pathologies, Bone Disease Pathology, Infectious Diseases Affecting the Oral and Perioral Tissues, and Environmental Injuries, asked in both the basic and clinical sciences sections, were classified as oral pathology questions. In this study, all oral pathology questions from the DUS exams conducted between 2012 and 2021 were analyzed, and the performance of AI-based LLMs in answering these questions was evaluated. The questions were obtained from the ÖSYM website, where they are openly published (https://www.osym.gov.tr/TR,15070/dus-cikmis-sorular.html). Since this study does not involve human or animal data, ethics committee approval was not required.
Sample size determination
In this study, a power analysis was conducted to determine the minimum number of questions required to ensure sufficient statistical power. The analysis indicated that at least 97 questions were necessary, with a significance level (α) of 0.05, a statistical power (1−β) of 0.80, and an assumed effect size (d) of 0.5. However, to enhance the validity and comprehensiveness of the findings, a total of 100 multiple-choice questions were included. The questions were selected from the DUS and covered topics such as oral lesions, oral pathology, and systemic diseases. Including the entire question set provided a broader evaluation of the LLMs’ performance and reduced potential bias from selective sampling.
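For transparency, the sketch below illustrates how a sample-size calculation with the reported parameters (α = 0.05, power = 0.80, effect size = 0.5) can be carried out in Python with statsmodels. The test family used for the original calculation is not stated in the text, so both variants shown here are assumptions, and the exact minimum (97 questions in this study) depends on that choice and on the software used (e.g., G*Power).

```python
# Sketch of a sample-size (power) calculation with the parameters reported above
# (alpha = 0.05, power = 0.80, effect size = 0.5). The test family is an
# assumption; the minimum of 97 questions reported in the text depends on the
# family and software actually used.
from statsmodels.stats.power import GofChisquarePower, TTestIndPower

alpha, power, effect_size = 0.05, 0.80, 0.5

# Variant 1: chi-square goodness-of-fit style calculation, matching the
# chi-square tests used in the main analysis (n_bins = 8 models).
n_chi2 = GofChisquarePower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, n_bins=8
)

# Variant 2: two-sample t-test style calculation, matching the "d" label
# attached to the effect size in the text.
n_ttest = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
)

print(f"Chi-square family: about {n_chi2:.0f} observations")
print(f"t-test family:     about {n_ttest:.0f} per group")
```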
Inclusion and exclusion criteria for questions
This study utilized a total of 100 multiple-choice questions selected from the DUS: 16 on Odontogenic and Developmental Jaw Cysts, 17 on Odontogenic Tumors, 30 on Mucosal and Tongue Diseases, 20 on Salivary Gland and Bone Disease Pathology, and 17 on Infectious Diseases and Environmental Injuries Affecting the Mouth and Perioral Tissues. Each question had five answer choices, and all questions are available on the ÖSYM website. The questions were categorized by two experts (specialist physicians in the relevant fields) as either “case-based” (clinical scenarios assessing decision-making skills) or “knowledge-based” (directly testing theoretical information). To ensure agreement, the researchers (BEY and BNGY) conducted independent evaluations, and any discrepancies were resolved through consultation with a third researcher (FO).
Operation of large language models (LLMs)
On February 5, 2025, new accounts were created for each AI program evaluated in this study (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, Deepseek). Cookies and browsing history were cleared before the queries were made. All questions were entered into the respective LLMs by a single researcher (BEY) on the same day the accounts were created. Re-entry of a question was planned in case a program froze or was delayed in answering; however, no such failure occurred with any program. The models were tested in their latest publicly available versions and evaluated with default configurations, without any parameter adjustments or additional prompts. The multiple-choice questions were entered into the systems without modification, and all models received the same set of questions in the same format, wording, and order used in the dental specialization examination. The responses were meticulously documented by one researcher (BEY), and the other two researchers (BNGY, FO) took turns checking the accuracy of the recorded answers, ensuring accurate documentation of the LLMs’ responses. Each model received the questions only once, to prevent biases from repeated exposure or learning effects.
Performance evaluation method
On February 5, 2025, all questions were presented to the LLMs on the same day. The responses were classified as “correct” or “incorrect” based on the official answer keys. Each model’s accuracy rate was analyzed, and performance was further evaluated separately for case-based and knowledge-based questions, as well as across the different subject areas. To prevent learning biases, no follow-up questions or feedback were provided after the LLMs’ responses.
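As an illustration of this scoring scheme, the sketch below marks each recorded response against the official answer key and summarizes accuracy overall, by question type, and by topic. The column names and example rows are hypothetical placeholders, not the study’s actual dataset.

```python
# Illustrative scoring sketch: each recorded model response is marked correct or
# incorrect against the official answer key, then accuracy is summarized overall,
# by question type, and by topic. Field names and rows are hypothetical examples.
import pandas as pd

records = pd.DataFrame([
    {"qid": 1, "topic": "Odontogenic Tumors", "qtype": "knowledge-based",
     "model": "ChatGPT o1", "answer": "C", "key": "C"},
    {"qid": 2, "topic": "Mucosal and Tongue Diseases", "qtype": "case-based",
     "model": "Copilot", "answer": "A", "key": "D"},
])

# A response counts as correct only if it matches the official answer key.
records["correct"] = records["answer"] == records["key"]

overall = records.groupby("model")["correct"].mean()            # accuracy per model
by_type = records.groupby(["model", "qtype"])["correct"].mean() # case- vs knowledge-based
by_topic = records.groupby(["model", "topic"])["correct"].mean()

print(overall, by_type, by_topic, sep="\n\n")
```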
Statistical analysis
Descriptive statistics (frequencies) were reported in this study. The Pearson Chi-square test was used to assess the relationship between categorical variables. Following significant findings from the Chi-square analysis, pairwise comparisons among 8 large language models (Gemini 1.5, Gemini 2, ChatGPT 4o, ChatGPT 4, ChatGPT o1, Copilot, Claude 3.5, Deepseek), evaluated through their responses to 100 oral pathology DUS questions, were conducted using a post-hoc Chi-square analysis with Bonferroni correction. This post-hoc analysis was carried out separately for all questions, case-based questions, knowledge-based questions, and the subgroup of mucosal and tongue diseases. The significance level for the Chi-square tests was set at p < 0.05, and after Bonferroni correction for post-hoc comparisons, the significance level was adjusted to p < 0.0031.
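A minimal sketch of this chi-square workflow is shown below. The correct/incorrect counts for five models are taken from the Results section; the remaining three models, whose exact counts appear only in Table 1 (not reproduced here), are filled with clearly labeled placeholder values. The Bonferroni-adjusted threshold of p < 0.0031 is taken directly from the text rather than re-derived.

```python
# Sketch of the chi-square analysis described above: an overall 8 x 2 test
# (models x correct/incorrect), followed by pairwise 2 x 2 post-hoc comparisons
# judged against the Bonferroni-adjusted threshold reported in the paper.
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

# Correct / incorrect counts out of 100 questions. Values for ChatGPT 4o,
# ChatGPT 4, and Gemini 1.5 are placeholders, not the study's actual figures.
counts = {
    "ChatGPT o1": (96, 4), "Claude 3.5": (84, 16), "Gemini 2": (82, 18),
    "Deepseek": (82, 18), "Copilot": (61, 39),
    "ChatGPT 4o": (75, 25), "ChatGPT 4": (70, 30), "Gemini 1.5": (72, 28),  # placeholders
}

table = np.array(list(counts.values()))
chi2, p, dof, _ = chi2_contingency(table)
print(f"Overall: chi2={chi2:.2f}, df={dof}, p={p:.4g}")

BONFERRONI_P = 0.0031  # adjusted significance level as reported in the text
for (m1, c1), (m2, c2) in combinations(counts.items(), 2):
    _, p_pair, _, _ = chi2_contingency(np.array([c1, c2]))
    if p_pair < BONFERRONI_P:
        print(f"{m1} vs {m2}: p={p_pair:.4g} (significant after correction)")
```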
Results
In this study, the performance of AI-based LLMs on DUS oral pathology questions was examined. Based on the overall performance evaluation presented in Table 1, a statistically significant difference between the models was observed (p < 0.001). The model with the highest number of correct answers was ChatGPT o1 (96 correct, 4 incorrect), followed by Claude (84 correct, 16 incorrect), and then Gemini 2 and Deepseek (both with 82 correct and 18 incorrect). Copilot, with 61 correct and 39 incorrect answers, was the lowest-performing model. The distribution of correct answers for the models is presented in Fig. 1. Considering all questions, Copilot showed significantly different performance compared to Gemini 1.5, Gemini 2, Deepseek, Claude, ChatGPT 4o, and ChatGPT o1 models (p < 0.0031) (Table 2). Additionally, significant differences were identified between ChatGPT o1 and Gemini 1.5, Gemini 2, Deepseek, Claude, ChatGPT 4o, and ChatGPT 4 (p < 0.0031). Detailed comparisons of these analyses are presented in Table 2.
In Table 3, the performance of LLMs was examined in greater detail according to question type (case-based and knowledge-based). In case-based questions, a statistically significant difference was found among the models (p = 0.034). In this category, the model with the highest number of correct answers was ChatGPT o1 with 27 correct answers (p = 0.330), while Claude demonstrated a similar performance with 26 correct answers. For case-based questions, significant differences were found between the Claude model and the ChatGPT 4 model, and also between ChatGPT 4 and ChatGPT o1 models (p < 0.0031). Detailed comparisons of these analyses are presented in Table 4. In knowledge-based questions, a statistically significant difference among the models was also observed (p < 0.001), and once again, ChatGPT o1 achieved the highest accuracy with 69 correct and 2 incorrect answers (p = 0.330). Claude also showed high success in knowledge-based questions by providing 58 correct and 13 incorrect answers; the difference between these two question types was found to be statistically significant (p < 0.001). In knowledge-based questions, Copilot exhibited significant differences compared to Gemini 1.5, Gemini 2, Deepseek, Claude, ChatGPT 4o, and ChatGPT o1 models (p < 0.0031). Additionally, ChatGPT o1 showed significant differences compared to Gemini 1.5, Gemini 2, Copilot, Claude, ChatGPT 4o, and ChatGPT 4 models (p < 0.0031). Detailed comparisons of these analyses are presented in Table 5.
DUS oral pathology questions were divided into five topics (Table 6): Odontogenic and Developmental Jaw Cysts; Odontogenic Tumors; Mucosal and Tongue Diseases; Salivary Gland and Bone Disease Pathology; and Infectious Diseases and Environmental Injuries Affecting the Oral and Perioral Tissues. The distribution of the LLMs’ correct answer rates across these five topics is presented in Table 6. A statistically significant difference among the models was detected only in the “Mucosal and Tongue Diseases” category (p < 0.001). In this category, the best-performing model was ChatGPT o1, which answered all questions correctly, while the worst performance was observed for Copilot (19 correct, 11 incorrect). In this subgroup, Copilot differed significantly only from the ChatGPT o1 and Deepseek models (p < 0.0031), and ChatGPT o1 differed significantly from the Copilot and ChatGPT 4 models (p < 0.0031). Detailed comparisons of these analyses are presented in Table 7.
Discussion
ChatGPT, Gemini, Claude, Deepseek, and Copilot are AI-powered LLMs that are also used for content production [14]. Applications of LLMs in the medical field are becoming increasingly popular [15]; however, using these applications without understanding their reliability across different topics may reduce their effectiveness. In this study, the performance of AI-based LLMs in addressing DUS oral pathology questions was evaluated in detail. The results obtained are consistent with similar studies in various fields, which demonstrate that, among LLMs, ChatGPT achieves high accuracy rates in answering medical questions [16, 17].
Based on the overall performance evaluation (Table 1), ChatGPT o1 exhibited significantly higher success than the other models. This finding suggests that ChatGPT o1 may possess a more comprehensive knowledge base and greater text-understanding capacity owing to the extensive dataset it was trained on and its advanced model architecture. On the other hand, the similar average success rates of models such as Claude, Gemini 2, and Deepseek indicate that these platforms are also continuously enhancing their language processing capabilities. According to the current study’s results, ChatGPT o1 is followed by Claude, Deepseek, and Gemini 2, with Copilot being the least successful. In Taşsöker’s study, in which radiology questions were administered, ChatGPT 4o was identified as the most successful application, whereas Copilot was the least successful [6]. In the study by Kim et al., dental undergraduate exam questions were posed to Claude, ChatGPT 3, and ChatGPT 4, and Claude was found to be more successful than ChatGPT 4 and ChatGPT 3 [18]. Similarly, a study by Gilson et al. noted that ChatGPT achieved the expected level of success in various medical examinations [16]. In another study by Avsar and Ertan, involving prosthetic dental treatment specialty questions, ChatGPT 3.5 and Gemini were compared; although Gemini provided more correct answers, no statistically significant difference was found between them [19]. When the current study’s results are examined alongside the literature, it appears that different AI applications exhibit varying success rates depending on the specialty. These findings therefore suggest that an AI program may not be equally useful as an aid across all fields.
In the subgroup analyses of the study (Table 3), the statistically significant differences in performance between case-based and knowledge-based question types (p < 0.001) indicate that the LLMs’ ability to generate responses varies according to the content type of the question. In particular, the superior performance of ChatGPT o1 and Claude on case-based questions suggests that these models may possess more advanced skills in interpreting clinical scenarios and reasoning. In a study conducted by Buhr et al., in which case-based questions were posed to LLMs, ENT physicians validated 98.4% of ChatGPT’s responses [20]. In contrast, for knowledge-based questions, the high accuracy achieved by ChatGPT o1 and Deepseek implies that these models have robust access to their existing knowledge bases. In the study by Deveci et al., it was reported that ChatGPT 4 performed more successfully than Gemini and Copilot [21]. This situation, as noted by Kung et al., demonstrates that the extensive, database-driven training of LLMs can provide a significant advantage when addressing questions based on theoretical knowledge [17].
In the current study, the least successful application was Copilot. In a study conducted by Nguyen et al., six LLMs were given multiple-choice questions in the field of dentistry, and the most successful models were, in order, Copilot, Claude, and ChatGPT [22]. Copilot’s significantly lower performance in our study compared with the other models may be due to insufficient sample data or contextual information in a specialized field such as pathology. Moreover, Copilot’s inconsistent performance across case-based and knowledge-based question types suggests that, because the model was primarily developed as a code-focused assistant tool, it may have only limited capabilities in handling medical and dental terminology.
However, some limitations of this study should be considered. A significant limitation is that all questions were asked in Turkish; inaccuracies arising from language translation within the models may therefore have contributed to misunderstandings or incorrect responses. The limited number of questions may reduce generalizability and necessitates additional validation studies on different question types (e.g., image-based questions, multi-step clinical cases). Another limitation is that only DUS questions prepared according to the dental curriculum in Türkiye were used; LLMs might perform differently when assessed with questions derived from other curricula. Furthermore, differences in the datasets used by the models to generate responses and in the frequency of their updates may also have affected the results. Future studies should include a wider variety of question types, such as image-based questions, multi-step clinical scenarios, case analyses, and open-ended exam questions. Additionally, comprehensive studies involving a larger set of questions will enhance the reliability of the findings.
Conclusion
In conclusion, this study determined that ChatGPT o1 achieved the highest success in the context of DUS oral pathology questions, with Claude ranking among the top performers in case-based questions and Deepseek excelling in knowledge-based questions. The findings indicate that LLMs can serve as an important auxiliary tool in dental education and exam preparation. However, it is recommended that the levels of reliability, currency, and clinical relevance of the models be evaluated more comprehensively in future studies. In this context, the supportive role of AI and LLMs in dental education is expected to gradually increase, and the performance of these tools is anticipated to further improve as technology continues to evolve.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Alowais SA, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689.
Ding H, et al. Artificial intelligence in dentistry—A review. Front Dent Med. 2023;4:1085251.
Bindra S, Jain R. Artificial intelligence in medical science: a review. Ir J Med Sci (1971-). 2024;193(3):1419–29.
Dave M, Patel N. Artificial intelligence in healthcare and education. Br Dent J. 2023;234(10):761–4.
Araújo ALD, et al. Machine learning concepts applied to oral pathology and oral medicine: a convolutional neural networks’ approach. J Oral Pathol Med. 2023;52(2):109–18.
Tassoker M. ChatGPT-4 Omni’s superiority in answering multiple-choice oral radiology questions. BMC Oral Health. 2025;25(1):173.
Taymour N, et al. Performance of the ChatGPT-3.5, ChatGPT-4, and Google Gemini large language models in responding to dental implantology inquiries. J Prosthet Dent. 2025.
Tosun B, Yilmaz ZS. Comparison of artificial intelligence systems in answering prosthodontics questions from the dental specialty exam in Turkey. J Dent Sci. 2025.
Aziz AAA, Abdelrahman HH, Hassan MG. The use of ChatGPT and Google Gemini in responding to orthognathic surgery-related questions: a comparative study. J World Fed Orthod. 2024.
Abdul NS, et al. Applications of artificial intelligence in the field of oral and maxillofacial pathology: a systematic review and meta-analysis. BMC Oral Health. 2024;24(1):122.
Harrington C, Robinson F, Mallery SR. Clinical teaching in dentistry: evaluating a clinical oral pathology rotation while looking to the future of dental education. J Dent Educ. 2023;87(7):1016–21.
Mays JW, Sarmadi M, Moutsopoulos NM. Oral manifestations of systemic autoimmune and inflammatory diseases: diagnosis and clinical management. J Evid Based Dent Pract. 2012;12(3):265–82.
Kumar P. Large language models (LLMs): survey, technical frameworks, and future challenges. Artif Intell Rev. 2024;57(10):260.
Kerimbayev N, et al. A comparative analysis of generative AI models for improving learning process in higher education. In: 2024 International Conference Automatics and Informatics (ICAI). IEEE; 2024.
Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology. 2024;310(1):e232756.
Gilson A, et al. How does ChatGPT perform on the United States medical licensing examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
Kung TH, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
Kim W, Kim BC, Yeom H-G. Performance of large language models on the Korean dental licensing examination: a comparative study. Int Dent J. 2025;75(1):176–84.
Bilgin Avşar D, Ertan AA. A comparative study of ChatGPT-3.5 and Gemini’s performance of answering the prosthetic dentistry questions in dentistry specialty exam: cross-sectional study. Turkiye Klinikleri J Dent Sci. 2024;30(4).
Buhr CR, et al. ChatGPT versus consultants: blinded evaluation on answering otorhinolaryngology case–based questions. JMIR Med Educ. 2023;9(1):e49183.
Deveci BA, et al. The performance of AI on the dentistry specialization exam. Int Dent J. 2024;74:S114–5.
Nguyen HC, et al. Accuracy of latest large language models in answering multiple choice questions in dentistry: a comparative study. PLoS ONE. 2025;20(1):e0317423.
Acknowledgements
None.
Funding
None.
Author information
Authors and Affiliations
Contributions
Study conceptualisation: BEY, BNGY, FO. Study protocol and design: BEY, BNGY, FO. Data collection: BEY, BNGY, FO. Data analysis and interpretation: BEY, BNGY, FO. Writing of the original manuscript: BEY. Reviewing and editing of the manuscript: BEY, BNGY, FO. Reading and approval of the manuscript: all authors.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable. Openly available public data were used in this study.
Consent for publication
Not applicable since there was no direct human contact.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yilmaz, B.E., Gokkurt Yilmaz, B.N. & Ozbey, F. Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis. BMC Oral Health 25, 573 (2025). https://doi.org/10.1186/s12903-025-05926-2
DOI: https://doi.org/10.1186/s12903-025-05926-2