Table 3 Evaluation of large language models’ performance by question type

From: Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

| LLM | Case-Based Correct | Case-Based Incorrect | P* (Case-Based) | Knowledge-Based Correct | Knowledge-Based Incorrect | P* (Knowledge-Based) | P** |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 | 25 | 4 | 0.034 | 56 | 15 | 0.000 | 0.292 |
| Gemini 2 | 24 | 5 |  | 58 | 13 |  | 0.574 |
| Copilot | 22 | 7 |  | 39 | 32 |  | 0.041 |
| Deepseek | 21 | 8 |  | 61 | 10 |  | 0.098 |
| Claude | 26 | 3 |  | 58 | 13 |  | 0.252 |
| ChatGPT 4o | 22 | 7 |  | 57 | 14 |  | 0.404 |
| ChatGPT 4 | 17 | 12 |  | 52 | 19 |  | 0.117 |
| ChatGPT o1 | 27 | 2 |  | 69 | 2 |  | 0.330 |

*, ** Pearson chi-square test. The statistical significance level was set at P ≤ 0.05. The single P* value in each question-type column applies to all rows of that column.