Table 3 Evaluation of large language models’ performance by question type

From: Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

| LLM | Case-Based Correct | Case-Based Incorrect | P* (Case-Based) | Knowledge-Based Correct | Knowledge-Based Incorrect | P* (Knowledge-Based) | P** |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 | 25 | 4 | 0.034 | 56 | 15 | 0.000 | 0.292 |
| Gemini 2 | 24 | 5 |  | 58 | 13 |  | 0.574 |
| Copilot | 22 | 7 |  | 39 | 32 |  | 0.041 |
| Deepseek | 21 | 8 |  | 61 | 10 |  | 0.098 |
| Claude | 26 | 3 |  | 58 | 13 |  | 0.252 |
| ChatGPT 4o | 22 | 7 |  | 57 | 14 |  | 0.404 |
| ChatGPT 4 | 17 | 12 |  | 52 | 19 |  | 0.117 |
| ChatGPT o1 | 27 | 2 |  | 69 | 2 |  | 0.330 |

*, ** Pearson chi-square test. The statistical significance level was set at P ≤ 0.05. The single P* value in each question-type column applies to all rows of that column.