
Table 4 Pairwise comparisons of large language models for case-based questions

From: Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

|            | Gemini 1.5 | Gemini 2 | Copilot | Deepseek | Claude | ChatGPT 4o | ChatGPT 4 | ChatGPT o1 |
|------------|------------|----------|---------|----------|--------|------------|-----------|------------|
| Gemini 1.5 | -          | 0.525    | 0.315   | 0.195    | 0.687  | 0.315      | 0.019     | 0.389      |
| Gemini 2   | 0.525      | -        | 0.517   | 0.345    | 0.446  | 0.517      | 0.043     | 0.227      |
| Copilot    | 0.315      | 0.517    | -       | 0.765    | 0.164  | 1.000      | 0.162     | 0.070      |
| Deepseek   | 0.195      | 0.345    | 0.765   | -        | 0.094  | 0.764      | 0.269     | 0.037      |
| Claude     | 0.687      | 0.446    | 0.164   | 0.094    | -      | 0.164      | 0.002     | 0.640      |
| ChatGPT 4o | 0.315      | 0.517    | 1.000   | 0.764    | 0.164  | -          | 0.162     | 0.070      |
| ChatGPT 4  | 0.019      | 0.043    | 0.162   | 0.269    | 0.002  | 0.162      | -         | 0.002      |
| ChatGPT o1 | 0.389      | 0.227    | 0.070   | 0.037    | 0.640  | 0.070      | 0.002     | -          |

*p < 0.0031 (Bonferroni correction)
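As a quick check, the pairwise p-values above can be screened against the Bonferroni-corrected significance threshold given in the footnote (p < 0.0031). The sketch below is illustrative only: the matrix is transcribed from Table 4, and the variable names (`models`, `p`, `ALPHA`) are our own, not from the original study.

```python
# Flag significant pairwise comparisons at the Bonferroni-corrected
# threshold reported in the table footnote (p < 0.0031).
models = ["Gemini 1.5", "Gemini 2", "Copilot", "Deepseek",
          "Claude", "ChatGPT 4o", "ChatGPT 4", "ChatGPT o1"]

# Symmetric p-value matrix transcribed from Table 4 (None on the diagonal).
p = [
    [None,  0.525, 0.315, 0.195, 0.687, 0.315, 0.019, 0.389],
    [0.525, None,  0.517, 0.345, 0.446, 0.517, 0.043, 0.227],
    [0.315, 0.517, None,  0.765, 0.164, 1.000, 0.162, 0.070],
    [0.195, 0.345, 0.765, None,  0.094, 0.764, 0.269, 0.037],
    [0.687, 0.446, 0.164, 0.094, None,  0.164, 0.002, 0.640],
    [0.315, 0.517, 1.000, 0.764, 0.164, None,  0.162, 0.070],
    [0.019, 0.043, 0.162, 0.269, 0.002, 0.162, None,  0.002],
    [0.389, 0.227, 0.070, 0.037, 0.640, 0.070, 0.002, None],
]

ALPHA = 0.0031  # Bonferroni-corrected significance level from the footnote

# The matrix is symmetric, so keep each unordered pair once (upper triangle).
significant = [(models[i], models[j], p[i][j])
               for i in range(len(models))
               for j in range(i + 1, len(models))
               if p[i][j] < ALPHA]

for a, b, val in significant:
    print(f"{a} vs {b}: p = {val}")
```

At this threshold only the ChatGPT 4 comparisons with Claude and with ChatGPT o1 (both p = 0.002) fall below 0.0031; nominally small values such as 0.019 and 0.037 do not survive the correction.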