
Table 5 Pairwise comparisons of large language models for knowledge-based questions

From: Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

 

| Model | Gemini 1.5 | Gemini 2 | Copilot | Deepseek | Claude | ChatGPT 4o | ChatGPT 4 | ChatGPT o1 |
|---|---|---|---|---|---|---|---|---|
| Gemini 1.5 | - | 0.673 | 0.002 | 0.271 | 0.673 | 0.835 | 0.432 | 0.001 |
| Gemini 2 | 0.673 | - | 0.001 | 0.494 | 1.000 | 0.831 | 0.228 | 0.002 |
| Copilot | 0.002 | 0.001 | - | 0.000 | 0.000 | 0.001 | 0.023 | 0.000 |
| Deepseek | 0.271 | 0.494 | 0.000 | - | 0.494 | 0.370 | 0.061 | 0.016 |
| Claude | 0.673 | 1.000 | 0.000 | 0.494 | - | 0.831 | 0.228 | 0.002 |
| ChatGPT 4o | 0.835 | 0.831 | 0.001 | 0.370 | 0.831 | - | 0.320 | 0.001 |
| ChatGPT 4 | 0.432 | 0.228 | 0.023 | 0.061 | 0.228 | 0.320 | - | 0.000 |
| ChatGPT o1 | 0.001 | 0.002 | 0.000 | 0.016 | 0.002 | 0.001 | 0.000 | - |

*p < 0.0031 (Bonferroni correction)
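The footnote's significance threshold follows from a Bonferroni correction, which divides the family-wise significance level by the number of tests performed. A minimal sketch of that calculation, assuming an alpha of 0.05 and 16 tests (these counts are assumptions, not stated on this page: the reported 0.0031 matches 0.05/16, whereas an all-pairs design among the 8 models here would give 28 tests and a threshold near 0.0018):

```python
# Bonferroni correction: divide the family-wise alpha by the number of
# hypothesis tests, so the chance of any false positive stays below alpha.
alpha = 0.05    # assumed family-wise significance level
n_tests = 16    # assumed test count; 0.05 / 16 reproduces the 0.0031 threshold

threshold = alpha / n_tests
print(round(threshold, 4))  # 0.0031
```

Under this threshold, a table entry such as Copilot vs. Gemini 1.5 (p = 0.002) would count as significant, while Deepseek vs. ChatGPT o1 (p = 0.016) would not.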