
Table 2 Pairwise comparisons of large language models for all questions

From: Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis

 

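|            | Gemini 1.5 | Gemini 2 | Copilot | Deepseek | Claude | ChatGPT 4o | ChatGPT 4 | ChatGPT o1 |
|------------|------------|----------|---------|----------|--------|------------|-----------|------------|
| Gemini 1.5 | -          | 0.856    | 0.002   | 0.856    | 0.577  | 0.724      | 0.050     | 0.001      |
| Gemini 2   | 0.856      | -        | 0.000   | 0.573    | 0.425  | 0.592      | 0.033     | 0.002      |
| Copilot    | 0.002      | 0.000    | -       | 0.001    | 0.000  | 0.004      | 0.150     | 0.000      |
| Deepseek   | 0.856      | 0.573    | 0.001   | -        | 0.707  | 0.592      | 0.024     | 0.001      |
| Claude     | 0.577      | 0.425    | 0.000   | 0.707    | -      | 0.363      | 0.012     | 0.005      |
| ChatGPT 4o | 0.724      | 0.592    | 0.004   | 0.592    | 0.363  | -          | 0.107     | 0.000      |
| ChatGPT 4  | 0.050      | 0.033    | 0.150   | 0.024    | 0.012  | 0.107      | -         | 0.000      |
| ChatGPT o1 | 0.001      | 0.002    | 0.000   | 0.001    | 0.005  | 0.000      | 0.000     | -          |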

*p < 0.0031 (Bonferroni correction)
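
As a reading aid, the sketch below applies the footnote's threshold to the values in Table 2 and lists the model pairs that fall below it. The threshold 0.0031 is taken from the footnote as given, since the table does not state the significance level or the number of comparisons it was derived from. The names `MODELS`, `UPPER`, and `significant_pairs` are illustrative, not from the article. Values printed as 0.000 are rounded and are treated here as below the threshold.

```python
# Illustrative sketch: flag the pairwise comparisons in Table 2 that fall
# below the Bonferroni-corrected threshold stated in the footnote.

MODELS = [
    "Gemini 1.5", "Gemini 2", "Copilot", "Deepseek",
    "Claude", "ChatGPT 4o", "ChatGPT 4", "ChatGPT o1",
]

THRESHOLD = 0.0031  # taken from the table footnote, not recomputed

# Upper triangle of Table 2, row by row (the matrix is symmetric).
# Row i holds the values for MODELS[i] against every later model.
UPPER = [
    [0.856, 0.002, 0.856, 0.577, 0.724, 0.050, 0.001],  # Gemini 1.5
    [0.000, 0.573, 0.425, 0.592, 0.033, 0.002],         # Gemini 2
    [0.001, 0.000, 0.004, 0.150, 0.000],                # Copilot
    [0.707, 0.592, 0.024, 0.001],                       # Deepseek
    [0.363, 0.012, 0.005],                              # Claude
    [0.107, 0.000],                                     # ChatGPT 4o
    [0.000],                                            # ChatGPT 4
]


def significant_pairs(models, upper, threshold):
    """Return (model_a, model_b, p) for every pair with p below threshold."""
    pairs = []
    for i, row in enumerate(upper):
        for offset, p in enumerate(row, start=1):
            if p < threshold:
                pairs.append((models[i], models[i + offset], p))
    return pairs


significant = significant_pairs(MODELS, UPPER, THRESHOLD)
total = sum(len(row) for row in UPPER)

for a, b, p in significant:
    print(f"{a} vs {b}: {p:.3f}")
print(f"{len(significant)} of {total} pairs fall below {THRESHOLD}")
```

With the values as printed in the table, 10 of the 28 pairs fall below the threshold. Every comparison involving ChatGPT o1 does so except the one with Claude (0.005), and Copilot's comparisons with ChatGPT 4o (0.004) and ChatGPT 4 (0.150) do not.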