Researchers conducted a three-way Turing test for four AI systems: ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5. GPT-4.5 scored the highest.

In a paper published on March 31, 2025, Cameron Jones and Benjamin Bergen of the Department of Cognitive Science at the University of California, San Diego, reported the results of the experiment.

They used the original three-way version of the test: participants held five-minute conversations with another person and with one of the AI systems at the same time, then decided which of the two interlocutors they thought was human. This version is more demanding than the one in which people communicate only with the machine.

In 73% of cases, participants thought GPT-4.5 was human. Other AIs scored lower:

LLaMa-3.1 — 56%;
ELIZA — 23%;
GPT-4o — 21%.

“These findings represent the first empirical evidence that an artificial system can pass the standard three-way Turing test,” the researchers noted.

The Turing test is a conceptual test proposed by British mathematician Alan Turing in 1950 to determine a computer’s ability to exhibit intelligent behavior indistinguishable from that of a human.

Test details:

A person communicates in writing with two interlocutors: another human and an AI system.
If the participant cannot confidently determine which of them is the machine, the computer is considered to have passed the test.
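
As a rough illustration of the pass criterion, the sketch below compares the rates reported in this article with the level expected from random guessing. The 50% chance level, the `compare_to_chance` helper, and the labels it returns are assumptions made for this example and are not taken from the paper. It is a raw comparison only, not the researchers' statistical analysis, which would also depend on sample sizes that this article does not give.

```python
# Share of cases in which participants judged each AI system to be the human,
# as reported in this article.
judged_human_rate = {
    "GPT-4.5": 0.73,
    "LLaMa-3.1": 0.56,
    "ELIZA": 0.23,
    "GPT-4o": 0.21,
}

# Assumption: with two interlocutors, a participant guessing at random
# would pick the machine as the human half of the time.
CHANCE_LEVEL = 0.50


def compare_to_chance(rate: float, chance: float = CHANCE_LEVEL) -> str:
    """Label a rate as above, at, or below chance (hypothetical helper, no significance test)."""
    if rate > chance:
        return "above chance"
    if rate < chance:
        return "below chance"
    return "at chance"


# Build a per-system summary and print it.
summary = {name: compare_to_chance(rate) for name, rate in judged_human_rate.items()}
for name, label in summary.items():
    print(f"{name}: {judged_human_rate[name]:.0%} ({label})")
```

A rate near or above the chance level means participants could not reliably pick out the machine. A rate well below it means they usually identified it correctly.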

The Turing test has been conducted repeatedly with popular AI models. In June 2024, participants judged ChatGPT to be human 54% of the time. By comparison, ELIZA was judged human 22% of the time, GPT-3.5 50% of the time, and actual human interlocutors 67% of the time.

In a similar 2023 study by Jones, GPT-4 scored 41%, GPT-3.5 scored 14%, ELIZA scored 27%, and humans scored 63%.

In February 2025, OpenAI released GPT-4.5, a new version of its chatbot with advanced “emotional intelligence.”