
A group of researchers used NPR puzzlemaster Will Shortz’s weekly puzzle segment to test the reasoning abilities of AI models.
Experts from several U.S. colleges and universities, working with the startup Cursor, built a general-purpose AI benchmark from puzzles featured in Sunday Puzzle episodes. The study surfaced some intriguing behavior, including models that sometimes “give up” and knowingly return answers they have already identified as wrong.
Sunday Puzzle is a weekly radio quiz featuring riddles built on logic and wordplay. No specialized knowledge is required to solve them, only critical thinking and reasoning skills.
Co-author Arjun Guha explained to TechCrunch that the puzzles make effective tests because they don’t rely on obscure knowledge, and their structure makes it difficult for AI models to fall back on “rote memorization.”
“These puzzles are hard because it’s tough to make meaningful progress until you actually solve them—only then does everything click. They require a mix of intuition and elimination,” he said.
However, Guha noted the method’s limitations: Sunday Puzzle is designed for English speakers, and its questions are publicly available, so models may have encountered them in training data and could effectively “cheat.” The researchers plan to keep expanding the benchmark, which currently consists of about 600 puzzles.
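For readers curious what scoring a model against such a benchmark might look like, here is a minimal sketch. The `ask_model` function, the `normalize` helper, and the sample puzzle entry are hypothetical stand-ins for illustration; they are not the researchers’ actual evaluation harness, which the article does not describe in detail.

```python
# Minimal sketch of scoring a model on a question/answer puzzle benchmark.
# `ask_model` and the puzzle entries below are hypothetical placeholders.

def ask_model(question: str) -> str:
    """Placeholder: send the puzzle to a chat model and return its answer."""
    raise NotImplementedError

def normalize(text: str) -> str:
    """Lowercase and strip non-alphanumeric characters so answers compare fairly."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def score(puzzles: list[dict]) -> float:
    """Return the fraction of puzzles the model answers correctly."""
    correct = 0
    for puzzle in puzzles:
        answer = ask_model(puzzle["question"])
        if normalize(answer) == normalize(puzzle["answer"]):
            correct += 1
    return correct / len(puzzles)

# A benchmark of roughly 600 such entries would be scored the same way.
sample = [{"question": "Example riddle text ...", "answer": "example answer"}]
```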
In testing, o1 and DeepSeek’s R1 significantly outperformed the other models at reasoning. These reasoning models carefully verify their answers before responding, which helps them avoid some errors but also takes considerably longer.

However, overall accuracy remained below 60%, and some models refused to solve the puzzles at all. When DeepSeek’s R1 couldn’t find the right answer, it wrote “I give up” before producing an incorrect answer that appeared to be chosen at random.
Other models repeatedly attempted to correct previous mistakes but still failed. Some got “stuck in thought loops,” generating gibberish, while others arrived at correct answers but later rejected them.

“In difficult cases, DeepSeek’s R1 literally says it’s ‘frustrated.’ It’s amusing to see a model mimic what a human might say. We still need to explore how ‘frustration’ in reasoning affects AI output quality,” Guha noted.
Previously, researchers tested seven popular chatbots in a chess tournament; none of them managed to play the game properly.