Researchers Used NPR Sunday Puzzle Questions To Benchmark AI ‘Reasoning’ Models

The Sunday Puzzle is a long-running segment on National Public Radio in which Will Shortz, The New York Times' crossword puzzle guru, quizzes thousands of listeners every Sunday. The brainteasers are usually challenging even for skilled contestants.

Still, the puzzles are designed to be solvable without specialized prior knowledge. For that reason, some researchers argue they are a promising way to test the limits of AI's problem-solving abilities.

AI Testing with Sunday Puzzles

In a recent study, researchers from Oberlin College, Wellesley College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor developed an AI benchmark built from Sunday Puzzle riddles. According to the team, the test surfaced surprising findings, including that reasoning models such as OpenAI's o1 sometimes "give up" and provide answers they know are incorrect.

Arjun Guha, a Northeastern computer science professor and one of the study's co-authors, said, "We wanted to develop a benchmark with problems that humans can comprehend with only general knowledge."

The AI sector has recently faced a benchmarking conundrum. Most tests commonly used to assess AI models probe for skills that are irrelevant to the typical user, such as competence on PhD-level math and science problems. Meanwhile, many benchmarks, even recently released ones, are quickly approaching saturation.

AI Challenges in Logical Thinking

According to Guha, the Sunday Puzzle and other public radio quiz games have two advantages: they do not test for esoteric knowledge, and their challenges are phrased so that models cannot rely on "rote memory" to answer them.

"I think what makes these problems difficult is that until you overcome a problem, which is when everything clicks together all at once, it's tough to make meaningful progress on a problem," Guha claimed. "That calls for a process of removal as well as insight."

Of course, no benchmark is perfect. The Sunday Puzzle is English-only and U.S.-centric. And because the quizzes are freely available, models trained on them might be able to "cheat" in a sense, although Guha says he has seen no evidence of this.

"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."

On the researchers' benchmark, which contains about 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 score significantly higher than the rest. Reasoning models thoroughly fact-check themselves before delivering results, which helps them avoid some of the errors that commonly trip up AI models. The trade-off is that reasoning models take longer to arrive at solutions, typically seconds to minutes longer.

The models also make other strange choices, such as giving a wrong answer, immediately retracting it, trying to come up with a better one, and failing again. Some get stuck "thinking" indefinitely and offer nonsensical justifications for their answers, while others arrive at the correct solution right away but then go on to weigh alternatives for no obvious reason.

"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha remarked. It was amusing to see a model mimic what a human might say. However, how "frustration" during reasoning affects the quality of model outputs remains an open question.