AI developers are turning to increasingly inventive ways of measuring what generative AI models can actually do, as traditional benchmarking approaches prove insufficient. For one group of developers, the answer lies in Minecraft, the Microsoft-owned sandbox-building game.
Minecraft Benchmark (MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges, where each model must respond to a prompt with a Minecraft creation. Users vote on which model did the better job, and only after voting can they see which AI produced which build.
The Foundation of MC-Bench
Adi Singh, the 12th grader who founded MC-Bench, says the value of Minecraft lies less in the game itself than in people's familiarity with it. Even those who have never played can judge which blocky rendition of a pineapple looks more convincing. He said:
“Playing Minecraft makes it much easier for people to observe the progress in AI research. People are used to Minecraft's appearance and atmosphere.”
MC-Bench currently lists eight volunteer contributors. According to the project's website, Anthropic, Google, OpenAI, and Alibaba have subsidized the project's use of their models to run the comparison prompts, but the companies are not otherwise affiliated with it.
Can Games Reveal AI’s Limits?
Adi Singh stated:
“At the moment, we are only performing basic projects to show how far we have come since the GPT-3 era, but [we] may eventually scale to these more detailed and goal-oriented jobs. I think games are a better way to test agent-based reasoning because they are safer than real life and easier to control for testing.”
Other games, including Street Fighter, Pictionary, and Pokémon Red, have also been pressed into service as AI benchmarks, partly because benchmarking AI is a notoriously hard problem.
AI's Strengths and Weaknesses in Testing
Researchers typically test AI models with standardized evaluations, but many of these tests give AI a built-in advantage. Because of how models are trained, they are naturally strong at certain narrow kinds of problem-solving, particularly those that reward rote memorization or simple extrapolation.
This is how OpenAI's GPT-4 can score in the 88th percentile on the LSAT yet fail to count the number of Rs in the word "strawberry," and how Anthropic's Claude 3.7 Sonnet can score 62.3% on a standard software engineering benchmark while playing Pokémon worse than most five-year-olds.
Testing AI with Visual Builds
Technically speaking, MC-Bench is a programming benchmark: the models must write code to generate the requested build. But most MC-Bench users find it easier to judge which build looks better than to dig into the code, which broadens the project's appeal and may help it gather more data on which models consistently perform best.
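To illustrate the general idea (this is a hypothetical sketch, not MC-Bench's actual interface), the code a model writes for a prompt like "build a pineapple" might look roughly like the following Python, where place_block is an assumed helper that records a block type at a coordinate:

    # Hypothetical sketch only; MC-Bench's real harness and API may differ.
    blocks = {}

    def place_block(x, y, z, block_type):
        """Record a block of the given type at an (x, y, z) coordinate."""
        blocks[(x, y, z)] = block_type

    # Blocky "pineapple": a 3x3 yellow body topped with a green crown.
    for y in range(4):
        for x in range(3):
            for z in range(3):
                place_block(x, y, z, "yellow_wool")

    for y in range(4, 6):
        place_block(1, y, 1, "green_wool")

    print(f"{len(blocks)} blocks placed")  # 38 blocks

Voters then compare the rendered results of two such scripts side by side, without knowing which model produced which.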
Naturally, there is disagreement over whether such scores say anything meaningful about an AI's real-world usefulness, but Singh considers them a strong signal. He stated:
"Unlike many pure text benchmarks, the current scoreboard reflects quite closely to my own experience of using these models. Companies may find [MC-Bench] helpful in determining whether they are on the correct track.”