
OpenAI launched the o3 series of artificial intelligence (AI) models, focused on reasoning, last month. During the livestream, the company shared benchmark scores of the model based on internal tests. While all of the shared scores are impressive and highlight the successor's improvements over o1, one benchmark score stands out. On the ARC-AGI benchmark, the large language model (LLM) scored 85 percent, roughly 30 percentage points above the previous best result. Interestingly, this score is also comparable to the average score humans achieve on the test.
OpenAI o3 scores 85 percent on ARC-AGI benchmark
But just because o3 scores so high on the test, does that mean its intelligence is equal to that of an average person? The question would be easier to answer if the AI model were released to the public so that we could test it ourselves. Since OpenAI has not disclosed anything about the model's architecture, training techniques, or datasets, it is difficult to make any definitive claims.
We do have some knowledge of how AI companies build reasoning-focused models, which can help set expectations for OpenAI's upcoming LLM. First, so far the o-series models have not undergone major overhauls in architecture or framework; instead, they have been fine-tuned to demonstrate enhanced capabilities.
For example, with the o1 series of AI models, developers used a technique called test-time compute. It gives the AI model additional processing time to spend on a problem, along with workspace to test hypotheses and correct any errors. Similarly, the GPT-4o model is just a fine-tuned version of GPT-4.
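OpenAI has not explained how o1 or o3 actually spend that extra inference time, but one widely discussed form of test-time compute is simply sampling several candidate answers and keeping the most common one (self-consistency voting). The sketch below is a minimal, hypothetical illustration of that idea; generate_answer is a placeholder for a stochastic model call, not a real OpenAI API.

```python
import random
from collections import Counter

def generate_answer(question: str, seed: int) -> str:
    """Placeholder for one stochastic model completion; a real system would call an LLM here."""
    random.seed(seed)
    return random.choice(["42", "42", "41"])  # dummy outputs for illustration

def answer_with_test_time_compute(question: str, num_samples: int = 8) -> str:
    # More test-time compute = more candidate answers drawn before committing to one.
    candidates = [generate_answer(question, seed=i) for i in range(num_samples)]
    best, _count = Counter(candidates).most_common(1)[0]
    return best

print(answer_with_test_time_compute("What is 6 * 7?"))
```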
Given the rumours, it is unlikely that the company has made major architectural changes with the o3 model, especially since it may also launch the GPT-5 AI model later this year.
Coming to the ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) benchmark, it consists of a series of grid-based pattern-recognition problems that require reasoning and spatial understanding to solve. An AI model can be made to do well on such problems with large volumes of high-quality training data focused on reasoning and logic-based capabilities.
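To make the format concrete, here is a toy, invented example of what an ARC-style task looks like: small integer grids in which each number stands for a colour, with the solver expected to infer the transformation from a worked example and apply it to a new input. It is not an actual task from the corpus.

```python
# Invented ARC-style puzzle for illustration only.
train_example = {
    "input":  [[0, 1, 0],
               [0, 1, 0],
               [0, 1, 0]],
    "output": [[0, 2, 0],
               [0, 2, 0],
               [0, 2, 0]],   # rule to be inferred here: recolour 1 -> 2
}

test_input = [[1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]]

def apply_inferred_rule(grid):
    # A human (or model) infers the rule from the example and applies it to new input.
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

print(apply_inferred_rule(test_input))  # [[2, 0, 2], [2, 0, 2], [2, 0, 2]]
```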
However, if it were that simple, older AI models would also score high on the test. It is worth noting that the previous highest score was 55 percent, while o3 scored 85 percent. This highlights that the developers have added new and improved techniques and algorithms to enhance the model's reasoning capabilities. Unless OpenAI officially reveals the technical details, their full scope cannot be explained.
That being said, it is unlikely that the o3 AI model has reached AGI or human-level intelligence. First, if that were the case, it would mark the end of the company's partnership with Microsoft, which is set to end once an OpenAI model attains AGI status. Second, many AI experts, including Geoffrey Hinton, the 'godfather of AI', have repeatedly stressed that we are still many years away from AGI.
Finally, AGI is such a huge achievement that if OpenAI did reach this milestone, it would say so plainly instead of sharing subtle hints about it. The most likely possibility here is that OpenAI has found a way to improve the o3 model's pattern-based reasoning capabilities (by adding enough sample data or by adjusting the training method), which is also highlighted in the PTI report.
However, this improvement may be fairly isolated and does not necessarily indicate an improvement in the model's overall level of intelligence.