- The best AI accuracy score on the world's hardest benchmark has improved by 183% in just two weeks
- ChatGPT o3-mini now scores up to 13% accuracy depending on the version
- OpenAI Deep Research blows away the competition with an accuracy score of 26.6%
The world's most difficult AI exam, Humanity's Last Exam, launched less than two weeks ago, and we've already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI Deep Research at the top of the leaderboard.
The AI benchmark, created by experts from around the world, contains some of the most difficult reasoning problems and questions known to humanity – it's so hard that when I previously wrote about Humanity's Last Exam in the article linked above, I couldn't even understand one of the questions, let alone answer it.
At the time of writing that article, global phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated on text only (not multimodal). Now, OpenAI's o3-mini, which launched earlier this week, has achieved an accuracy of 10.5% as o3-mini and 13% as o3-mini-high, which is smarter but takes longer to generate answers.
More impressive, however, is the OpenAI Deep Research AI agent's new score on the benchmark: the new tool scored 26.6%, a huge 183% increase in accuracy in less than 10 days. It should be noted that Deep Research has web-search capabilities that other AI models lack, which makes comparisons slightly unfair. The ability to search the web is useful for a test like Humanity's Last Exam, because it includes some general knowledge-based questions.
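If you're curious where that 183% figure comes from, here's a quick back-of-the-envelope check based on the two scores reported above (a minimal sketch for illustration, not anything published by OpenAI):

```python
# Rough check of the ~183% improvement claim, assuming the reported scores:
# DeepSeek R1 at 9.4% (previous leaderboard best) and OpenAI Deep Research at 26.6%.
previous_best = 9.4   # accuracy (%) of DeepSeek R1 on the text-only evaluation
new_score = 26.6      # accuracy (%) of OpenAI Deep Research

relative_increase = (new_score - previous_best) / previous_best * 100
print(f"Relative improvement: {relative_increase:.0f}%")  # prints ~183%
```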
That said, the accuracy of models taking Humanity's Last Exam is steadily improving, and it makes you wonder how long we'll have to wait to see an AI model come close to acing the benchmark. Realistically, AI shouldn't be able to get close anytime soon, but I wouldn't bet against it.
It seems the latest OpenAI model is doing very well across many subjects. I assume Deep Research is particularly helpful with subjects such as medicine, classics and law. pic.twitter.com/x8ilmq1aqs – February 3, 2025
Better, but 26.6% is still not a passing grade
OpenAI Deep Research is an incredibly impressive tool, and I was blown away by the examples OpenAI showed off when it announced the AI agent. Deep Research can work like a personal analyst, taking the time to conduct intensive research and produce reports and answers that would otherwise take humans hours and hours.
While a score of 26.6% on Humanity's Last Exam is truly impressive, especially considering how far the benchmark's leaderboard has come in a matter of weeks, it's still a low score in absolute terms – nobody would claim to have passed a test with anything less than 50% in the real world.
Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, allowing us to gauge how far they've come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do it?