Be careful when AI passes this exam

If you are looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that AI systems can't pass.
For years, AI systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas such as math, science and logic. Comparing the models' scores over time served as a rough measure of AI progress.
But AI systems eventually got too good at those tests, so new, harder tests were created, often with the types of questions graduate students might encounter on their exams.
Those tests aren't in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many Ph.D.-level challenges, which has limited those tests' usefulness and raised an unsettling question: Are AI systems getting too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called "Humanity's Last Exam," that they claim is the hardest test ever administered to AI systems.
Humanity's Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and the director of the Center for AI Safety. (The test's original name, "Humanity's Last Stand," was discarded for being overly dramatic.)
Mr. Hendrycks worked with Scale AI, an AI company where he is an adviser, to compile the test, which consists of roughly 3,000 multiple-choice and short-answer questions designed to probe AI systems' abilities in areas ranging from analytic philosophy to rocket engineering.
The questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions to which they knew the answers.
Here, try your hand at a question about hummingbird anatomy from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both of these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1 - T2)/W?
(I would print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I'm far too dumb to verify the answers myself.)
The questions on Humanity's Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.
If the models couldn't answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote the highest-rated questions were paid between $500 and $5,000 per question, as well as receiving credit for contributing to the exam.
Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of them were chosen, all of which he told me were in line with what one might see on a graduate exam.
Mr. Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety adviser to Mr. Musk's AI company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.
"Elon looked at the MMLU questions and said, 'These are undergrad level. I want things that a world-class expert could do,'" Mr. Hendrycks said.
A few other tests try to measure advanced AI capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the AI researcher François Chollet.
But Humanity's Last Exam is aimed at determining how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.
"We are trying to estimate the extent to which AI can automate really difficult intellectual labor," Mr. Hendrycks said.
Once the list of questions had been compiled, the researchers gave Humanity's Last Exam to six leading AI models, including Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet. They all failed miserably. OpenAI's o1 system scored the highest of the bunch, with a score of 8.3 percent.
(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied those claims.)
Mr. Hendrycks said he expected those scores to rise quickly, possibly surpassing 50 percent by the end of the year. At that point, he said, AI systems might be considered "world-class oracles," capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure AI's impacts, such as examining economic data or judging whether it can make novel discoveries in areas like math and science.
"You can imagine a better version of this where we can give questions that we don't know the answers to yet, and we're able to verify whether the models are able to help solve them for us," said Summer Yue, Scale AI's director of research and an organizer of the exam.
Part of what makes AI progress so confusing these days is how uneven it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Mathematical Olympiad and beating top human programmers on competitive coding challenges.
But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has created wildly different impressions of how quickly AI is improving, depending on whether you're looking at its best or its worst outputs.
That unevenness has also made these models hard to measure. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe we need more creative ways of tracking AI progress that don't rely on standardized tests, because most of what humans do, and what we fear AI will do better than us, can't be captured on a written exam.
Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity's Last Exam, told me that while AI models were often impressive at answering complex questions, he didn't consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.
"There's a big difference between what it means to take an exam and what it means to be a practicing physicist and researcher," he said. "Even an AI that can answer these questions might not be ready to help in research, which is inherently less structured."