OpenAI has released FrontierScience, a benchmark designed to rigorously evaluate how well AI models solve scientific problems. The evaluation covers high-level scientific questions ranging from Olympiad-style problems to complex research inquiries curated by Ph.D.-credentialed scientists, and it reflects an ambition the company shares with much of the industry: integrating artificial intelligence into scientific inquiry itself.
AI pioneers have long argued that advances in machine intelligence could transform scientific understanding. Demis Hassabis founded DeepMind with the mission to "solve intelligence" and then use that breakthrough to "solve everything else." Sam Altman has likewise emphasized AI's potential to accelerate scientific progress and, with it, quality of life. Dario Amodei of Anthropic has projected that AI could amount to "a country of geniuses in a data center" as soon as 2026. Of the many motivations behind the current AI surge, the conviction that AI could unlock a deeper understanding of the universe remains among the oldest and most compelling.
FrontierScience takes aim at that ambition with two tiers of questions spanning physics, chemistry, and biology. The Olympiad tier contains problems pitched at the brightest young students, while the Research tier goes further, posing open-ended questions that demand the reasoning and judgment of real research work.
One research-level question, for example, involved analyzing "meso-nitrogen atoms in nickel(II) phthalocyanine," a problem whose computational modeling can stretch over several days; Francisco Martin-Martinez, a senior chemistry lecturer at King's College London, notes how demanding such simulations are. Another research question asked for a derivation of "electrostatic wave modes" in plasma. Tom Ashton-Key, a plasma physics doctoral candidate at Imperial College London, said comparable derivations used to take him weeks, and that a regular share of his research time goes to exactly this kind of analysis.
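The article does not reproduce the benchmark question itself, but a textbook result of that kind of derivation, the Bohm-Gross dispersion relation for Langmuir waves in an unmagnetized plasma, gives a sense of what such a problem ultimately produces (this is an illustrative standard result, not the specific FrontierScience question):

```latex
% Bohm-Gross dispersion relation for electron (Langmuir) waves, one of the
% canonical electrostatic wave modes in an unmagnetized plasma.
% Illustrative only; not the specific FrontierScience problem.
\[
  \omega^{2} = \omega_{pe}^{2} + 3\,k^{2} v_{th,e}^{2},
  \qquad
  \omega_{pe} = \sqrt{\frac{n_{e} e^{2}}{\varepsilon_{0} m_{e}}},
  \qquad
  v_{th,e} = \sqrt{\frac{k_{B} T_{e}}{m_{e}}}
\]
```

Even this idealized mode requires linearizing the plasma fluid equations together with Poisson's equation, which hints at why a full research-level version of such a derivation can consume weeks of work.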
Results from FrontierScience confirm the rapid improvement in AI's scientific problem-solving. Early progress was modest, but recent advances, particularly reinforcement learning and reasoning-optimized models, have pushed performance sharply upward. OpenAI's GPT-5.2 currently leads the benchmark, scoring 77.1% on the Olympiad-level questions and 25.3% on the harder Research set. The gain over its predecessor GPT-5 on the Research set is small, but the result still marks real progress toward an AI system that can meaningfully augment scientific work.
Miles Wang, a member of OpenAI's evaluation team, stresses the potential impact of AI achieving near-perfect performance on research-level problems, suggesting it could become a highly competent collaborator that amplifies the productivity of scientists and doctoral students. Nonetheless, Wang acknowledges that FrontierScience does not encompass all critical scientific capabilities, particularly those involving experimental execution or the interpretation of non-textual data such as images and videos.
The question sets are also small (100 in the Olympiad tier and 60 in Research), which limits how reliably closely matched models can be compared. And because there is no human baseline for these questions, it is hard to place the models' scores against typical expert performance. Jaime Sevilla, director of the research institute Epoch AI, calls the benchmark a valuable addition to the ecosystem while noting how hard it is to build meaningful AI evaluations at this frontier.
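To make the sample-size concern concrete, here is a rough sketch (not part of the benchmark's own methodology) of the statistical uncertainty around the reported scores, assuming each question is simply marked right or wrong:

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Scores and question counts as reported in the article.
for tier, score, n in [("Olympiad", 0.771, 100), ("Research", 0.253, 60)]:
    low, high = wilson_interval(score, n)
    print(f"{tier}: {score:.1%} over {n} questions -> 95% CI ~{low:.1%} to {high:.1%}")
```

Under these assumptions the 60-question Research score of 25.3% carries an interval of roughly 16% to 38%, which is why small gaps between closely matched models are hard to read as genuine differences. The benchmark's actual research-tier grading may well be rubric-based rather than binary, so this is only an order-of-magnitude illustration.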
The difficulty is compounded by the scarcity of specialists able to write highly technical questions and judge the answers, work that demands substantial time and money. And when the question writers are themselves among the world's foremost experts on a topic, gauging a problem's difficulty and validity becomes harder still. OpenAI partners with expert annotation firms such as Mercor and Surge AI, which recruit academics to design questions and grade AI answers, a sign of how specialized the labor behind these assessments has become.
AI is already reshaping parts of science in narrow applications. DeepMind's AlphaFold has predicted more than 200 million protein structures, a body of in silico work that would take centuries to reproduce experimentally. Other efforts aim to simulate plasma behavior inside fusion devices and to sharpen the resolution of weather forecasts. But these are specialized, domain-specific systems rather than broad scientific intelligence.
OpenAI and other organizations want AI systems versatile enough to help across the whole scientific workflow, from experiment design to data analysis, in any field. AI's growing strength in mathematics and computation offers a glimpse of that future. OpenAI mathematician Sebastien Bubeck recounts how GPT-5 solved longstanding mathematical problems that had eluded his graduate students for years, producing a new mathematical identity after two days of continuous computation. AI has similarly accelerated coding: Keith Butler, an associate professor of chemistry at University College London, says tasks that once took hours now take minutes, which has revived his appetite for computational work. Even so, Butler remains skeptical that today's AI can independently propose novel scientific hypotheses.
Others are more critical. Carlo Rovelli, a theoretical physicist and chief editor of Foundations of Physics, points to the flood of low-quality, AI-assisted papers now inundating academic journals and overwhelming human editors. Many submissions, he says, grow out of superficial conversations with language models and lack genuine scientific merit.
Still, if AI models keep improving along the trend FrontierScience documents, reliable AI research assistants could arrive soon. The pace leaves scientists such as Francisco Martin-Martinez both excited and disoriented; he jokes that he needs AI to summarize his own feelings about how fast the field is moving.