GPT-5.2 surpasses expert PhD baseline with 92% science score

GPT-5.2 scored 92% on a “Google-Proof” science benchmark, significantly surpassing the 70% expert baseline. The advanced model also achieved medal-winning performance in major international competitions, demonstrating its evolving capabilities in scientific reasoning.

Scientists already use these systems extensively for tasks such as literature searches across disciplines and languages and for navigating complex mathematical proofs, often compressing work that would take days or weeks into a few hours. The paper, Early science acceleration experiments with GPT-5, published in November 2025, provides early evidence that GPT-5 can meaningfully speed up scientific workflows.

To further measure and forecast AI models’ ability to accelerate scientific research, developers introduced FrontierScience, a new benchmark designed to assess expert-level scientific capabilities. The benchmark contains questions written and verified by experts in physics, chemistry, and biology, focusing on originality and difficulty.

FrontierScience features two distinct tracks:

  • Olympiad: Measures scientific reasoning abilities in the style of international Olympiad competitions.
  • Research: Evaluates real-world scientific research capabilities.

In initial evaluations, GPT-5.2 emerged as the top-performing model on both tracks, scoring 77% on FrontierScience-Olympiad and 25% on FrontierScience-Research, ahead of other frontier models including Claude Opus 4.5 and Gemini 3 Pro. The results indicate that current models can support the structured reasoning aspects of research, though significant work remains on open-ended thinking.

FrontierScience encompasses over 700 textual questions, with 160 in its gold set, spanning subfields in physics, chemistry, and biology. FrontierScience-Olympiad features 100 questions collaboratively designed by 42 international Olympiad medalists and national team coaches. FrontierScience-Research includes 60 original research subtasks developed by 45 PhD scientists, including doctoral candidates, professors, and postdoctoral researchers.
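
For illustration only, the composition described above could be captured in a simple record type; the field names below are hypothetical and are not taken from the FrontierScience release.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical record type for one benchmark item. Field names are
# illustrative only, not the published FrontierScience schema.
@dataclass
class FrontierScienceItem:
    item_id: str
    track: Literal["olympiad", "research"]              # the two tracks described above
    domain: Literal["physics", "chemistry", "biology"]
    question: str                                        # textual problem statement
    reference_answer: str                                # short answer (Olympiad) or target result (Research)
    rubric: Optional[list[str]] = None                   # 10-point rubric criteria, Research track only
    in_gold_set: bool = False                            # roughly 160 of the 700+ items form the gold set
```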

For the Olympiad set, responses are graded by short-answer verification. For the Research track, a rubric-based scheme with a 10-point scale evaluates open-ended tasks, assessing both the final answer and intermediate reasoning steps; a model-based grader, GPT-5, scores responses against these criteria. Because each task was screened against internal models during creation, the resulting evaluations may be biased with respect to specific models.
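
A minimal sketch of what such a two-path grading setup could look like, assuming an exact-match check for Olympiad short answers and a grader callable that wraps a model such as GPT-5 for the Research rubric; the function names and prompt are illustrative, not the published grading code:

```python
import re

def grade_olympiad(model_answer: str, reference_answer: str) -> bool:
    """Short-answer verification: normalize whitespace and case, then compare to the reference."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s.strip().lower())
    return normalize(model_answer) == normalize(reference_answer)

def grade_research(model_answer: str, rubric: list[str], grader) -> float:
    """Rubric-based grading on a 10-point scale.

    `grader` stands in for a model-based grader (e.g. a GPT-5 call); it is
    assumed to return the number of rubric criteria the response satisfies,
    covering both the final answer and intermediate reasoning steps.
    """
    prompt = (
        "For each criterion below, decide whether the response meets it:\n"
        + "\n".join(f"- {criterion}" for criterion in rubric)
        + f"\n\nResponse:\n{model_answer}"
    )
    criteria_met = grader(prompt)            # integer count returned by the grader model
    return 10.0 * criteria_met / max(len(rubric), 1)
```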

Key performance results include:

  • FrontierScience-Olympiad Accuracy:
    • GPT-5.2: 77.1%
    • Gemini 3 Pro: 76.1%
    • Claude Opus 4.5: 71.4%
  • FrontierScience-Research Accuracy:
    • GPT-5.2: 25.2%
    • Claude Opus 4.5: 17.5%
    • Grok 4: 15.9%

Longer processing times, i.e. higher reasoning-effort settings, correlated with improved accuracy for both GPT-5.2 and OpenAI o3. For instance, GPT-5.2’s accuracy on FrontierScience-Olympiad increased from 67.5% at “Low” reasoning effort to 77.1% at “XHigh,” and on FrontierScience-Research it rose from 18.2% at “Low” to 25.2% at “XHigh.”
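
As a sketch of how such an effort sweep might be run, assuming a hypothetical `answer(question, effort)` client call and a `grade` check in the spirit of the functions above; the effort labels mirror the reported settings, but the interface is not a real API:

```python
def accuracy_by_effort(items, answer, grade, efforts=("low", "medium", "high", "xhigh")):
    """Sweep reasoning-effort settings and report accuracy at each one.

    `answer(question, effort)` is a hypothetical client call returning the
    model's response at the given effort setting, and `grade(response, item)`
    returns True for a correct answer; neither reflects a real API surface.
    """
    results = {}
    for effort in efforts:
        correct = sum(grade(answer(item.question, effort), item) for item in items)
        results[effort] = correct / len(items)
    return results
```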

FrontierScience currently focuses on constrained problem statements and does not assess the generation of novel hypotheses or interactions with multimodal data. Developers plan to iterate on the benchmark, expanding it to new domains and integrating more real-world evaluations as models improve.
