News

Feb 12, 2026 I was on Chain of Thought to talk about ResearchRubrics, a benchmark to evaluate deep research agents
Feb 06, 2026 Reliable Weak-to-Strong Monitoring of LLM Agents has been accepted to ICLR 2026 as an oral!
Jan 28, 2026 Humanity’s Last Exam (HLE) has been published in Nature!
Jan 26, 2026 Reliable Weak-to-Strong Monitoring of LLM Agents, ResearchRubrics, and MoReBench have been accepted to ICLR 2026
Nov 20, 2025 I recorded a podcast! Check out the latest episode of Chain of Thought on PRBench
Nov 13, 2025 We published PRBench, a new benchmark to measure how well LLMs perform in professional domains
Nov 12, 2025 Check out our new paper ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Aug 28, 2025 We published Reliable Weak-to-Strong Monitoring of LLM Agents, extending our work on SHADE-Arena
Jun 19, 2025 Excited to share our new paper SHADE-Arena, in collaboration with Anthropic
Jan 27, 2025 I am excited to join Scale AI as an ML Research Ops Lead
Jan 22, 2025 ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization has been accepted to ICLR 2025