News

Nov 20, 2025 I recorded a podcast! Check out the latest episode of Chain of Thought on PRBench
Nov 13, 2025 We published PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning, a new benchmark that measures how well LLMs perform in professional domains
Nov 12, 2025 Check out our new paper ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Aug 28, 2025 We published Reliable Weak-to-Strong Monitoring of LLM Agents, extending our work on SHADE-Arena
Jun 19, 2025 Excited to share our new paper SHADE-Arena, in collaboration with Anthropic
Jan 27, 2025 I am excited to join Scale AI as an ML Research Ops Lead
Jan 22, 2025 ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization has been accepted at ICLR 2025