| Feb 12, 2026 | I was on the Chain of Thought podcast to talk about ResearchRubrics, a benchmark for evaluating deep research agents |
| Feb 06, 2026 | Reliable Weak-to-Strong Monitoring of LLM Agents has been accepted to ICLR 2026 as an oral! |
| Jan 28, 2026 | Humanity’s Last Exam (HLE) has been published in Nature! |
| Jan 26, 2026 | Reliable Weak-to-Strong Monitoring of LLM Agents, ResearchRubrics, and MoReBench have been accepted to ICLR 2026 |
| Nov 20, 2025 | I recorded a podcast! Check out the latest episode of Chain of Thought on PRBench |
| Nov 13, 2025 | We published PRBench, a new benchmark to measure how well LLMs perform in professional domains |
| Nov 12, 2025 | Check out our new paper ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents |
| Aug 28, 2025 | We published Reliable Weak-to-Strong Monitoring of LLM Agents, extending our work on SHADE-Arena |
| Jun 19, 2025 | Excited to share our new paper SHADE-Arena, in collaboration with Anthropic |
| Jan 27, 2025 | I am excited to join Scale AI as an ML Research Ops Lead |
| Jan 22, 2025 | ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization has been accepted at ICLR 2025 |