| Nov 20, 2025 | I recorded a podcast! Check out the latest episode of Chain of Thought on PRBench |
| Nov 13, 2025 | We published PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning, a new benchmark that measures how well LLMs perform in professional domains |
| Nov 12, 2025 | Check out our new paper ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents |
| Aug 28, 2025 | We published Reliable Weak-to-Strong Monitoring of LLM Agents, extending our work on SHADE-Arena |
| Jun 19, 2025 | Excited to share our new paper SHADE-Arena, in collaboration with Anthropic |
| Jan 27, 2025 | I am excited to join Scale AI as an ML Research Ops Lead |
| Jan 22, 2025 | ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization has been accepted at ICLR 2025 |