Chen Bo Calvin Zhang

ML Research Ops Lead @ Scale AI | Previously @ CHAI, MIT, ETH Zurich, and University of Manchester


I am the ML Research Ops Lead at Scale AI, where I focus on designing and running evaluations, developing benchmarks, and managing leaderboards.

Previously, I was a research intern at CHAI, where I worked with Micah Carroll on multi-turn red teaming strategies for large language models. I also spent time as a visiting scholar at MIT, where I focused on online learning and reward design for reinforcement learning, working with Zhang-Wei Hong, Aldo Pacchiano, and Pulkit Agrawal.

I received my MSc in Data Science from ETH Zurich, where I conducted research in preference-based reinforcement learning with Giorgia Ramponi.

Earlier, I completed my BSc (Hons) in Computer Science and Mathematics at the University of Manchester, with a focus on adversarial attacks in deep reinforcement learning under the supervision of Tingting Mu.

My research interests include evaluation methodologies, agents, sequential decision making, and AI safety and alignment.

Google Scholar  /  Twitter  /  GitHub  /  LinkedIn

News

Feb 12, 2026 I was on Chain of Thought to talk about ResearchRubrics, a benchmark to evaluate deep research agents
Feb 06, 2026 Reliable Weak-to-Strong Monitoring of LLM Agents has been accepted to ICLR 2026 as an oral!
Jan 28, 2026 Humanity’s Last Exam (HLE) has been published in Nature!
Jan 26, 2026 Reliable Weak-to-Strong Monitoring of LLM Agents, ResearchRubrics, and MoReBench have been accepted to ICLR 2026
Nov 20, 2025 I recorded a podcast! Check out the latest episode of Chain of Thought on PRBench
Nov 13, 2025 We published PRBench, a new benchmark to measure how well LLMs perform in professional domains
Nov 12, 2025 Check out our new paper ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Aug 28, 2025 We published Reliable Weak-to-Strong Monitoring of LLM Agents, extending our work on SHADE-Arena
Jun 19, 2025 Excited to share our new paper SHADE-Arena, in collaboration with Anthropic
Jan 27, 2025 I am excited to join Scale AI as an ML Research Ops Lead
Jan 22, 2025 ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization has been accepted at ICLR 2025

Selected Publications

2026

  1. Nature
    Humanity’s Last Exam
    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, and 2 more authors
    Nature, 2026

2025

  1. arXiv
    PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
    Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, and 2 more authors
    arXiv preprint arXiv:2511.11562, 2025
  2. ICLR
    ResearchRubrics: A Benchmark of Prompts and Rubrics for Evaluating Deep Research Agents
    Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, and 2 more authors
    The Fourteenth International Conference on Learning Representations (ICLR), 2025
  3. ICLR
    MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More Than Outcomes
    Yu Ying Chiu, Michael S Lee, Rachel Calcott, Brandon Handoko, Paul Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, and 2 more authors
    The Fourteenth International Conference on Learning Representations (ICLR), 2025
  4. arXiv
    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, and 13 more authors
    arXiv preprint arXiv:2509.16941, 2025
  5. ICLR
    Reliable Weak-to-Strong Monitoring of LLM Agents (Oral)
    Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q Knight, and Zifan Wang
    The Fourteenth International Conference on Learning Representations (ICLR), 2025
  6. arXiv
    SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
    Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, and 3 more authors
    arXiv preprint arXiv:2506.15740, 2025
  7. ICLR
    ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization
    Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, and Pulkit Agrawal
    In The Thirteenth International Conference on Learning Representations (ICLR), 2025

2023

  1. ICML
    HIP-RL: Hallucinated Inputs for Preference-based Reinforcement Learning in Continuous Domains
    Chen Bo Calvin Zhang and Giorgia Ramponi
    In ICML 2023 Workshop: The Many Facets of Preference-Based Learning, 2023