Chen Bo Calvin Zhang

I am a Research Program Manager at OpenAI, where I work on evals.

Previously, I was the ML Research Ops Lead at Scale AI, where I focused on designing and running evaluations, developing benchmarks, and managing leaderboards.

I also worked as a research intern at CHAI, where I collaborated with Micah Carroll on multi-turn red teaming strategies for large language models. Before that, I was a visiting scholar at MIT, where I focused on online learning and reward design for reinforcement learning with Zhang-Wei Hong, Aldo Pacchiano, and Pulkit Agrawal.

I received my MSc in Data Science from ETH Zurich, where I conducted research in preference-based reinforcement learning with Giorgia Ramponi. Earlier, I completed my BSc (Hons) in Computer Science and Mathematics at the University of Manchester, with a focus on adversarial attacks in deep reinforcement learning under the supervision of Tingting Mu.

My research interests include evaluation methodologies, sequential decision making, and AI safety and alignment.

Google Scholar / Twitter / GitHub / LinkedIn

News

Jun 08, 2026	I joined OpenAI as a Research Program Manager, working on evals
Apr 30, 2026	SWE-Bench Pro, SciPredict, and SpreadsheetArena have been accepted to ICML 2026
Feb 12, 2026	I was on Chain of Thought to talk about ResearchRubrics, a benchmark to evaluate deep research agents
Feb 06, 2026	Reliable Weak-to-Strong Monitoring of LLM Agents has been accepted to ICLR 2026 as an oral!
Jan 28, 2026	Humanity’s Last Exam (HLE) has been published in Nature!
Jan 26, 2026	Reliable Weak-to-Strong Monitoring of LLM Agents, ResearchRubrics, and MoReBench have been accepted to ICLR 2026
Nov 20, 2025	I recorded a podcast! Check out the latest episode of Chain of Thought on PRBench
Nov 13, 2025	We published PRBench, a new benchmark that measures how well LLMs perform in professional domains
Nov 12, 2025	Check out our new paper ResearchRubrics: A Benchmark of Prompts and Rubrics for Evaluating Deep Research Agents
Aug 28, 2025	We published Reliable Weak-to-Strong Monitoring of LLM Agents, extending our work on SHADE-Arena
Jun 19, 2025	Excited to share our new paper SHADE-Arena, in collaboration with Anthropic
Jan 27, 2025	I am excited to join Scale AI as an ML Research Ops Lead
Jan 22, 2025	ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization has been accepted at ICLR 2025

Selected Publications

2026

Nature

Humanity’s Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, and 2 more authors

Nature, 2026

Bib HTML

@article{phan2025humanity,
  title = {Humanity's Last Exam},
  author = {Phan, Long and Gatti, Alice and Han, Ziwen and Li, Nathaniel and Hu, Josephina and Zhang, Hugh and Zhang, Chen Bo Calvin and Shaaban, Mohamed and Ling, John and Shi, Sean and others},
  journal = {Nature},
  volume = {649},
  number = {8099},
  pages = {1139--1146},
  year = {2026},
  publisher = {Nature Publishing Group UK London},
}

ICML

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, and 8 more authors

The Forty-third International Conference on Machine Learning (ICML), 2026

Bib HTML

@inproceedings{sehwag2026scipredict,
  title = {SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?},
  author = {Sehwag, Udari Madhushani and Lau, Elaine and Oskouie, Haniyeh Ehsani and Shabihi, Shayan and Liang, Erich and Toledo, Andrea and Mangialardi, Guillermo and Fonrouge, Sergio and Cardona, Ed-Yeremai Hernandez and Vergara, Paula and Tyagi, Utkarsh and Zhang, Chen Bo Calvin and Bhatter, Pavi and Johnson, Nicholas and Huang, Furong and Montoya, Ernesto Gabriel Hernandez and Liu, Bing},
  booktitle = {The Forty-third International Conference on Machine Learning (ICML)},
  year = {2026},
}

ICML

SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, and John Ling

The Forty-third International Conference on Machine Learning (ICML), 2026

Bib HTML

@inproceedings{kundurthy2026spreadsheetarena,
  title = {SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks},
  author = {Kundurthy, Srivatsa and Na, Clara and Handley, Michael and Kirshner, Zach and Zhang, Chen Bo Calvin and Sharma, Manasi and Strubell, Emma and Ling, John},
  booktitle = {The Forty-third International Conference on Machine Learning (ICML)},
  year = {2026},
}

arXiv

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Chen Bo Calvin Zhang, Christina Q Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, and 2 more authors

arXiv preprint arXiv:2602.23329, 2026

Bib HTML

@article{zhang2026llm,
  title = {LLM Novice Uplift on Dual-Use, In Silico Biology Tasks},
  author = {Zhang, Chen Bo Calvin and Knight, Christina Q and Kruus, Nicholas and Hausenloy, Jason and Medeiros, Pedro and Li, Nathaniel and Kim, Aiden and Orlovskiy, Yury and Breen, Coleman and Cai, Bryce and others},
  journal = {arXiv preprint arXiv:2602.23329},
  year = {2026},
}

ACL

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, and 2 more authors

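In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2026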

Bib HTML

@inproceedings{akyurek2025prbench,
  title = {PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning},
  author = {Aky{\"u}rek, Afra Feyza and Gosai, Advait and Zhang, Chen Bo Calvin and Gupta, Vipul and Jeong, Jaehwan and Gunjal, Anisha and Rabbani, Tahseen and Mazzone, Maria and Randolph, David and Meymand, Mohammad Mahmoudi and others},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year = {2026},
}

ICLR

ResearchRubrics: A Benchmark of Prompts and Rubrics for Evaluating Deep Research Agents

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, and 2 more authors

The Fourteenth International Conference on Learning Representations (ICLR), 2026

Bib HTML

@inproceedings{sharma2025researchrubrics,
  title = {ResearchRubrics: A Benchmark of Prompts and Rubrics for Evaluating Deep Research Agents},
  author = {Sharma, Manasi and Zhang, Chen Bo Calvin and Bandi, Chaithanya and Wang, Clinton and Aich, Ankit and Nghiem, Huy and Rabbani, Tahseen and Htet, Ye and Jang, Brian and Basu, Sumana and others},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year = {2026},
}

ICLR

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More Than Outcomes

Yu Ying Chiu, Michael S Lee, Rachel Calcott, Brandon Handoko, Paul Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, and 2 more authors

The Fourteenth International Conference on Learning Representations (ICLR), 2026

Bib HTML

@inproceedings{chiu2025morebench,
  title = {MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More Than Outcomes},
  author = {Chiu, Yu Ying and Lee, Michael S and Calcott, Rachel and Handoko, Brandon and de Font-Reaulx, Paul and Rodriguez, Paula and Zhang, Chen Bo Calvin and Han, Ziwen and Sehwag, Udari Madhushani and Maurya, Yash and others},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year = {2026},
}

ICML

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, and 13 more authors

The Forty-third International Conference on Machine Learning (ICML), 2026

Bib HTML

@inproceedings{deng2025swe,
  title = {SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?},
  author = {Deng, Xiang and Da, Jeff and Pan, Edwin and He, Yannis Yiming and Ide, Charles and Garg, Kanak and Lauffer, Niklas and Park, Andrew and Pasari, Nitin and Rane, Chetan and Sampath, Karmini and Krishnan, Maya and Kundurthy, Srivatsa and Hendryx, Sean and Wang, Zifan and Bharadwaj, Vijay and Holm, Jeff and Aluri, Raja and Zhang, Chen Bo Calvin and Jacobson, Noah and Liu, Bing and Kenstler, Brad},
  booktitle = {The Forty-third International Conference on Machine Learning (ICML)},
  year = {2026},
}

ICLR

Reliable Weak-to-Strong Monitoring of LLM Agents (Oral)

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q Knight, and Zifan Wang

The Fourteenth International Conference on Learning Representations (ICLR), 2026

Bib HTML

@inproceedings{kale2025reliable,
  title = {Reliable Weak-to-Strong Monitoring of LLM Agents},
  author = {Kale, Neil and Zhang, Chen Bo Calvin and Zhu, Kevin and Aich, Ankit and Rodriguez, Paula and {Scale Red Team} and Knight, Christina Q and Wang, Zifan},
  booktitle = {The Fourteenth International Conference on Learning Representations (ICLR)},
  year = {2026},
  oral = {true}
}

2025

arXiv

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, and 3 more authors

arXiv preprint arXiv:2506.15740, 2025

Bib HTML

@article{kutasov2025shade,
  title = {SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents},
  author = {Kutasov, Jonathan and Sun, Yuqi and Colognese, Paul and van der Weij, Teun and Petrini, Linda and Zhang, Chen Bo Calvin and Hughes, John and Deng, Xiang and Sleight, Henry and Tracy, Tyler and Shlegeris, Buck and Benton, Joe},
  journal = {arXiv preprint arXiv:2506.15740},
  year = {2025},
}

ICLR

ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, and Pulkit Agrawal

In The Thirteenth International Conference on Learning Representations (ICLR), 2025

Bib HTML

@inproceedings{zhang2025orso,
  title = {ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization},
  author = {Zhang, Chen Bo Calvin and Hong, Zhang-Wei and Pacchiano, Aldo and Agrawal, Pulkit},
  booktitle = {The Thirteenth International Conference on Learning Representations (ICLR)},
  year = {2025},
}

2023

ICML

HIP-RL: Hallucinated Inputs for Preference-based Reinforcement Learning in Continuous Domains

Chen Bo Calvin Zhang and Giorgia Ramponi

In ICML 2023 Workshop: The Many Facets of Preference-Based Learning, 2023

Bib

@inproceedings{zhang2023hip,
  title = {HIP-RL: Hallucinated Inputs for Preference-based Reinforcement Learning in Continuous Domains},
  author = {Zhang, Chen Bo Calvin and Ramponi, Giorgia},
  booktitle = {ICML 2023 Workshop: The Many Facets of Preference-Based Learning},
  year = {2023},
}