Oolong is a challenging long-context reasoning benchmark. Questions require multi-step reasoning: identifying relevant sections of the input, labeling or categorizing those sections, and then aggregating information to make distribution-level decisions. See our paper and GitHub repo for details.
Questions about Oolong? Have a model you'd like us to evaluate? Reach out to abertsch@cs.cmu.edu.
If you use Oolong, please cite:

```bibtex
@misc{bertsch2025oolongevaluatinglongcontext,
  title={Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities},
  author={Amanda Bertsch and Adithya Pratapa and Teruko Mitamura and Graham Neubig and Matthew R. Gormley},
  year={2025},
  eprint={2511.02817},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.02817},
}
```