Oolong is a challenging long-context reasoning benchmark. Questions require multi-step reasoning: identifying relevant sections of the input, labeling or categorizing those sections, and then aggregating information to make distribution-level decisions. See our paper and GitHub repo for details.
Questions about Oolong? Have a model you'd like us to evaluate? Reach out to abertsch@cs.cmu.edu.
If you use Oolong, please cite:

```bibtex
@misc{bertsch2025oolongevaluatinglongcontext,
  title={Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities},
  author={Amanda Bertsch and Adithya Pratapa and Teruko Mitamura and Graham Neubig and Matthew R. Gormley},
  year={2025},
  eprint={2511.02817},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.02817},
}
```