A Python library of utilities for reproducible LLM evaluation.
An open dataset for evaluating robust visual question answering under distribution shift.