What is the Data Agent Benchmark?
Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text.
DAB is the first benchmark that tests agents on these challenges. It covers 12 real-world datasets across 9 domains and 4 database systems (PostgreSQL, MongoDB, SQLite, DuckDB).
| Datasets | Queries | Domains | DBMSes |
|---|---|---|---|
| 12 | 54 | 9 | 4 |
From EPIC Data Lab, UC Berkeley and Hasura PromptQL.
Submit to the Leaderboard
Run your agent on all 54 queries with at least n = 5 trials/query and open a pull request with your results JSON.
- Collect one result per dataset × query × trial.
- Package into one JSON file.
- Open a PR with your JSON and agent details.
```json
[
  {
    "dataset": "bookreview",
    "query": "1",
    "run": 0,
    "answer": "2020s"
  },
  ...
]
```
Full instructions in the README.
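The collection steps above can be sketched in Python. This is a minimal sketch, not the official harness: `run_agent` is a hypothetical callable standing in for your agent's entry point, and the record fields mirror the JSON schema shown above.

```python
def collect_results(run_agent, datasets, n_trials=5):
    """Build one record per dataset x query x trial, matching the
    submission schema (dataset, query, run, answer).

    run_agent: hypothetical callable (dataset, query_id) -> answer string.
    datasets: mapping of dataset name -> list of query ids.
    """
    return [
        {"dataset": ds, "query": q, "run": run, "answer": run_agent(ds, q)}
        for ds, queries in datasets.items()
        for q in queries
        for run in range(n_trials)
    ]
```

Dump the returned list with `json.dump(results, f, indent=2)` to produce the single results file for the pull request.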
Leaderboard
Pass@1 is the fraction of queries answered correctly on the first attempt, averaged across the n trials per query. The overall score is stratified: Pass@1 is computed per dataset first, then averaged across datasets, so every dataset carries equal weight regardless of how many queries it contains.
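The stratified scoring just described can be sketched as follows. This is an illustrative sketch, not the official grader; it assumes a simplified exact-string-match grading rule and a `gold` mapping of correct answers, neither of which is specified here.

```python
from collections import defaultdict

def stratified_pass_at_1(results, gold):
    """Per-query Pass@1 = fraction of that query's trials answered
    correctly. Dataset score = mean over its queries; overall score
    = mean over dataset scores (stratified).

    results: records like {"dataset", "query", "run", "answer"}.
    gold: maps (dataset, query) -> correct answer (assumed exact match).
    """
    trials = defaultdict(list)  # (dataset, query) -> [bool per trial]
    for r in results:
        key = (r["dataset"], r["query"])
        trials[key].append(r["answer"] == gold[key])

    per_dataset = defaultdict(list)  # dataset -> [per-query Pass@1]
    for (dataset, _), outcomes in trials.items():
        per_dataset[dataset].append(sum(outcomes) / len(outcomes))

    dataset_scores = [sum(qs) / len(qs) for qs in per_dataset.values()]
    return sum(dataset_scores) / len(dataset_scores)
```

Because datasets are averaged last, a dataset with many queries cannot dominate the overall score.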
| # | Agent | Team | n | Pass@1 | Date |
|---|---|---|---|---|---|
Per-Dataset Pass@1
Sort by agent to compare dataset difficulty.