
What is the Data Agent Benchmark?

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text.

DAB is the first benchmark that tests agents on these challenges. It covers 12 real-world datasets across 9 domains and 4 database systems (PostgreSQL, MongoDB, SQLite, DuckDB).

Datasets: 12 · Queries: 54 · Domains: 9 · DBMSes: 4

From EPIC Data Lab, UC Berkeley and Hasura PromptQL.


Submit to the Leaderboard

Run your agent on all 54 queries with at least n = 5 trials per query, then open a pull request with your results JSON.

  1. Collect one result per dataset × query × trial.
  2. Package into one JSON file.
  3. Open a PR with your JSON and agent details.
[
  {
    "dataset": "bookreview",
    "query": "1",
    "run": 0,
    "answer": "2020s"
  },
  ...
]
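The steps above can be sketched in a short script. This is a minimal illustration, not an official harness: `run_agent` is a hypothetical placeholder for your agent's entry point, and the dataset/query IDs shown are examples only — see the README for the full query list and required fields.

```python
import json

def run_agent(dataset, query):
    # Hypothetical stand-in for your agent; replace with a real call.
    return "2020s"

N_TRIALS = 5  # the leaderboard requires at least 5 trials per query
queries = {"bookreview": ["1"]}  # example mapping: dataset -> its query IDs

# Step 1: collect one result per dataset x query x trial.
results = []
for dataset, query_ids in queries.items():
    for query in query_ids:
        for run in range(N_TRIALS):
            results.append({
                "dataset": dataset,
                "query": query,
                "run": run,
                "answer": run_agent(dataset, query),
            })

# Step 2: package everything into one JSON file for the PR.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```

Each trial becomes one object in the top-level JSON array, matching the example format above.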

Full instructions in the README.

Leaderboard

Pass@1 for a query is the fraction of its n trials answered correctly on the first attempt. The overall score is stratified: Pass@1 is averaged within each dataset first, then averaged across datasets, so every dataset carries equal weight regardless of how many queries it contains.
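The stratified averaging can be written out as a small function. A minimal sketch, assuming exact-match comparison of answers against a gold-answer map (the actual grading rules are defined in the README):

```python
from collections import defaultdict

def stratified_pass_at_1(results, expected):
    """results: list of {dataset, query, run, answer} dicts.
    expected: {(dataset, query): gold answer} -- hypothetical gold map."""
    # Per-query Pass@1: fraction of trials answered correctly.
    trials = defaultdict(list)
    for r in results:
        key = (r["dataset"], r["query"])
        trials[key].append(r["answer"] == expected[key])
    # Per-dataset score: mean Pass@1 over that dataset's queries.
    per_dataset = defaultdict(list)
    for (dataset, _query), oks in trials.items():
        per_dataset[dataset].append(sum(oks) / len(oks))
    # Overall score: mean over datasets, so each dataset weighs equally.
    scores = [sum(qs) / len(qs) for qs in per_dataset.values()]
    return sum(scores) / len(scores)
```

Because the mean is taken per dataset first, a dataset with many queries cannot dominate the overall score.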

# | Agent | Team | n | Pass@1 | Date

Per-Dataset Pass@1

Sort by agent to compare dataset difficulty.