Datasets and Frames
A dataset is the input to a pipeline. Each item in it is one row: an object in a JSON list, a row in a CSV or Parquet file, or one file in a directory.
In the Python API, a Frame is the data structure that represents a
dataset. Readers return a Frame, and every operation on a Frame returns a
new Frame representing the transformed dataset. See
Frame methods below.
Defining a dataset
Datasets are declared at the top level of the config and referenced by name in the pipeline's steps:
datasets:
  user_logs:
    type: file
    path: "user_logs.json"
A reader loads a dataset and returns a Frame:
import docetl
user_logs = docetl.read_json("user_logs.json") # or read_csv, read_parquet, read_dir
in_memory = docetl.from_list([{"text": "..."}]) # from a list of dicts
Chaining operations like .map() records them without running
anything; execution happens at an action like .collect(). See
Frame methods.
Examples
A JSON file
A list of objects; each object is one row.
// reviews.json
[
{"id": 1, "product": "headphones", "review": "Battery died after a week."},
{"id": 2, "product": "keyboard", "review": "Keys feel great, very quiet."}
]
datasets:
  reviews:
    type: file
    path: "reviews.json"
reviews = docetl.read_json("reviews.json")
A CSV or Parquet file
Each row of the table is one row of the dataset; column names become keys.
ticket_id,customer,message
101,acme,"Cannot log in since the update"
102,globex,"Invoice total looks wrong"
datasets:
  tickets:
    type: file
    path: "tickets.csv" # or .parquet
tickets = docetl.read_csv("tickets.csv") # or read_parquet(...)
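To make "column names become keys" concrete, here is a minimal standard-library sketch, independent of DocETL, that turns the sample CSV above into a list of dicts. It illustrates the row shape only. How `docetl.read_csv` infers value types is not covered here; `csv.DictReader` leaves every value as a string.

```python
import csv
import io

# The sample tickets.csv from above, inlined so the sketch is self-contained.
CSV_TEXT = '''ticket_id,customer,message
101,acme,"Cannot log in since the update"
102,globex,"Invoice total looks wrong"
'''


def csv_to_rows(text):
    """Return one dict per CSV row, keyed by the header's column names."""
    return list(csv.DictReader(io.StringIO(text)))


rows = csv_to_rows(CSV_TEXT)
for row in rows:
    print(row)
```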
A directory of documents
Every non-hidden file under the directory (recursively) becomes one row:
text holds the file's content, with filename and path alongside. PDF,
Word, PowerPoint, and Excel files are converted to text; other files are read
as UTF-8; binary files with no extractor are skipped with a warning.
contracts/
  acme_msa.pdf
  globex_nda.docx
  notes/renewal_2026.txt
datasets:
  contracts:
    type: file
    path: "contracts"
contracts = docetl.read_dir("contracts")
# one row per file:
# {
# "filename": "acme_msa.pdf",
# "path": "contracts/acme_msa.pdf",
# "text": "MASTER SERVICE AGREEMENT\nThis Agreement is entered into by...",
# }
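The row shape described above can be approximated with the standard library alone. The sketch below is not DocETL's reader: the helper name `dir_to_rows` is hypothetical, it handles only UTF-8 text (no PDF or Office conversion), and it treats any file that fails to decode as the "binary file with no extractor" case.

```python
import tempfile
from pathlib import Path


def dir_to_rows(root):
    """Walk `root` recursively and return one dict per readable text file."""
    root = Path(root)
    rows = []
    for file in sorted(root.rglob("*")):
        if not file.is_file():
            continue
        # Skip hidden files. Whether hidden directories are skipped too
        # is not specified in this section, so the sketch leaves them alone.
        if file.name.startswith("."):
            continue
        try:
            text = file.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Stand-in for a binary file with no extractor: warn and skip.
            print(f"warning: skipping undecodable file {file}")
            continue
        rows.append({"filename": file.name, "path": str(file), "text": text})
    return rows


# Demo on a throwaway directory with one text file, one hidden file,
# and one file that is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "contracts"
    (root / "notes").mkdir(parents=True)
    (root / "notes" / "renewal_2026.txt").write_text("Renew by March.", encoding="utf-8")
    (root / ".DS_Store").write_bytes(b"\x00\x01")
    (root / "logo.bin").write_bytes(b"\xff\xfe")
    demo_rows = dir_to_rows(root)
```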
An in-memory list (Python only)
docs = docetl.from_list([
{"speaker": "patient", "utterance": "The headaches started last month."},
{"speaker": "doctor", "utterance": "Any changes in vision?"},
])
Relative paths resolve against the current working directory (the directory you run the command from), not the location of the YAML file or Python script.
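This working-directory rule is easy to trip over, so here is a small sketch of the behavior using only `pathlib`. The helper `resolve_dataset_path` is hypothetical and only mirrors the rule stated above; it is not a DocETL function.

```python
import os
import tempfile
from pathlib import Path


def resolve_dataset_path(path):
    """Resolve a dataset path against the current working directory."""
    p = Path(path)
    return p if p.is_absolute() else Path.cwd() / p


# Demo: the same relative path points somewhere else after a chdir.
original_cwd = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    try:
        resolved = resolve_dataset_path("user_logs.json")
        expected_parent = Path.cwd()
    finally:
        os.chdir(original_cwd)  # restore before the temp dir is removed
```

If a script must work no matter where it is launched from, one common option is to build an absolute path from the script's own location, for example `Path(__file__).resolve().parent / "user_logs.json"`, and pass that to the reader.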
Frame methods
A Frame is lazy and immutable: each operation records a step and returns a new Frame, and nothing runs until an action.
- Readers create a Frame: docetl.read_json, read_csv, read_parquet, read_dir, from_list, and Frame.from_yaml.
- Operations return a new Frame: map, filter, reduce, resolve, equijoin, extract, split, gather, unnest, cluster, sample, parallel_map, and the code variants code_map, code_filter, and code_reduce.
- Actions run the pipeline: collect() (rows as a list of dicts), to_pandas() (a pandas DataFrame), show() (run on a few rows and print), count(), and write_json() / write_csv() / write_parquet().
- Inspection and export: schema() (output fields, computed without running), total_cost and token_usage (after a run), to_yaml() and to_python() (export the pipeline), and optimize() (run the MOAR optimizer).
See the Python API reference for full signatures.
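The "lazy and immutable" contract can be illustrated with a toy class. This is a sketch of the general pattern, not DocETL's Frame: `ToyFrame` is hypothetical, and its `map` and `filter` take plain Python callables, which is an assumption made for the demo and says nothing about the real method signatures.

```python
class ToyFrame:
    """A toy lazy, immutable frame; NOT DocETL's Frame implementation."""

    def __init__(self, rows, steps=()):
        self._rows = list(rows)
        self._steps = tuple(steps)  # recorded operations, never mutated

    def map(self, fn):
        # Record the step and return a NEW frame; nothing runs here.
        return ToyFrame(self._rows, self._steps + (("map", fn),))

    def filter(self, pred):
        # Same idea: record only, and leave this frame untouched.
        return ToyFrame(self._rows, self._steps + (("filter", pred),))

    def collect(self):
        # The action: replay every recorded step over the rows.
        rows = self._rows
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


calls = []


def shout(row):
    calls.append(row["id"])  # lets us observe when the function actually runs
    return {**row, "review": row["review"].upper()}


base = ToyFrame([{"id": 1, "review": "ok"}, {"id": 2, "review": "bad"}])
pipeline = base.map(shout).filter(lambda row: row["id"] == 2)
ran_before_collect = list(calls)  # snapshot taken before any action
result = pipeline.collect()
```

Because every operation returns a new object, `base` can be reused to build other pipelines without being affected by the steps chained onto `pipeline`.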
Parsing tools (non-standard inputs)
DocETL ships built-in parsing functions for file types beyond the above,
e.g., whisper_speech_to_text for audio, and you can register your own.
See Custom Parsing for the available
built-in tools and how to define custom ones.
To use one, point the dataset at a JSON file of paths and attach the parsing function:
datasets:
  audio_transcripts:
    type: file
    source: local
    path: "audio_files/audio_paths.json"
    parsing_tools:
      - input_key: audio_path
        function: whisper_speech_to_text
        output_key: transcript
import docetl
audio_transcripts = docetl.read_json(
"audio_files/audio_paths.json",
parsing=[
{
"input_key": "audio_path",
"function": "whisper_speech_to_text",
"output_key": "transcript",
}
],
)
- input_key: the key holding the path to the file to parse.
- function: the parsing function (built-in or custom).
- output_key: the key the parsed content is stored under, accessible in prompts as {{ input.transcript }}.
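How the three keys fit together can be sketched in a few lines of plain Python. Everything here is illustrative: `apply_parsing_tool` and `fake_speech_to_text` are hypothetical stand-ins, the file name is made up, and the sketch assumes one output value per item and that the original keys are kept, which may differ from how DocETL's parsing functions behave. See Custom Parsing for the actual contract.

```python
def apply_parsing_tool(item, input_key, function, output_key):
    """Return a copy of `item` with function(item[input_key]) under output_key."""
    parsed = function(item[input_key])
    return {**item, output_key: parsed}


def fake_speech_to_text(path):
    # Hypothetical stand-in for a real parser such as whisper_speech_to_text.
    return f"<transcript of {path}>"


# One entry of a paths file like audio_paths.json (file name is made up).
items = [{"audio_path": "audio_files/call_01.mp3"}]

parsed_items = [
    apply_parsing_tool(item, "audio_path", fake_speech_to_text, "transcript")
    for item in items
]
```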