Python API Reference
The Python API lets you build, run, and optimize DocETL pipelines with chainable operations.
import docetl
Deprecated: the typed Pipeline class
The older object API (from docetl.api import Pipeline with MapOp,
ReduceOp, etc.) is deprecated in favor of the Frame API documented on
this page.
Quick start
import docetl
docetl.default_model = "gpt-4o-mini"
results = (
    docetl.read_json("input.json")
    .map(
        prompt="Classify this document: {{ input.text }}",
        output={"schema": {"category": "string"}},
    )
    .filter(prompt="Is this document about technology? {{ input.text }}")
    .reduce(
        reduce_key="category",
        prompt="Summarize these documents: {% for item in inputs %}{{ item.text }}{% endfor %}",
        output={"schema": {"summary": "string"}},
    )
    .collect()
)
Configuration
Set global defaults as module-level attributes:
| Attribute | Type | Default | Description |
|---|---|---|---|
| docetl.default_model | str | None | Default LLM model for all operations |
| docetl.agent_model | str | None | Model for optimizer rewrites |
| docetl.max_threads | int | cpu_count * 4 | Concurrent threads |
| docetl.bypass_cache | bool | False | Skip LLM cache |
| docetl.intermediate_dir | str | None | Directory for intermediate results (a relative path resolves against the working directory at run time) |
| docetl.rate_limits | dict | None | Rate limits per model |
| docetl.fallback_models | list[str] | None | Fallback chain on failure |
| docetl.fallback_embedding_models | list[str] | None | Fallback embedding models |
| docetl.system_prompt | dict | None | {"dataset_description": ..., "persona": ...} applied to all operations |
Precedence. Settings layer from most to least specific: a per-operation
parameter (e.g. model= on .map()) beats per-pipeline settings carried by a
Frame (set when loading a YAML via Frame.from_yaml), which beat the
module-level docetl.* globals above, which beat built-in defaults. Loading a
YAML never changes the globals — its settings travel with that Frame only.
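The layering described above behaves like a chained lookup. The following is a minimal sketch of that idea using Python's standard `collections.ChainMap`; it is not DocETL's implementation, and the four dictionaries and the `resolve_setting` helper are hypothetical names chosen for illustration.

```python
from collections import ChainMap

# Hypothetical stand-ins for the four layers, most specific first.
operation_params = {"model": "gpt-4o"}                            # e.g. model= on .map()
frame_settings = {"max_threads": 8}                               # carried by a Frame loaded from YAML
module_globals = {"model": "gpt-4o-mini", "bypass_cache": True}   # docetl.* attributes
builtin_defaults = {"model": None, "max_threads": 16, "bypass_cache": False}


def resolve_setting(name):
    """Return the value from the most specific layer that defines `name`."""
    layers = ChainMap(operation_params, frame_settings, module_globals, builtin_defaults)
    return layers[name]


print(resolve_setting("model"))         # the per-operation value wins
print(resolve_setting("max_threads"))   # falls through to the Frame's settings
print(resolve_setting("bypass_cache"))  # falls through to the module-level global
```

In this sketch a layer only takes part in the lookup when it actually contains the key, which is one simple way to model "unset" values.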
Reading Data
| Function | Description |
|---|---|
| docetl.read_json(path) | Load from JSON file |
| docetl.read_csv(path) | Load from CSV file |
| docetl.read_parquet(path) | Load from Parquet file |
| docetl.read_dir(path) | One row per file in a directory (recursive): text is the file's content, plus filename and path. PDF/Word/PowerPoint/Excel are converted to text; other files read as UTF-8 |
| docetl.from_list(data) | Load from a list of dicts |
| docetl.Frame.from_yaml(path) | Load from a YAML pipeline config |
All return a Frame.
Frame
A lazy pipeline — operations are recorded but not executed until a terminal action is called. Frames are immutable; every operation returns a new Frame.
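To make "lazy" and "immutable" concrete, here is a toy sketch in plain Python. `ToyFrame` is a hypothetical stand-in written for this page, not `docetl.Frame`: it only shows that each method records an operation on a new object and that work happens when a terminal action is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical lazy, immutable pipeline (illustration only)."""
    rows: tuple
    ops: tuple = ()

    def map(self, fn: Callable[[dict], dict]) -> "ToyFrame":
        # Record the operation and return a new object; nothing runs yet.
        return ToyFrame(self.rows, self.ops + (("map", fn),))

    def filter(self, fn: Callable[[dict], bool]) -> "ToyFrame":
        return ToyFrame(self.rows, self.ops + (("filter", fn),))

    def collect(self) -> list:
        # Terminal action: replay the recorded operations in order.
        rows = [dict(r) for r in self.rows]
        for kind, fn in self.ops:
            if kind == "map":
                rows = [{**r, **fn(r)} for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


base = ToyFrame(rows=({"text": "gpu kernels"}, {"text": "sourdough"}))
tagged = base.map(lambda r: {"length": len(r["text"])})
long_only = tagged.filter(lambda r: r["length"] > 9)

print(len(base.ops), len(tagged.ops), len(long_only.ops))  # earlier frames are unchanged
print(long_only.collect())
```

Because `base` and `tagged` are never modified, either can be reused as the starting point for a different branch of the pipeline.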
LLM Operations
.map()
Applies an LLM prompt to each document independently.
frame.map(
    prompt="Classify: {{ input.text }}",
    output={"schema": {"category": "str"}},
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | — | Jinja2 template. Access fields via {{ input.key }}. |
| output | dict | — | Output schema, e.g. {"schema": {"field": "str"}} |
| model | str | None | Override default model |
| validate | list[str \| callable] | None | Validators: expression strings over output, or callables taking the output dict (callables can't be exported to YAML) |
| num_retries_on_validate_failure | int | None | Retries on validation failure |
| sample | int | None | Process only N documents |
| agent | docetl.Agent | None | Tool-equipped agent using the OpenAI Agents SDK |
| drop_keys | list[str] | None | Keys to remove from output |
| timeout | int | None | Timeout per LLM call (seconds) |
| max_batch_size | int | None | Batch size for batch processing |
| batch_prompt | str | None | Jinja2 template for batch mode |
| retriever | Retriever | None | Retriever for context augmentation |
| optimize | bool | None | Mark for optimization |
| limit | int | None | Max documents to process |
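Since `validate` accepts callables that take the output dict, validators can be ordinary Python functions. The sketch below assumes a validator signals success by returning a truthy value; the function names and the `ALLOWED_CATEGORIES` set are hypothetical, and the `frame.map(...)` call is shown only as a comment.

```python
ALLOWED_CATEGORIES = {"technology", "finance", "health"}  # hypothetical label set


def category_is_known(output: dict) -> bool:
    """Pass only when the model returned one of the expected labels."""
    return output.get("category") in ALLOWED_CATEGORIES


def category_is_lowercase(output: dict) -> bool:
    """Pass only when the label is a lowercase string."""
    value = output.get("category")
    return isinstance(value, str) and value == value.lower()


validators = [category_is_known, category_is_lowercase]

# Intended use, following the parameter table above (not executed here):
# frame.map(prompt=..., output={"schema": {"category": "str"}},
#           validate=validators, num_retries_on_validate_failure=2)

sample_output = {"category": "technology"}
print(all(check(sample_output) for check in validators))
```

Keeping validators as small named functions makes them easy to unit-test without running a pipeline. As the table notes, callables cannot be exported to YAML.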
.filter()
Keeps or removes documents based on an LLM prompt. Output schema must have one boolean field.
frame.filter(
    prompt="Is this about technology? {{ input.text }}",
    output={"schema": {"is_tech": "bool"}},
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | — | Jinja2 template |
| output | dict | — | Schema with one boolean field |
| model | str | None | Override default model |
| validate | list[str \| callable] | None | Validators: expression strings or callables over the output dict |
| agent | docetl.Agent | None | Tool-equipped agent using the OpenAI Agents SDK |
| retriever | Retriever | None | Retriever for context augmentation |
| cascade | dict | None | Model cascade: run a cheap proxy (chat or embedding model) on all items and escalate only uncertain ones, with a statistical guarantee. Also available on resolve and equijoin. |
frame.filter(
    prompt="Is this review about shipping problems? {{ input.text }}",
    output={"schema": {"keep": "bool"}},
    cascade={"proxy_model": "text-embedding-3-small", "guarantee": "recall",
             "target": 0.9, "label_budget": 120},
)
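The general idea behind a cascade can be sketched in a few lines of plain Python. This is an illustration of the control flow only, under loud assumptions: the thresholds are fixed by hand, whereas DocETL calibrates its cascade to meet the requested guarantee, and `cascade_filter`, the scores, and the sample reviews are all hypothetical.

```python
def cascade_filter(items, proxy_score, oracle, low=0.2, high=0.8):
    """Keep items using a cheap score, asking the expensive oracle only when unsure."""
    kept, escalated = [], 0
    for item in items:
        score = proxy_score(item)
        if score >= high:
            decision = True           # confident keep, no oracle call
        elif score <= low:
            decision = False          # confident drop, no oracle call
        else:
            decision = oracle(item)   # uncertain: escalate to the expensive model
            escalated += 1
        if decision:
            kept.append(item)
    return kept, escalated


# Hypothetical reviews with hand-assigned proxy scores and ground truth.
reviews = [
    {"text": "Package arrived two weeks late", "score": 0.93, "truth": True},
    {"text": "Great colour, fits well", "score": 0.05, "truth": False},
    {"text": "Box was dented but product fine", "score": 0.55, "truth": True},
    {"text": "Customer support was slow", "score": 0.40, "truth": False},
]

kept, escalated = cascade_filter(
    reviews,
    proxy_score=lambda r: r["score"],
    oracle=lambda r: r["truth"],
)
print([r["text"] for r in kept])
print(escalated)  # number of items that needed the expensive model
```

In this toy run only the two mid-scoring reviews reach the oracle, which is where the cost saving comes from.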
.reduce()
Groups documents by key and reduces each group with an LLM.
frame.reduce(
    reduce_key="category",
    prompt="Summarize: {% for item in inputs %}{{ item.text }}{% endfor %}",
    output={"schema": {"summary": "str"}},
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| reduce_key | str \| list[str] | — | Key(s) to group by. Use "_all" for one group. |
| prompt | str | — | Jinja2 template. Iterate with {% for item in inputs %}. |
| output | dict | — | Output schema |
| fold_prompt | str | None | Prompt for incremental folding |
| fold_batch_size | int | None | Items per fold iteration |
| merge_prompt | str | None | Prompt for merging fold results |
| pass_through | bool | None | Pass through non-reduced keys |
| associative | bool | None | Enable parallel reduction |
| agent | docetl.Agent | None | Tool-equipped agent using the OpenAI Agents SDK |
| retriever | Retriever | None | Retriever for context augmentation |
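Grouping by `reduce_key` and folding a group in batches can be pictured with the plain-Python sketch below. It shows control flow only: `group_by`, `fold_group`, and `fold_fn` are hypothetical helpers, and a string concatenation stands in for the LLM call that `fold_prompt` would drive.

```python
from collections import defaultdict


def group_by(rows, reduce_key):
    """Group rows by one key; "_all" puts every row in a single group."""
    groups = defaultdict(list)
    for row in rows:
        key = "_all" if reduce_key == "_all" else row[reduce_key]
        groups[key].append(row)
    return dict(groups)


def fold_group(items, fold_fn, batch_size, initial):
    """Fold a group batch by batch, carrying the running result forward."""
    state = initial
    for start in range(0, len(items), batch_size):
        state = fold_fn(state, items[start:start + batch_size])
    return state


rows = [
    {"category": "tech", "text": "a"},
    {"category": "tech", "text": "b"},
    {"category": "tech", "text": "c"},
    {"category": "food", "text": "d"},
]


# Stand-in for the LLM call: append each batch's texts to the running summary.
def fold_fn(summary, batch):
    return summary + "".join(item["text"] for item in batch)


summaries = {
    key: fold_group(items, fold_fn, batch_size=2, initial="")
    for key, items in group_by(rows, "category").items()
}
print(summaries)
print(list(group_by(rows, "_all")))
```

With `batch_size=2`, the three "tech" rows are folded in two steps, so no single step has to see the whole group at once.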
Tool-equipped map/filter/reduce
Use docetl.Agent when an operation should call tools over multiple turns
before returning DocETL's structured output. DocETL adapts Python functions into
OpenAI Agents SDK tools and routes the operation's LiteLLM model through the
SDK's LiteLLM integration.
import docetl

@docetl.tool
def lookup_sla(customer_tier: str) -> dict[str, str | int]:
    """Return support entitlements for a customer tier."""
    return {
        "enterprise": {"response_hours": 1, "escalation": "page-on-call"},
        "growth": {"response_hours": 4, "escalation": "queue-lead"},
        "free": {"response_hours": 48, "escalation": "self-serve"},
    }.get(customer_tier.lower(), {"response_hours": 24, "escalation": "standard"})

agent = docetl.Agent(tools=[lookup_sla], max_turns=5, max_tool_calls=3)
rows = (
    docetl.from_list([
        {
            "ticket": "Production API latency is above the SLO.",
            "customer_tier": "enterprise",
        }
    ])
    .map(
        prompt=(
            "Use lookup_sla to classify this ticket and explain the next action: "
            "{{ input.ticket }} / tier={{ input.customer_tier }}"
        ),
        output={"schema": {"priority": "str", "next_action": "str"}},
        model="azure/gpt-4o-mini",
        agent=agent,
    )
    .collect()
)
The model is still specified on the operation (model=) or through
docetl.default_model; use LiteLLM model names such as azure/gpt-4o-mini,
anthropic/..., or together_ai/.... The selected model/provider must work
with the OpenAI Agents SDK LiteLLM integration and the requested tools. Plain
Python function tools execute as trusted Python in your process; OpenAI Agents
SDK sandbox/native tools may provide isolation where that SDK/backend supports
it. Agent configs are Python-only and cannot be exported to YAML. See OpenAI's
Agents SDK guide and
tools guide for hosted
tool behavior.
Agents can also expose specialist agents as tools. Use this when a manager agent should keep ownership of the DocETL output while delegating a bounded task:
specialist = docetl.Agent(
    instructions="Extract numeric evidence from the supplied text.",
    max_turns=4,
)
manager = docetl.Agent(
    tools=[
        specialist.as_tool(
            name="extract_numeric_evidence",
            description="Extract numeric evidence for the manager agent.",
        )
    ],
    max_turns=6,
)
The specialist uses the same operation-level LiteLLM model as the manager unless you pass an SDK-native tool object with its own model configuration. This follows OpenAI's agents-as-tools orchestration pattern.
For a persistent shell/sandbox filesystem on OpenAI, create a hosted container once and bind shell tools to it:
sandbox = docetl.tools.Sandbox.create(
    name="docetl-research",
    network="disabled",
    memory_limit="1g",
)
bash = sandbox.bash()
agent = docetl.Agent(
    tools=[bash],
    max_turns=4,
    max_tool_calls=6,
)
Reuse bash across a manager and its specialist subagents when they should read
and write the same hosted filesystem. If you only need an ephemeral shell tool,
use docetl.tools.bash(...), which uses container_auto. Pass durable data
between DocETL operations through output schemas, not only through hidden
sandbox files.
docetl.tools.Sandbox.create(...) is OpenAI-specific: it calls the OpenAI
Containers API and returns shell tools using container_reference. It is not
portable to Claude, Together, or other LiteLLM providers. For provider-portable
tooling, use @docetl.tool Python functions, MCP tools, or provider-native SDK
tools. See OpenAI's
sandbox agents guide.
For an end-to-end example, see the Tool-Equipped Agents tutorial.
.resolve()
Deduplicates entities by pairwise LLM comparison.
frame.resolve(
    comparison_prompt="Same person? {{ input1.name }} vs {{ input2.name }}",
    resolution_prompt="Canonical name: {% for e in inputs %}{{ e.name }}{% endfor %}",
    output={"schema": {"name": "str"}},
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| comparison_prompt | str | — | Jinja2 template comparing {{ input1 }} and {{ input2 }} |
| resolution_prompt | str | None | Prompt for resolving matched groups |
| output | dict | None | Output schema |
| blocking_keys | list[str] | None | Keys for blocking |
| blocking_threshold | float | None | Similarity threshold |
| embedding_model | str | None | Model for blocking embeddings |
| optimize | bool | None | Mark for optimization |
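Blocking exists to avoid sending every pair of records to the LLM. A common way to do this, sketched below, is to compare only pairs whose embedding similarity reaches a threshold. This is an assumption about the general technique rather than a description of DocETL's internals; the 2-D vectors are made up, and `cosine` and `candidate_pairs` are hypothetical helpers.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def candidate_pairs(embeddings, threshold):
    """Return index pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    ]


# Hypothetical 2-D embeddings for three names.
names = ["Jon Smith", "John Smith", "Maria Garcia"]
embeddings = [(1.0, 0.1), (1.0, 0.2), (0.1, 1.0)]

pairs = candidate_pairs(embeddings, threshold=0.9)
print([(names[i], names[j]) for i, j in pairs])
```

Of the three possible pairs, only one survives the threshold here, so only that pair would need a pairwise comparison prompt.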
.extract()
Extracts information from documents with line-level precision.
frame.extract(
    prompt="Extract key findings from this paper.",
    document_keys=["content"],
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str | — | Extraction prompt |
| document_keys | list[str] | — | Keys containing document text |
| retriever | Retriever | None | Retriever for context augmentation |
.parallel_map()
Runs multiple prompts on each document in parallel.
frame.parallel_map(
    prompts=[{"prompt": "...", "output_keys": ["field1"]}, ...],
    output={"schema": {"field1": "str", "field2": "str"}},
)
.equijoin()
Joins two datasets by LLM comparison.
right = docetl.read_json("other.json")
frame.equijoin(right, comparison_prompt="Are these related? {{ left.x }} {{ right.y }}")
Structural Operations (no LLM calls)
| Method | Description |
|---|---|
| .split(split_key, method, method_kwargs) | Split documents into chunks |
| .gather(content_key, doc_id_key, order_key) | Add surrounding context to chunks |
| .unnest(unnest_key) | Flatten a list field into separate rows |
| .cluster(embedding_keys) | Cluster documents by embedding similarity |
| .sample(samples, method) | Sample a subset of documents |
Code Operations (no LLM calls)
code takes any callable, e.g., a lambda, or a string of Python source defining a transform function (the YAML form).
| Method | Description |
|---|---|
| .code_map(code=lambda doc: {...}) | Per-document Python transform |
| .code_filter(code=lambda doc: bool(...)) | Per-document Python filter (return bool) |
| .code_reduce(reduce_key, code=lambda items: {...}) | Per-group Python aggregation |
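Named functions work anywhere a lambda does and are easier to test. The sketch below defines one callable per code operation and exercises them directly, without DocETL. The function names are hypothetical, and merging the transform's result into the input document is an assumption made for this illustration.

```python
def add_word_count(doc: dict) -> dict:
    """Per-document transform: return the new field(s) for this document."""
    return {"word_count": len(doc["text"].split())}


def is_long(doc: dict) -> bool:
    """Per-document filter: keep documents with more than three words."""
    return doc["word_count"] > 3


def total_words(items: list) -> dict:
    """Per-group aggregation: one output dict per group."""
    return {"total_words": sum(item["word_count"] for item in items)}


# Intended use, following the table above (not executed here):
# frame.code_map(code=add_word_count)
#      .code_filter(code=is_long)
#      .code_reduce("category", code=total_words)

docs = [{"text": "a short note"}, {"text": "a somewhat longer note here"}]
# Assumption for this sketch: the transform's result is merged into the document.
enriched = [{**d, **add_word_count(d)} for d in docs]
kept = [d for d in enriched if is_long(d)]
print(total_words(kept))
```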
Retriever
Augment LLM operations with retrieved context from a LanceDB index.
retriever = docetl.Retriever(
    dataset="knowledge_base",
    index_dir="./lance_index",
    index_types=["fts", "embedding"],
    fts={"index_phrase": "{{ input.text }}", "query_phrase": "{{ input.question }}"},
    embedding={"model": "text-embedding-3-small", "index_phrase": "...", "query_phrase": "..."},
    query={"mode": "hybrid", "top_k": 5},
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | str \| list[dict] | — | File path or in-memory records to index (alternative to dataset) |
| dataset | str | — | Name of an existing pipeline dataset or step output to index |
| index_dir | str | — | Directory for the LanceDB index |
| index_types | list[str] | — | ["fts"], ["embedding"], or both |
| fts | dict | None | Full-text search config (index_phrase, query_phrase) |
| embedding | dict | None | Embedding config (model, index_phrase, query_phrase) |
| query | dict | None | Query config (mode, top_k) |
| build_index | str | "if_missing" | "if_missing", "always", or "never" |
Pass to any LLM operation via retriever=. Retrieved context is available as {{ retrieval_context }} in prompts.
Pass exactly one of data (a file path or list of dicts to index) or dataset (the name of an existing pipeline dataset — the frame's own input or a previous step's output, step_<operation_name>). See the Retrievers guide.
Inspection (no execution)
| Method | Return Type | Description |
|---|---|---|
| frame.schema() | dict[str, str] | Output schema from operation definitions, including structural ops (split/unnest/gather/extract). Best-effort: code_* op outputs can't be known statically |
| frame.count() | int | Input count (no ops) or output count (executes if ops present) |
| frame.to_yaml() | str | Export pipeline as YAML config |
| frame.to_yaml(path) | str | Also write YAML to file |
| frame.to_python() | str | Export as Python source code |
Terminal Actions
| Method | Return Type | Description |
|---|---|---|
| frame.show(max=5) | DataFrame | Run on a sample and print results. Works on bare datasets too. |
| frame.collect() | list[dict] | Execute the pipeline, return the result rows |
| frame.to_pandas() | DataFrame | Execute the pipeline, return a pandas DataFrame |
| frame.write_json(path) | None | Execute and write to JSON |
| frame.write_csv(path) | None | Execute and write to CSV |
| frame.write_parquet(path) | None | Execute and write to Parquet |
Terminal actions are memoized on the Frame: repeated calls with an unchanged configuration reuse the previous result instead of re-running the pipeline. Changing ops, in-memory data, or docetl.* settings invalidates the memo; edits to input files between calls are not detected.
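One way to picture this behavior is a cache keyed on a fingerprint of the configuration. The sketch below is a hypothetical illustration, not DocETL's code: `MemoizedRun` hashes only the settings it is given, which also shows why an edited input file at an unchanged path would go unnoticed.

```python
import hashlib
import json


class MemoizedRun:
    """Hypothetical memo keyed on configuration only (illustration)."""

    def __init__(self, run_fn):
        self._run_fn = run_fn
        self._results = {}
        self.executions = 0

    def collect(self, config: dict):
        # Fingerprint the configuration; file contents are never read here.
        key = hashlib.sha256(
            json.dumps(config, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key not in self._results:
            self.executions += 1
            self._results[key] = self._run_fn(config)
        return self._results[key]


runner = MemoizedRun(lambda cfg: [{"model": cfg["model"], "source": cfg["path"]}])

config = {"path": "input.json", "model": "gpt-4o-mini"}
runner.collect(config)
runner.collect(config)                         # same fingerprint: result is reused
runner.collect({**config, "model": "gpt-4o"})  # changed setting: pipeline re-runs
print(runner.executions)
```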
Cost & Token Tracking
rows = frame.collect()
print(f"Cost: ${frame.total_cost:.4f}")
print(f"Tokens: {frame.token_usage}")
df = frame.to_pandas()
print(f"Cost: ${df.attrs['_total_cost']:.4f}") # also on the DataFrame
Optimization
@docetl.register_eval
def evaluate(results):
    correct = sum(1 for r in results if r.get("correct"))
    return {"score": correct}
optimized = frame.optimize(
    eval_fn=evaluate,
    metric_key="score",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| eval_fn | Callable | required | Scores pipeline output. Returns a dict of metrics. |
| metric_key | str | required | Key in eval_fn's return dict to optimize |
| models | list[str] | Auto-detect | LiteLLM model names to explore |
| agent_model | str | Auto-select | Model for rewrite agent |
| max_iterations | int | 20 | Search budget |
| save_dir | str | Temp dir | Where to save results |
| exploration_weight | float | 1.414 | UCB exploration constant |
| dataset_path | str | Pipeline's dataset | Sample dataset for optimization |
| max_threads | int | None | Max concurrent LLM calls per run |
| max_concurrent_agents | int | 3 | Parallel MCTS search agents |
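For intuition about `exploration_weight`, the default of 1.414 is approximately the square root of 2, the constant in the textbook UCB1 rule. The sketch below implements that textbook formula only; whether DocETL's search uses exactly this form is an assumption, and `ucb_score` is a hypothetical helper.

```python
import math


def ucb_score(mean_reward, visits, parent_visits, exploration_weight=1.414):
    """Textbook UCB1-style score: exploitation term plus exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited candidates are tried first
    bonus = exploration_weight * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + bonus


well_explored = ucb_score(mean_reward=0.80, visits=50, parent_visits=60)
barely_explored = ucb_score(mean_reward=0.70, visits=2, parent_visits=60)
print(barely_explored > well_explored)
```

A larger weight makes rarely tried candidates look more attractive; a weight of 0 reduces the score to the observed mean.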
Returns an optimized Frame. Access search results via optimized.search_results:
| Method / Property | Return Type | Description |
|---|---|---|
| .best() | OptimizedPipeline | Highest-accuracy solution |
| .cheapest() | OptimizedPipeline | Lowest-cost solution |
| .frontier | list[OptimizedPipeline] | All Pareto-optimal solutions |
| .to_df() | DataFrame | All explored plans |
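"Pareto-optimal" here means that no other plan is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below computes such a frontier over made-up plans in plain Python; `pareto_frontier` and the plan dictionaries are hypothetical and do not reflect the structure of `OptimizedPipeline`.

```python
def pareto_frontier(plans):
    """Keep plans that no other plan beats on both cost and accuracy."""
    frontier = []
    for p in plans:
        dominated = any(
            q["cost"] <= p["cost"]
            and q["accuracy"] >= p["accuracy"]
            and (q["cost"] < p["cost"] or q["accuracy"] > p["accuracy"])
            for q in plans
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p["cost"])


# Hypothetical explored plans.
plans = [
    {"name": "baseline", "cost": 1.00, "accuracy": 0.70},
    {"name": "cheap", "cost": 0.20, "accuracy": 0.65},
    {"name": "wasteful", "cost": 1.50, "accuracy": 0.68},
    {"name": "accurate", "cost": 2.00, "accuracy": 0.90},
]

frontier = pareto_frontier(plans)
print([p["name"] for p in frontier])
print(max(frontier, key=lambda p: p["accuracy"])["name"])  # analogous to .best()
print(min(frontier, key=lambda p: p["cost"])["name"])      # analogous to .cheapest()
```

The "wasteful" plan is dropped because "baseline" is both cheaper and more accurate; the remaining plans each represent a different cost and accuracy trade-off.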