Reduce Operation
The Reduce operation aggregates data based on a key. It supports both batch reduction and incremental folding for large datasets. Examples: consolidating patient records from multiple visits, or synthesizing findings from a set of research papers.
flowchart LR
a1["doc (key=A)"] --> A["one output for A"]
a2["doc (key=A)"] --> A
b1["doc (key=B)"] --> B["one output for B"]
c1["..."] --> C["..."]
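As a rough illustration of the grouping shown in the diagram, the sketch below groups documents by a key in plain Python and produces one output per group. It is not DocETL's implementation: `group_by_key`, `reduce_groups`, and `count_feedback` are hypothetical names, and `count_feedback` stands in for the LLM call that a real reduce operation would make.

```python
from collections import defaultdict


def group_by_key(docs, reduce_key):
    """Group documents by a single key, a list of keys, or "_all"."""
    groups = defaultdict(list)
    for doc in docs:
        if reduce_key == "_all":
            key = ("_all",)  # every document lands in the same group
        elif isinstance(reduce_key, str):
            key = (doc[reduce_key],)
        else:
            key = tuple(doc[k] for k in reduce_key)
        groups[key].append(doc)
    return dict(groups)


def reduce_groups(docs, reduce_key, reducer):
    """Call the reducer once per group and collect one output per group."""
    return [reducer(items) for items in group_by_key(docs, reduce_key).values()]


def count_feedback(items):
    """Hypothetical stand-in for the LLM call: counts items in the group."""
    return {"department": items[0]["department"], "n_items": len(items)}


docs = [
    {"department": "sales", "feedback": "Fast replies"},
    {"department": "sales", "feedback": "Pricing unclear"},
    {"department": "support", "feedback": "Very helpful"},
]
results = reduce_groups(docs, "department", count_feedback)
```

Here `results` holds two records, one for each distinct department, regardless of how many documents each group contained.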
Example: Summarizing Customer Feedback
- name: summarize_feedback
  type: reduce
  reduce_key: department
  prompt: |
    Summarize the customer feedback for the {{ inputs[0].department }} department:
    {% for item in inputs %}
    Feedback {{ loop.index }}: {{ item.feedback }}
    {% endfor %}
    Provide a concise summary of the main points and overall sentiment.
  output:
    schema:
      summary: string
      sentiment: string
import docetl

docetl.default_model = "gpt-4o-mini"

frame = docetl.read_json("feedback.json")
frame = frame.reduce(
    reduce_key="department",
    prompt="""Summarize the customer feedback for the {{ inputs[0].department }} department:
{% for item in inputs %}
Feedback {{ loop.index }}: {{ item.feedback }}
{% endfor %}
Provide a concise summary of the main points and overall sentiment.""",
    output={
        "schema": {
            "summary": "string",
            "sentiment": "string",
        }
    },
)
rows = frame.collect()
Configuration
Required Parameters
- type: Must be set to "reduce".
- reduce_key: The key (or list of keys) to use for grouping data. Use _all to group all data into one group.
- prompt: The prompt template to use for the reduction operation.
- output: Schema definition for the output from the LLM.
Optional Parameters
| Parameter | Description | Default |
|---|---|---|
| sample | Number of samples to use for the operation | None |
| limit | Maximum number of groups to process before stopping | All groups |
| synthesize_resolve | If false, won't synthesize a resolve operation between map and reduce | true |
| model | The language model to use | Falls back to default_model |
| input | Specifies the schema or keys to subselect from each item | All keys from input items |
| pass_through | If true, non-input keys from the first item in the group will be passed through | false |
| associative | If true, the reduce operation is associative (i.e., order doesn't matter) | true |
| fold_prompt | A prompt template for incremental folding | None |
| fold_batch_size | Number of items to process in each fold operation | None |
| value_sampling | A dictionary specifying the sampling strategy for large groups | None |
| verbose | If true, enables detailed logging of the reduce operation | false |
| persist_intermediates | If true, persists the intermediate results for each group to the key _{operation_name}_intermediates | false |
| timeout | Timeout for each LLM call in seconds | 120 |
| max_retries_per_timeout | Maximum number of retries per timeout | 2 |
| litellm_completion_kwargs | Additional parameters to pass to LiteLLM completion calls. | {} |
| agent | Python-only docetl.Agent config for tool-equipped reduce agents. | None |
| bypass_cache | If true, bypass the cache for this operation. | False |
| retriever | Name of a retriever to use for RAG. See Retrievers. | None |
| save_retriever_output | If true, saves the retrieved context to _<operation_name>_retrieved_context in the output. | False |
Limiting group processing
Set limit to stop after N groups:
- Groups are sorted by size (smallest first) and only the N smallest groups are processed; the rest are never scheduled, so you avoid extra fold/merge calls.
- If a grouped reduce returns more than one record per group, the final output list is truncated to limit.
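The selection rule described above can be sketched in a few lines of plain Python. This is only an illustration of "sort by size, keep the N smallest"; `select_groups` is a hypothetical helper, not a DocETL function, and the group contents are made up.

```python
def select_groups(groups, limit=None):
    """Keep only the `limit` smallest groups (hypothetical helper).

    `groups` maps a group key to the list of items in that group.
    With limit=None, every group is kept.
    """
    if limit is None:
        return dict(groups)
    # Sort groups by size, smallest first, and keep the first `limit` of them
    ordered = sorted(groups.items(), key=lambda pair: len(pair[1]))
    return dict(ordered[:limit])


groups = {
    "billing": ["t1", "t2", "t3"],
    "shipping": ["t4"],
    "returns": ["t5", "t6"],
}
selected = select_groups(groups, limit=2)
```

With `limit=2`, only the `shipping` and `returns` groups would be scheduled in this sketch; `billing`, the largest group, is skipped entirely.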
Tool-equipped reduce agents
In Python, reduce supports agent=docetl.Agent(...) when each group needs
tools before producing the final aggregate. The operation-level model= remains
the model used for the reduce call.
import docetl


@docetl.tool
def get_region_target(region: str) -> dict[str, str | int]:
    """Return sales target context for a region."""
    return {
        "na": {"pipeline_target": 2_500_000, "focus": "enterprise expansion"},
        "emea": {"pipeline_target": 1_800_000, "focus": "regulated industries"},
        "apac": {"pipeline_target": 1_200_000, "focus": "partner-sourced deals"},
    }[region.lower()]


agent = docetl.Agent(tools=[get_region_target], max_turns=5, max_tool_calls=3)

frame = frame.reduce(
    reduce_key="region",
    prompt=(
        "Use get_region_target for {{ inputs[0].region }}, then summarize the "
        "opportunities in this group and compare them with the target: {{ inputs }}"
    ),
    output={
        "schema": {
            "region_summary": "str",
            "target_gap": "str",
            "recommended_actions": "list[str]",
        }
    },
    model="azure/gpt-4o-mini",
    agent=agent,
)
Agent configs are Python-only and cannot be exported to YAML. Reduce with
agent= cannot currently be combined with gleaning.
See the Python API reference for the full API and the Tool-Equipped Agents tutorial for a map/reduce example with web search, hosted shell, and specialist subagents.
Advanced Features
Incremental Folding
Incremental folding processes large groups in smaller batches. To enable it, provide a fold_prompt and fold_batch_size:
- name: large_data_reduce
  type: reduce
  reduce_key: category
  prompt: |
    Summarize the data for category {{ inputs[0].category }}:
    {% for item in inputs %}
    Item {{ loop.index }}: {{ item.data }}
    {% endfor %}
  fold_prompt: |
    Combine the following summaries for category {{ inputs[0].category }}:
    Current summary: {{ output.summary }}
    New data:
    {% for item in inputs %}
    Item {{ loop.index }}: {{ item.data }}
    {% endfor %}
  fold_batch_size: 100
  output:
    schema:
      summary: string
frame = frame.reduce(
    name="large_data_reduce",
    reduce_key="category",
    prompt="""Summarize the data for category {{ inputs[0].category }}:
{% for item in inputs %}
Item {{ loop.index }}: {{ item.data }}
{% endfor %}""",
    fold_prompt="""Combine the following summaries for category {{ inputs[0].category }}:
Current summary: {{ output.summary }}
New data:
{% for item in inputs %}
Item {{ loop.index }}: {{ item.data }}
{% endfor %}""",
    fold_batch_size=100,
    output={"schema": {"summary": "string"}},
)
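To make the control flow of folding concrete, here is a minimal sketch in plain Python. It assumes that the first batch is handled by the regular reduce prompt and every later batch by the fold prompt; that split is an assumption for illustration, not a statement about DocETL's internals. `fold_reduce`, `reduce_fn`, and `fold_fn` are hypothetical names, and the two functions are arithmetic stand-ins for LLM calls.

```python
def fold_reduce(items, fold_batch_size, reduce_fn, fold_fn):
    """Process one group in batches, carrying the output forward (sketch)."""
    first = items[:fold_batch_size]
    rest = items[fold_batch_size:]
    # Assumption: the first batch goes through the regular reduce prompt
    output = reduce_fn(first)
    # Each later batch is folded into the accumulated output
    for start in range(0, len(rest), fold_batch_size):
        batch = rest[start:start + fold_batch_size]
        output = fold_fn(output, batch)
    return output


def reduce_fn(batch):
    """Stand-in for the reduce prompt: summarizes a batch from scratch."""
    return {"total": sum(batch), "calls": 1}


def fold_fn(output, batch):
    """Stand-in for the fold prompt: updates the current output with new data."""
    return {"total": output["total"] + sum(batch), "calls": output["calls"] + 1}


result = fold_reduce(list(range(1, 11)), 4, reduce_fn, fold_fn)
```

Ten items with a batch size of 4 lead to three calls in this sketch (batches of 4, 4, and 2), each seeing at most `fold_batch_size` new items instead of the whole group.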
Example Rendered Prompt
Rendered Reduce Prompt
A reduce prompt that summarizes product reviews, rendered for product "PROD123":
Summarize the reviews for product PROD123:
Review 1: This laptop is amazing! The battery life is incredible, lasting me a full day of work without needing to charge. The display is crisp and vibrant, perfect for both work and entertainment. The only minor drawback is that it can get a bit warm during intensive tasks.
Review 2: I'm disappointed with this purchase. While the laptop looks sleek, its performance is subpar. It lags when running multiple applications, and the fan noise is quite noticeable. On the positive side, the keyboard is comfortable to type on.
Review 3: Decent laptop for the price. It handles basic tasks well, but struggles with more demanding software. The build quality is solid, and I appreciate the variety of ports. Battery life is average, lasting about 6 hours with regular use.
Review 4: Absolutely love this laptop! It's lightweight yet powerful, perfect for my needs as a student. The touchpad is responsive, and the speakers produce surprisingly good sound. My only wish is that it had a slightly larger screen.
Review 5: Mixed feelings about this product. The speed and performance are great for everyday use and light gaming. However, the webcam quality is poor, which is a letdown for video calls. The design is sleek, but the glossy finish attracts fingerprints easily.
Scratchpad Technique
An incremental reduce may require intermediate state not represented in the output (e.g., to find all features liked by more than one person, you must track features liked once so far). DocETL maintains an internal "scratchpad" for this; users only write reduce and fold prompts.
How it works:
- The process starts with an empty accumulator and an internal scratchpad.
- Each fold's LLM call receives the current scratchpad state, the accumulated output, and the new batch of inputs.
- The LLM updates both the accumulated output and the scratchpad (deciding what to write), and both are used in the next fold.
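The steps above can be sketched with the "features liked by more than one person" example. In this illustration a deterministic function, `fold_step`, plays the role of the LLM call; the names `fold_step`, `fold_with_scratchpad`, `popular_features`, and `liked_once` are hypothetical and do not reflect DocETL's internal data structures.

```python
def fold_step(output, scratchpad, batch):
    """Stand-in for one fold LLM call.

    Receives the accumulated output, the scratchpad, and a new batch,
    and returns updated versions of the output and the scratchpad.
    """
    popular = set(output["popular_features"])
    liked_once = set(scratchpad["liked_once"])
    for review in batch:
        for feature in review["liked"]:
            if feature in popular:
                continue
            if feature in liked_once:
                # Second sighting: promote from scratchpad to output
                liked_once.remove(feature)
                popular.add(feature)
            else:
                # First sighting: remember it, but do not report it yet
                liked_once.add(feature)
    return (
        {"popular_features": sorted(popular)},
        {"liked_once": sorted(liked_once)},
    )


def fold_with_scratchpad(items, batch_size):
    """Start from an empty accumulator and scratchpad, then fold batch by batch."""
    output = {"popular_features": []}
    scratchpad = {"liked_once": []}
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        output, scratchpad = fold_step(output, scratchpad, batch)
    return output, scratchpad


reviews = [
    {"liked": ["battery", "screen"]},
    {"liked": ["keyboard"]},
    {"liked": ["battery"]},
    {"liked": ["speakers", "keyboard"]},
]
output, scratchpad = fold_with_scratchpad(reviews, batch_size=2)
```

After the first batch nothing qualifies for the output, yet the scratchpad already remembers `battery`, `screen`, and `keyboard`. Without that state, the second batch could not recognize `battery` and `keyboard` as repeat mentions.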
Value Sampling
For very large groups, value sampling processes a representative subset of the data. Available methods:
| Method | Description |
|---|---|
| random | Randomly select a subset of values |
| first_n | Select the first N values |
| cluster | Use K-means clustering to select representative samples |
| sem_sim | Select samples based on semantic similarity to a query |
To enable value sampling, add a value_sampling configuration specifying the method, sample size, and any parameters the method requires.
Value Sampling Configuration
- name: sampled_reduce
  type: reduce
  reduce_key: product_id
  prompt: |
    Summarize the reviews for product {{ inputs[0].product_id }}:
    {% for item in inputs %}
    Review {{ loop.index }}: {{ item.review }}
    {% endfor %}
  value_sampling:
    enabled: true
    method: cluster
    sample_size: 50
  output:
    schema:
      summary: string
frame = frame.reduce(
    name="sampled_reduce",
    reduce_key="product_id",
    prompt="""Summarize the reviews for product {{ inputs[0].product_id }}:
{% for item in inputs %}
Review {{ loop.index }}: {{ item.review }}
{% endfor %}""",
    value_sampling={
        "enabled": True,
        "method": "cluster",
        "sample_size": 50,
    },
    output={"schema": {"summary": "string"}},
)
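For intuition about the two simplest methods, random and first_n, the following plain-Python sketch shows what "select a subset of a group's values" amounts to. `sample_values` and its `seed` parameter are hypothetical and exist only for this illustration; the cluster method is omitted because it needs embeddings and K-means.

```python
import random


def sample_values(items, method, sample_size, seed=0):
    """Pick at most `sample_size` items from a group (hypothetical helper)."""
    if len(items) <= sample_size:
        return list(items)  # group is already small enough
    if method == "first_n":
        return list(items[:sample_size])
    if method == "random":
        # A seeded generator keeps this example reproducible
        return random.Random(seed).sample(list(items), sample_size)
    raise ValueError(f"unsupported method in this sketch: {method}")


reviews = [f"review {i}" for i in range(1, 11)]
head = sample_values(reviews, "first_n", 3)
subset = sample_values(reviews, "random", 3)
```

Only the sampled items would then be rendered into the reduce prompt, which bounds prompt size no matter how large the group is.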
For semantic similarity sampling, a query selects the samples most relevant to specific aspects of the data.
Semantic Similarity Sampling
- name: sampled_reduce_sem_sim
  type: reduce
  reduce_key: product_id
  prompt: |
    Summarize the reviews for product {{ inputs[0].product_id }}, focusing on comments about battery life and performance:
    {% for item in inputs %}
    Review {{ loop.index }}: {{ item.review }}
    {% endfor %}
  value_sampling:
    enabled: true
    method: sem_sim
    sample_size: 30
    embedding_model: text-embedding-3-small
    embedding_keys:
      - review
    query_text: "Battery life and performance"
  output:
    schema:
      summary: string
frame = frame.reduce(
    name="sampled_reduce_sem_sim",
    reduce_key="product_id",
    prompt="""Summarize the reviews for product {{ inputs[0].product_id }}, focusing on comments about battery life and performance:
{% for item in inputs %}
Review {{ loop.index }}: {{ item.review }}
{% endfor %}""",
    value_sampling={
        "enabled": True,
        "method": "sem_sim",
        "sample_size": 30,
        "embedding_model": "text-embedding-3-small",
        "embedding_keys": ["review"],
        "query_text": "Battery life and performance",
    },
    output={"schema": {"summary": "string"}},
)
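Conceptually, this kind of sampling ranks items by how close their embeddings are to the embedding of the query and keeps the top ones. The sketch below illustrates that idea with cosine similarity and hand-made two-dimensional vectors. The vectors are invented for the example, not produced by an embedding model, and `cosine` and `top_k_similar` are hypothetical helpers. Whether DocETL uses cosine similarity specifically is an assumption here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(items, embeddings, query_embedding, sample_size):
    """Return the `sample_size` items whose embeddings best match the query."""
    scored = sorted(
        zip(items, embeddings),
        key=lambda pair: cosine(pair[1], query_embedding),
        reverse=True,
    )
    return [item for item, _ in scored[:sample_size]]


reviews = ["battery lasts all day", "nice keyboard", "fast performance"]
# Toy vectors: first axis ~ "battery/performance", second axis ~ "other topics"
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
query_embedding = [1.0, 0.0]

selected = top_k_similar(reviews, embeddings, query_embedding, sample_size=2)
```

In this toy setup the keyboard review is dropped because its vector is orthogonal to the query, while the two reviews aligned with the query are kept.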
Lineage
Lineage tracks the original input data for each output, useful for debugging and auditing. To enable it, add a lineage list to the output config specifying the keys to include:
- name: summarize_reviews_by_category
  type: reduce
  reduce_key: category
  prompt: |
    Summarize the reviews for category {{ inputs[0].category }}:
    {% for item in inputs %}
    Review {{ loop.index }}: {{ item.review }}
    {% endfor %}
  output:
    schema:
      summary: string
    lineage:
      - product_id
frame = frame.reduce(
    name="summarize_reviews_by_category",
    reduce_key="category",
    prompt="""Summarize the reviews for category {{ inputs[0].category }}:
{% for item in inputs %}
Review {{ loop.index }}: {{ item.review }}
{% endfor %}""",
    output={
        "schema": {"summary": "string"},
        "lineage": ["product_id"],
    },
)
This output will include a list of all product_ids for each category in the lineage, saved under the key summarize_reviews_by_category_lineage.
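A sketch of what attaching lineage could look like in plain Python follows. The key name follows the {operation_name}_lineage pattern described above; the shape of each lineage entry (one small dictionary per input item) is an assumption made for this illustration, and `attach_lineage` is a hypothetical helper rather than part of DocETL.

```python
def attach_lineage(output, group_items, lineage_keys, operation_name):
    """Copy the requested keys from every input item into the group's output."""
    result = dict(output)
    # Assumed entry shape: one dict per input item, restricted to lineage_keys
    result[f"{operation_name}_lineage"] = [
        {key: item.get(key) for key in lineage_keys} for item in group_items
    ]
    return result


group_items = [
    {"category": "laptops", "product_id": "P1", "review": "Great battery"},
    {"category": "laptops", "product_id": "P2", "review": "Runs hot"},
]
record = attach_lineage(
    {"category": "laptops", "summary": "Mixed reviews"},
    group_items,
    lineage_keys=["product_id"],
    operation_name="summarize_reviews_by_category",
)
```

When a summary looks wrong, the lineage entries let you trace it back to the specific inputs that fed the group.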
Best Practices
- Consider Data Size: For large datasets, use incremental folding; for very large groups, use value sampling.
- Optimize Your Pipeline: Use docetl build pipeline.yaml to optimize your pipeline, which can introduce efficient merge operations and resolve steps if needed.