# Reduce Operation
The Reduce operation in DocETL aggregates data based on a key. It supports both batch reduction and incremental folding for large datasets, making it versatile for various data processing tasks.
## Motivation
Reduce operations are essential when you need to summarize or aggregate data across multiple records. For example, you might want to:
- Analyze sentiment trends across social media posts
- Consolidate patient medical records from multiple visits
- Synthesize key findings from a set of research papers on a specific topic
## 🚀 Example: Summarizing Customer Feedback
Let's look at a practical example of using the Reduce operation to summarize customer feedback by department:
```yaml
- name: summarize_feedback
  type: reduce
  reduce_key: department
  prompt: |
    Summarize the customer feedback for the {{ inputs[0].department }} department:
    {% for item in inputs %}
    Feedback {{ loop.index }}: {{ item.feedback }}
    {% endfor %}
    Provide a concise summary of the main points and overall sentiment.
  output:
    schema:
      summary: string
      sentiment: string
```
This Reduce operation processes customer feedback grouped by department:
- Groups all feedback entries by the 'department' key.
- For each department, it applies the prompt to summarize the feedback and determine overall sentiment.
- Outputs a summary and sentiment for each department.
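To make this concrete, here is a small hypothetical input and a plausible output for one group (the actual LLM output will vary):

```yaml
# Hypothetical input items
- department: sales
  feedback: "The onboarding call was great, but invoicing is confusing."
- department: sales
  feedback: "Invoices are hard to read, though support resolved my issue quickly."
- department: engineering
  feedback: "The new API documentation is excellent."
---
# A plausible output item for the 'sales' group (LLM-generated, so wording will vary)
- department: sales
  summary: "Customers praise onboarding and support, but find invoicing confusing."
  sentiment: "mixed"
```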
## Configuration
### Required Parameters
`type`
: Must be set to `"reduce"`.

`reduce_key`
: The key (or list of keys) to use for grouping data. Use `_all` to group all data into one group.

`prompt`
: The prompt template to use for the reduction operation.

`output`
: Schema definition for the output from the LLM.
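Putting the required parameters together, a minimal reduce operation might look like this (names and prompt wording are illustrative):

```yaml
- name: minimal_reduce
  type: reduce
  reduce_key: department   # or a list, e.g. [department, region]; use _all for a single global group
  prompt: |
    Summarize the feedback below:
    {% for item in inputs %}
    - {{ item.feedback }}
    {% endfor %}
  output:
    schema:
      summary: string
```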
### Optional Parameters
| Parameter | Description | Default |
|---|---|---|
| `sample` | Number of samples to use for the operation | None |
| `synthesize_resolve` | If false, won't synthesize a resolve operation between map and reduce | true |
| `model` | The language model to use | Falls back to `default_model` |
| `input` | Specifies the schema or keys to subselect from each item | All keys from input items |
| `pass_through` | If true, non-input keys from the first item in the group will be passed through | false |
| `associative` | If true, the reduce operation is associative (i.e., order doesn't matter) | true |
| `fold_prompt` | A prompt template for incremental folding | None |
| `fold_batch_size` | Number of items to process in each fold operation | None |
| `value_sampling` | A dictionary specifying the sampling strategy for large groups | None |
| `verbose` | If true, enables detailed logging of the reduce operation | false |
| `persist_intermediates` | If true, persists the intermediate results for each group to the key `_{operation_name}_intermediates` | false |
| `timeout` | Timeout for each LLM call in seconds | 120 |
| `max_retries_per_timeout` | Maximum number of retries per timeout | 2 |
| `litellm_completion_kwargs` | Additional parameters to pass to LiteLLM completion calls | {} |
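As a sketch of how several of these optional parameters combine, the following configuration is illustrative (the model name shown is just an example):

```yaml
- name: tuned_reduce
  type: reduce
  reduce_key: department
  model: gpt-4o-mini            # example model name; falls back to default_model if omitted
  pass_through: true            # carry non-input keys from the first item in each group
  verbose: true                 # log details of each reduce call
  timeout: 180                  # seconds per LLM call
  max_retries_per_timeout: 3
  prompt: |
    Summarize the feedback for {{ inputs[0].department }}:
    {% for item in inputs %}
    - {{ item.feedback }}
    {% endfor %}
  output:
    schema:
      summary: string
```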
## Advanced Features
### Incremental Folding
For large datasets, the Reduce operation supports incremental folding. This allows processing of large groups in smaller batches, which can be more efficient and reduce memory usage.
To enable incremental folding, provide a `fold_prompt` and a `fold_batch_size`:
```yaml
- name: large_data_reduce
  type: reduce
  reduce_key: category
  prompt: |
    Summarize the data for category {{ inputs[0].category }}:
    {% for item in inputs %}
    Item {{ loop.index }}: {{ item.data }}
    {% endfor %}
  fold_prompt: |
    Combine the following summaries for category {{ inputs[0].category }}:
    Current summary: {{ output.summary }}
    New data:
    {% for item in inputs %}
    Item {{ loop.index }}: {{ item.data }}
    {% endfor %}
  fold_batch_size: 100
  output:
    schema:
      summary: string
```
### Example Rendered Prompt
Let's consider an example of how a reduce operation prompt might look when rendered with actual data. Assume we have a reduce operation that summarizes product reviews, and we're processing reviews for a product with ID "PROD123". Here's what the rendered prompt might look like:
```
Summarize the reviews for product PROD123:

Review 1: This laptop is amazing! The battery life is incredible, lasting me a full day of work without needing to charge. The display is crisp and vibrant, perfect for both work and entertainment. The only minor drawback is that it can get a bit warm during intensive tasks.

Review 2: I'm disappointed with this purchase. While the laptop looks sleek, its performance is subpar. It lags when running multiple applications, and the fan noise is quite noticeable. On the positive side, the keyboard is comfortable to type on.

Review 3: Decent laptop for the price. It handles basic tasks well, but struggles with more demanding software. The build quality is solid, and I appreciate the variety of ports. Battery life is average, lasting about 6 hours with regular use.

Review 4: Absolutely love this laptop! It's lightweight yet powerful, perfect for my needs as a student. The touchpad is responsive, and the speakers produce surprisingly good sound. My only wish is that it had a slightly larger screen.

Review 5: Mixed feelings about this product. The speed and performance are great for everyday use and light gaming. However, the webcam quality is poor, which is a letdown for video calls. The design is sleek, but the glossy finish attracts fingerprints easily.
```
This example shows how the prompt template is filled with actual review data for a specific product. The language model would then process this prompt to generate a summary of the reviews for the product.
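For reference, a prompt template along the following lines (assumed here, mirroring the examples above) would produce that rendering:

```yaml
prompt: |
  Summarize the reviews for product {{ inputs[0].product_id }}:
  {% for item in inputs %}
  Review {{ loop.index }}: {{ item.review }}
  {% endfor %}
```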
### Scratchpad Technique
When doing an incremental reduce, the task may require intermediate state that is not represented in the output. For example, if the task is to determine all features more than one person liked about the product, we need some intermediate state to keep track of the features that have been liked once, so if we see the same feature liked again, we can update the output. DocETL maintains an internal "scratchpad" to handle this.
#### How it works
- The process starts with an empty accumulator and an internal scratchpad.
- It sequentially folds in the data in batches, rather than one element at a time.
- The internal scratchpad tracks additional state necessary for accurately solving tasks incrementally. The LLM decides what to write to the scratchpad.
- During each internal LLM call, the current scratchpad state is used along with the accumulated output and new inputs.
- The LLM updates both the accumulated output and the internal scratchpad, which are used in the next fold operation.
The scratchpad technique is handled internally by DocETL, allowing users to define complex reduce operations without worrying about the complexities of state management across batches. Users provide their reduce and fold prompts focusing on the desired output, while DocETL uses the scratchpad technique behind the scenes to ensure accurate tracking of trends and efficient processing of large datasets.
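Purely as an illustration, for the "features liked by more than one person" task the scratchpad state after a fold might conceptually resemble the following; the actual representation is internal to DocETL, chosen by the LLM, and not user-visible:

```yaml
# Hypothetical scratchpad state after one fold (illustrative only)
features_seen_once:        # liked by exactly one reviewer so far
  - touchpad
  - speakers
features_liked_by_many:    # liked by more than one reviewer; reflected in the output
  - battery life
```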
### Value Sampling
For very large groups, you can use value sampling to process a representative subset of the data. This can significantly reduce processing time and costs.
The following table outlines the available value sampling methods:
| Method | Description |
|---|---|
| `random` | Randomly select a subset of values |
| `first_n` | Select the first N values |
| `cluster` | Use K-means clustering to select representative samples |
| `sem_sim` | Select samples based on semantic similarity to a query |
To enable value sampling, add a `value_sampling` configuration to your reduce operation, specifying the method, sample size, and any additional parameters required by the chosen method.
Value Sampling Configuration
```yaml
- name: sampled_reduce
  type: reduce
  reduce_key: product_id
  prompt: |
    Summarize the reviews for product {{ inputs[0].product_id }}:
    {% for item in inputs %}
    Review {{ loop.index }}: {{ item.review }}
    {% endfor %}
  value_sampling:
    enabled: true
    method: cluster
    sample_size: 50
  output:
    schema:
      summary: string
```
In this example, the Reduce operation will use K-means clustering to select a representative sample of 50 reviews for each product_id.
For semantic similarity sampling, you can use a query to select the most relevant samples. This is particularly useful when you want to focus on specific aspects of the data.
Semantic Similarity Sampling
```yaml
- name: sampled_reduce_sem_sim
  type: reduce
  reduce_key: product_id
  prompt: |
    Summarize the reviews for product {{ inputs[0].product_id }}, focusing on comments about battery life and performance:
    {% for item in inputs %}
    Review {{ loop.index }}: {{ item.review }}
    {% endfor %}
  value_sampling:
    enabled: true
    method: sem_sim
    sample_size: 30
    embedding_model: text-embedding-3-small
    embedding_keys:
      - review
    query_text: "Battery life and performance"
  output:
    schema:
      summary: string
```
In this example, the Reduce operation will use semantic similarity to select the 30 reviews most relevant to battery life and performance for each product_id. This allows you to focus the summarization on specific aspects of the product reviews.
## Lineage
The Reduce operation supports lineage, which allows you to track the original input data for each output. This can be useful for debugging and auditing. To enable lineage, add a `lineage` configuration to your reduce operation, specifying the keys to include in the lineage. For example:
```yaml
- name: summarize_reviews_by_category
  type: reduce
  reduce_key: category
  prompt: |
    Summarize the reviews for category {{ inputs[0].category }}:
    {% for item in inputs %}
    Review {{ loop.index }}: {{ item.review }}
    {% endfor %}
  output:
    schema:
      summary: string
    lineage:
      - product_id
```
The output will include a list of all product_ids for each category, saved under the key `summarize_reviews_by_category_lineage`.
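For instance, an output item might look like the following (values are illustrative, and the exact shape of the lineage entries may differ):

```yaml
- category: electronics
  summary: "Reviews are broadly positive, with recurring praise for battery life."
  summarize_reviews_by_category_lineage:
    - PROD123
    - PROD456
    - PROD789
```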
## Best Practices
- **Choose Appropriate Keys**: Select `reduce_key`(s) that logically group your data for the desired aggregation.
- **Design Effective Prompts**: Create prompts that clearly instruct the model on how to aggregate or summarize the grouped data.
- **Consider Data Size**: For large datasets, use incremental folding and value sampling to manage processing efficiently.
- **Optimize Your Pipeline**: Use `docetl build pipeline.yaml` to optimize your pipeline, which can introduce efficient merge operations and resolve steps if needed.
- **Balance Precision and Efficiency**: When dealing with very large groups, consider using value sampling to process a representative subset of the data.