# Split Operation
The Split operation in DocETL is designed to divide long text content into smaller, manageable chunks. This is particularly useful when dealing with large documents that exceed the token limit of language models or when the LLM's performance degrades with increasing input size for complex tasks.
## Motivation
Some common scenarios where the Split operation is valuable include:
- Processing long customer support transcripts to analyze specific sections
- Dividing extensive research papers or reports for detailed analysis
- Breaking down large legal documents to extract relevant clauses or sections
- Preparing long-form content for summarization or topic extraction
## 🚀 Operation Example: Splitting Customer Support Transcripts
Here's an example of using the Split operation to divide customer support transcripts into manageable chunks:
```yaml
- name: split_transcript
  type: split
  split_key: transcript
  method: token_count
  method_kwargs:
    num_tokens: 500
    model: gpt-4o-mini
```
This Split operation processes long customer support transcripts:
- Splits the `transcript` field into chunks of approximately 500 tokens each.
- Uses the gpt-4o-mini model's tokenizer for accurate token counting.
- Generates multiple output items for each input item, one for each chunk.

Note that chunks do not overlap in content.
## Configuration
### Required Parameters
- `type`: Must be set to `"split"`.
- `split_key`: The key of the field containing the text to split.
- `method`: The method to use for splitting. Options are `"delimiter"` and `"token_count"`.
- `method_kwargs`: A dictionary of keyword arguments for the splitting method.
    - For the `"delimiter"` method: `delimiter` (string) to use for splitting.
    - For the `"token_count"` method: `num_tokens` (integer) specifying the maximum number of tokens per chunk.
### Optional Parameters in `method_kwargs`
| Parameter | Description | Default |
| --- | --- | --- |
| `model` | The language model whose tokenizer to use | Falls back to `default_model` |
| `num_splits_to_group` | Number of splits to group together into one chunk (only for the `"delimiter"` method) | 1 |
| `sample` | Number of samples to use for the operation | None |
## Splitting Methods
### Token Count Method
The token count method splits the text into chunks based on a specified number of tokens. This is useful when you need to ensure that each chunk fits within the token limit of your language model, or when you know that smaller chunks lead to higher performance on your task.
### Delimiter Method
The delimiter method splits the text based on a specified delimiter string. This is particularly useful when you want to split your text at logical boundaries, such as paragraphs or sections.
#### Delimiter Method Example
If you set the `delimiter` to `"\n\n"` (double newline) and `num_splits_to_group` to 3, each chunk will contain 3 paragraphs:
```yaml
- name: split_by_paragraphs
  type: split
  split_key: document
  method: delimiter
  method_kwargs:
    delimiter: "\n\n"
    num_splits_to_group: 3
```
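To make the grouping concrete, here is a sketch of what this configuration would produce for a hypothetical seven-paragraph document. The paragraph text is invented for illustration, and it is an assumption that grouped splits are rejoined with the delimiter:

```yaml
# Hypothetical input field (illustrative only):
document: "P1\n\nP2\n\nP3\n\nP4\n\nP5\n\nP6\n\nP7"

# Expected chunk contents, grouping 3 splits per chunk
# (assuming grouped splits are rejoined with the delimiter):
#   chunk 1: "P1\n\nP2\n\nP3"
#   chunk 2: "P4\n\nP5\n\nP6"
#   chunk 3: "P7"   <- the final chunk holds whatever remains
```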
## Output
The Split operation generates multiple output items for each input item:
- All original key-value pairs from the input item.
- `{split_key}_chunk`: The content of the split chunk.
- `{op_name}_id`: A unique identifier for each original document.
- `{op_name}_chunk_num`: The sequential number of the chunk within its original document.
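For example, the `split_transcript` operation above would turn one input transcript into output items shaped roughly like this (the field values, the identifier format, and whether chunk numbering starts at 0 or 1 are illustrative assumptions):

```yaml
transcript: "<the original full transcript, carried over unchanged>"
transcript_chunk: "Customer: I've been waiting two weeks for my refund..."  # illustrative text
split_transcript_id: "doc-0"     # identifier format is an assumption
split_transcript_chunk_num: 1    # numbering start (0 vs. 1) is an assumption
```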
## Use Cases
- **Analyzing Customer Frustration**: Split long support transcripts, then use a map operation to identify frustration indicators in each chunk, followed by a reduce operation to summarize frustration points across the chunks (per transcript).
- **Document Summarization**: Split large documents, apply a map operation for section-wise summarization, then use a reduce operation to compile an overall summary.
- **Topic Extraction from Research Papers**: Divide research papers into sections, use a map operation to extract key topics from each section, then apply a reduce operation to synthesize main themes across the entire paper.
## 🚀 End-to-End Pipeline Example: Analyzing Customer Frustration
Let's walk through a complete example of using Split, Map, and Reduce operations to analyze customer frustration in support transcripts.
### Step 1: Split Operation
```yaml
- name: split_transcript
  type: split
  split_key: transcript
  method: token_count
  method_kwargs:
    num_tokens: 500
    model: gpt-4o-mini
```
### Step 2: Map Operation (Identify Frustration Indicators)
```yaml
- name: identify_frustration
  type: map
  input:
    - transcript_chunk
  prompt: |
    Analyze the following customer support transcript chunk for signs of customer frustration:

    {{ input.transcript_chunk }}

    Identify any indicators of frustration, such as:
    1. Use of negative language
    2. Repetition of issues
    3. Expressions of dissatisfaction
    4. Requests for escalation

    Provide a list of frustration indicators found, if any.
  output:
    schema:
      frustration_indicators: list[string]
```
### Step 3: Reduce Operation (Summarize Frustration Points)
```yaml
- name: summarize_frustration
  type: reduce
  reduce_key: split_transcript_id
  associative: false
  prompt: |
    Summarize the customer frustration points for this support transcript:

    {% for item in inputs %}
    Chunk {{ item.split_transcript_chunk_num }}:
    {% for indicator in item.frustration_indicators %}
    - {{ indicator }}
    {% endfor %}
    {% endfor %}

    Provide a concise summary of the main frustration points and their frequency or intensity across the entire transcript.
  output:
    schema:
      frustration_summary: string
      primary_issues: list[string]
      frustration_level: string # e.g., "low", "medium", "high"
```
### Non-Associative Reduce Operation
Note the `associative: false` parameter in the reduce operation. This is crucial when the order of the chunks matters for your analysis: it ensures that the reduce operation processes the chunks in the order they appear in the original transcript, which is often important for understanding the context and progression of customer frustration.
### Explanation
- The Split operation divides long transcripts into 500-token chunks.
- The Map operation analyzes each chunk for frustration indicators.
- The Reduce operation combines the frustration indicators from all chunks of a transcript, summarizing the overall frustration points, identifying primary issues, and assessing the overall frustration level. The `associative: false` setting ensures that the chunks are processed in their original order.
This pipeline allows for detailed analysis of customer frustration in long support transcripts, which would be challenging to process in a single pass due to token limitations or degraded LLM performance on very long inputs.
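To place these steps in context, here is a sketch of how the three operations might be wired together in a complete DocETL pipeline config. The dataset name, step name, and file paths are placeholders, and the exact top-level layout should be checked against the DocETL pipeline documentation:

```yaml
datasets:
  transcripts:                         # placeholder dataset name
    type: file
    path: support_transcripts.json     # placeholder path

default_model: gpt-4o-mini

operations:
  # ... the split_transcript, identify_frustration, and
  #     summarize_frustration definitions from Steps 1-3 ...

pipeline:
  steps:
    - name: frustration_analysis      # placeholder step name
      input: transcripts
      operations:
        - split_transcript
        - identify_frustration
        - summarize_frustration
  output:
    type: file
    path: frustration_summaries.json  # placeholder path
```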
## Best Practices
- **Choose the Right Splitting Method**: Use the token count method when working with models that have strict token limits. Use the delimiter method when you need to split at logical boundaries in your text.
- **Balance Chunk Size**: When using the token count method, choose a chunk size that balances context preservation against model performance. Smaller chunks may lose context, while larger chunks may degrade model performance. If you choose to use the DocETL optimizer, it can find the chunk size that works best for your task.
- **Consider Overlap**: In some cases, you might want to implement overlap between chunks to maintain context. This isn't built into the Split operation, but you can achieve it by post-processing the split chunks.
- **Use Appropriate Delimiters**: When using the delimiter method, choose a delimiter that logically divides your text, such as double newlines for paragraphs or custom markers for document sections. Adjust the `num_splits_to_group` parameter to create chunks that contain an appropriate amount of context for your task.
- **Mind the Order**: If the order of chunks matters for your analysis, always set `associative: false` in your subsequent reduce operations.
- **Optimize for Performance**: For very large documents, consider using a combination of the delimiter and token count methods: first split into large sections using delimiters, then apply token count splitting to ensure no chunk exceeds model limits (see the sketch below).
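As a sketch of that hybrid approach, one possible configuration chains two split operations: the first cuts the document at section markers, and the second re-splits any oversized section chunk by token count. The operation names and the section delimiter are placeholders; note that the second split's `split_key` targets the `document_chunk` field produced by the first split, following the `{split_key}_chunk` output naming described above:

```yaml
- name: split_by_section        # placeholder name
  type: split
  split_key: document
  method: delimiter
  method_kwargs:
    delimiter: "\n## "          # assumes sections start with markdown-style headers

- name: split_by_tokens         # placeholder name
  type: split
  split_key: document_chunk     # the chunk field emitted by split_by_section
  method: token_count
  method_kwargs:
    num_tokens: 500
```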
By leveraging the Split operation effectively, you can process large documents efficiently and extract meaningful insights using subsequent map and reduce operations.