Split Operation
The Split operation divides long text content into smaller chunks. Use it when documents exceed token limits or when LLM accuracy degrades on long inputs.
flowchart LR
d["long doc"] --> c1["chunk 1"]
d --> c2["chunk 2"]
d --> c3["chunk 3"]
d --> cn["..."]
Operation Example: Splitting Customer Support Transcripts
- name: split_transcript
  type: split
  split_key: transcript
  method: token_count
  method_kwargs:
    num_tokens: 500
    model: gpt-4o-mini
import docetl

docetl.default_model = "gpt-4o-mini"
frame = docetl.read_json("transcripts.json")
frame = frame.split(
    split_key="transcript",
    method="token_count",
    method_kwargs={"num_tokens": 500, "model": "gpt-4o-mini"},
)
rows = frame.collect()
This splits the 'transcript' field into chunks of approximately 500 tokens (counted with the gpt-4o-mini tokenizer), producing one output item per chunk. Chunks do not overlap in content.
Configuration
Required Parameters
- `type`: Must be set to "split".
- `split_key`: The key of the field containing the text to split.
- `method`: The method to use for splitting. Options are "delimiter" and "token_count".
- `method_kwargs`: A dictionary of keyword arguments for the splitting method.
  - For the "delimiter" method: `delimiter` (string) to use for splitting.
  - For the "token_count" method: `num_tokens` (integer) specifying the maximum number of tokens per chunk.
Optional Parameters in `method_kwargs`
| Parameter | Description | Default |
|---|---|---|
| `model` | The language model's tokenizer to use | Falls back to `default_model` |
| `num_splits_to_group` | Number of splits to group together into one chunk (only for "delimiter" method) | 1 |
| `sample` | Number of samples to use for the operation | None |
Splitting Methods
Token Count Method
Splits the text into chunks of a specified number of tokens. Use it to keep each chunk within a model's token limit, or when smaller chunks give better accuracy.
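To make the chunking behavior concrete, here is a minimal sketch of token-count splitting. It is not DocETL's implementation: the function name `split_by_token_count` is hypothetical, and whitespace-separated words stand in for model tokens, whereas the real operation counts tokens with the tokenizer of the configured model.

```python
from typing import List


def split_by_token_count(text: str, num_tokens: int) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of at most num_tokens tokens.

    Assumption: whitespace-separated words approximate tokens. A real tokenizer
    would produce different boundaries.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    tokens = text.split()
    # Step through the token list in strides of num_tokens.
    return [
        " ".join(tokens[i : i + num_tokens])
        for i in range(0, len(tokens), num_tokens)
    ]


chunks = split_by_token_count("a b c d e f g", num_tokens=3)
print(chunks)  # ['a b c', 'd e f', 'g']
```

Only the last chunk can be shorter than `num_tokens`, which is why the parameter is best read as a maximum rather than an exact size.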
Delimiter Method
Splits the text on a specified delimiter string. Use it to split at logical boundaries such as paragraphs or sections.
Delimiter Method Example
If you set the delimiter to "\n\n" (double newline) and num_splits_to_group to 3, each chunk will contain 3 paragraphs.
- name: split_by_paragraphs
  type: split
  split_key: document
  method: delimiter
  method_kwargs:
    delimiter: "\n\n"
    num_splits_to_group: 3
frame = frame.split(
    name="split_by_paragraphs",
    split_key="document",
    method="delimiter",
    method_kwargs={"delimiter": "\n\n", "num_splits_to_group": 3},
)
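The grouping behavior can be sketched in plain Python. This is an illustration only, not DocETL's code: `split_by_delimiter` is a hypothetical helper, and dropping empty pieces and rejoining grouped pieces with the same delimiter are assumptions made for the example.

```python
from typing import List


def split_by_delimiter(
    text: str, delimiter: str, num_splits_to_group: int = 1
) -> List[str]:
    """Split text on delimiter, then merge every num_splits_to_group pieces into one chunk."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    if num_splits_to_group <= 0:
        raise ValueError("num_splits_to_group must be positive")
    # Assumption: pieces that are empty or whitespace-only are discarded.
    pieces = [p for p in text.split(delimiter) if p.strip()]
    # Assumption: grouped pieces are rejoined with the original delimiter.
    return [
        delimiter.join(pieces[i : i + num_splits_to_group])
        for i in range(0, len(pieces), num_splits_to_group)
    ]


doc = "Intro.\n\nDetails.\n\nMore.\n\nEnd."
chunks = split_by_delimiter(doc, "\n\n", num_splits_to_group=3)
print(len(chunks))  # 2
```

With four paragraphs and a group size of 3, the first chunk holds three paragraphs and the second holds the remaining one.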
Output
The Split operation generates multiple output items for each input item. Each output item contains:
- All original key-value pairs from the input item.
- `{split_key}_chunk`: The content of the split chunk.
- `{op_name}_id`: A unique identifier for each original document.
- `{op_name}_chunk_num`: The sequential number of the chunk within its original document.
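The shape of the output items can be illustrated with a short sketch. The helper `build_split_outputs` is hypothetical; the UUID-based identifier and the 1-based chunk numbering are assumptions for illustration, not confirmed details of DocETL.

```python
import uuid
from typing import Any, Dict, List


def build_split_outputs(
    item: Dict[str, Any], chunks: List[str], split_key: str, op_name: str
) -> List[Dict[str, Any]]:
    """Build one output item per chunk, carrying over all original fields."""
    # Assumption: one random ID per original document, shared by all its chunks.
    doc_id = str(uuid.uuid4())
    return [
        {
            **item,  # all original key-value pairs
            f"{split_key}_chunk": chunk,
            f"{op_name}_id": doc_id,
            # Assumption: chunk numbering starts at 1.
            f"{op_name}_chunk_num": chunk_num,
        }
        for chunk_num, chunk in enumerate(chunks, start=1)
    ]


item = {"transcript": "hello there general", "customer": "c-1"}
outputs = build_split_outputs(
    item, ["hello there", "general"], "transcript", "split_transcript"
)
print(sorted(outputs[0].keys()))
```

For an operation named `split_transcript` with `split_key: transcript`, each item gains the keys `transcript_chunk`, `split_transcript_id`, and `split_transcript_chunk_num`, which are the names used by the downstream map and reduce steps.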
Use Cases
Split is typically followed by a map over each chunk and a reduce per original document:
- Analyzing Customer Frustration: split transcripts, map to identify frustration indicators per chunk, reduce to summarize per transcript.
- Document Summarization: split, map for section-wise summaries, reduce to compile an overall summary.
- Topic Extraction: split papers into sections, map to extract topics, reduce to synthesize main themes.
End-to-End Pipeline Example: Analyzing Customer Frustration
Step 1: Split Operation
- name: split_transcript
  type: split
  split_key: transcript
  method: token_count
  method_kwargs:
    num_tokens: 500
    model: gpt-4o-mini
import docetl

docetl.default_model = "gpt-4o-mini"
pipeline = docetl.read_json("transcripts.json")
pipeline = pipeline.split(
    name="split_transcript",
    split_key="transcript",
    method="token_count",
    method_kwargs={"num_tokens": 500, "model": "gpt-4o-mini"},
)
Step 2: Map Operation (Identify Frustration Indicators)
- name: identify_frustration
  type: map
  input:
    - transcript_chunk
  prompt: |
    Analyze the following customer support transcript chunk for signs of customer frustration:
    {{ input.transcript_chunk }}
    Identify any indicators of frustration, such as:
    1. Use of negative language
    2. Repetition of issues
    3. Expressions of dissatisfaction
    4. Requests for escalation
    Provide a list of frustration indicators found, if any.
  output:
    schema:
      frustration_indicators: list[string]
pipeline = pipeline.map(
    name="identify_frustration",
    input=["transcript_chunk"],
    prompt="""Analyze the following customer support transcript chunk for signs of customer frustration:
{{ input.transcript_chunk }}
Identify any indicators of frustration, such as:
1. Use of negative language
2. Repetition of issues
3. Expressions of dissatisfaction
4. Requests for escalation
Provide a list of frustration indicators found, if any.""",
    output={"schema": {"frustration_indicators": "list[string]"}},
)
Step 3: Reduce Operation (Summarize Frustration Points)
- name: summarize_frustration
  type: reduce
  reduce_key: split_transcript_id
  associative: false
  prompt: |
    Summarize the customer frustration points for this support transcript:
    {% for item in inputs %}
    Chunk {{ item.split_transcript_chunk_num }}:
    {% for indicator in item.frustration_indicators %}
    - {{ indicator }}
    {% endfor %}
    {% endfor %}
    Provide a concise summary of the main frustration points and their frequency or intensity across the entire transcript.
  output:
    schema:
      frustration_summary: string
      primary_issues: list[string]
      frustration_level: string # e.g., "low", "medium", "high"
pipeline = pipeline.reduce(
    name="summarize_frustration",
    reduce_key="split_transcript_id",
    associative=False,
    prompt="""Summarize the customer frustration points for this support transcript:
{% for item in inputs %}
Chunk {{ item.split_transcript_chunk_num }}:
{% for indicator in item.frustration_indicators %}
- {{ indicator }}
{% endfor %}
{% endfor %}
Provide a concise summary of the main frustration points and their frequency or intensity across the entire transcript.""",
    output={
        "schema": {
            "frustration_summary": "string",
            "primary_issues": "list[string]",
            "frustration_level": "string",  # e.g., "low", "medium", "high"
        }
    },
)
rows = pipeline.collect()
Non-Associative Reduce Operation
Note the associative: false parameter in the reduce operation. When chunk order matters, it ensures the reduce processes chunks in the order they appear in the original transcript.
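The ordering requirement can be pictured with a small sketch. This does not show how DocETL's reduce works internally; it only illustrates what processing chunks in original order means. The helper `group_chunks_in_order` is hypothetical, and the key names follow the `split_transcript` example above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_chunks_in_order(
    items: List[Dict[str, Any]], id_key: str, order_key: str
) -> Dict[Any, List[Dict[str, Any]]]:
    """Group chunk items by document ID and sort each group by chunk number."""
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for item in items:
        groups[item[id_key]].append(item)
    # Sorting restores the order in which chunks appeared in the source document.
    return {
        doc_id: sorted(group, key=lambda it: it[order_key])
        for doc_id, group in groups.items()
    }


items = [
    {"split_transcript_id": "a", "split_transcript_chunk_num": 2},
    {"split_transcript_id": "b", "split_transcript_chunk_num": 1},
    {"split_transcript_id": "a", "split_transcript_chunk_num": 1},
    {"split_transcript_id": "a", "split_transcript_chunk_num": 3},
]
ordered = group_chunks_in_order(
    items, "split_transcript_id", "split_transcript_chunk_num"
)
print([it["split_transcript_chunk_num"] for it in ordered["a"]])  # [1, 2, 3]
```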
Best Practices
- Balance Chunk Size: Smaller chunks may lose context, while larger chunks may degrade model accuracy. The DocETL optimizer can find the chunk size that works best for your task.
- Consider Overlap: Overlap between chunks isn't built into the Split operation, but you can achieve it by post-processing the split chunks.
- Use Appropriate Delimiters: Choose a delimiter that logically divides your text (e.g., double newlines for paragraphs, custom markers for sections), and adjust `num_splits_to_group` so chunks contain enough context for your task.
- Mind the Order: If chunk order matters for your analysis, set `associative: false` in subsequent reduce operations.
- Combine Methods: For very large documents, first split into large sections using delimiters, then apply token count splitting so no chunk exceeds model limits.
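As one possible way to add overlap by post-processing, the sketch below prefixes each chunk with the tail of the previous one. The helper `add_overlap` is hypothetical and measures overlap in whitespace-separated words rather than model tokens. It assumes the chunks belong to a single document and are already in order.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_words: int) -> List[str]:
    """Prefix every chunk except the first with the last overlap_words words of its predecessor."""
    if overlap_words < 0:
        raise ValueError("overlap_words must be non-negative")
    result = []
    for i, chunk in enumerate(chunks):
        if i == 0 or overlap_words == 0:
            # The first chunk has no predecessor; zero overlap leaves chunks unchanged.
            result.append(chunk)
        else:
            # Always take the tail from the ORIGINAL previous chunk so overlap does not compound.
            tail = chunks[i - 1].split()[-overlap_words:]
            result.append(" ".join(tail + [chunk]))
    return result


print(add_overlap(["a b c", "d e f", "g"], overlap_words=1))
# ['a b c', 'c d e f', 'f g']
```

Overlap makes each chunk longer, so a chunk that was at the token limit before post-processing may exceed it afterwards. Leaving headroom in `num_tokens` avoids that.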