Filter Operation
The Filter operation behaves like Map, except items whose boolean output evaluates to false are dropped from the dataset.
flowchart LR
d1["doc 1"] --> f{"keep?"}
d2["doc 2"] --> f
d3["doc 3"] --> f
dn["..."] --> f
f --> o1["doc 1"]
f --> o3["doc 3"]
f --> on["..."]
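The keep-or-drop behavior in the diagram can be sketched in plain Python. This is an illustration only, not DocETL's implementation: `filter_docs` and `keyword_judge` are hypothetical names, and the rule-based judge stands in for the LLM call that would normally produce the boolean field.

```python
from typing import Callable


def filter_docs(docs: list[dict], judge: Callable[[dict], dict], field: str) -> list[dict]:
    """Keep documents whose judged boolean field is true."""
    kept = []
    for doc in docs:
        output = judge(doc)      # stand-in for the LLM call
        if output[field]:
            kept.append(doc)     # the document passes through without the boolean field
    return kept


def keyword_judge(doc: dict) -> dict:
    """Hypothetical rule-based judge used in place of a model."""
    return {"keep": "climate" in doc["title"].lower()}


docs = [
    {"title": "Global Climate Summit Reaches Landmark Agreement"},
    {"title": "Local Bakery Wins Best Croissant Award"},
]
kept = filter_docs(docs, keyword_judge, "keep")
print(kept)
```

Only the first document survives, and the `keep` field used to make the decision does not appear in the result.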
Example: Filtering High-Impact News Articles
- name: filter_high_impact_articles
  type: filter
  prompt: |
    Analyze the following news article:
    Title: "{{ input.title }}"
    Content: "{{ input.content }}"
    Determine if this article is high-impact based on the following criteria:
    1. Covers a significant global or national event
    2. Has potential long-term consequences
    3. Affects a large number of people
    4. Is from a reputable source
    Respond with 'true' if the article meets at least 3 of these criteria, otherwise respond with 'false'.
  output:
    schema:
      is_high_impact: boolean
  model: gpt-4-turbo
  validate:
    - isinstance(output["is_high_impact"], bool)
import docetl
docetl.default_model = "gpt-4-turbo"
frame = docetl.read_json("articles.json")
frame = frame.filter(
prompt="""Analyze the following news article:
Title: "{{ input.title }}"
Content: "{{ input.content }}"
Determine if this article is high-impact based on the following criteria:
1. Covers a significant global or national event
2. Has potential long-term consequences
3. Affects a large number of people
4. Is from a reputable source
Respond with 'true' if the article meets at least 3 of these criteria, otherwise respond with 'false'.""",
output={"schema": {"is_high_impact": "boolean"}},
model="gpt-4-turbo",
validate=["isinstance(output['is_high_impact'], bool)"],
)
rows = frame.collect()
Sample Input and Output
Input:
[
{
"title": "Global Climate Summit Reaches Landmark Agreement",
"content": "In a historic move, world leaders at the Global Climate Summit have unanimously agreed to reduce carbon emissions by 50% by 2030. This unprecedented agreement involves all major economies and sets binding targets for renewable energy adoption, reforestation, and industrial emissions reduction. Experts hail this as a turning point in the fight against climate change, with potential far-reaching effects on global economies, energy systems, and everyday life for billions of people."
},
{
"title": "Local Bakery Wins Best Croissant Award",
"content": "Downtown's favorite bakery, 'The Crusty Loaf', has been awarded the title of 'Best Croissant' in the annual City Food Festival. Owner Maria Garcia attributes the win to their use of imported French butter and a secret family recipe. Local food critics praise the bakery's commitment to traditional baking methods."
}
]
Output:
[
{
"title": "Global Climate Summit Reaches Landmark Agreement",
"content": "In a historic move, world leaders at the Global Climate Summit have unanimously agreed to reduce carbon emissions by 50% by 2030. This unprecedented agreement involves all major economies and sets binding targets for renewable energy adoption, reforestation, and industrial emissions reduction. Experts hail this as a turning point in the fight against climate change, with potential far-reaching effects on global economies, energy systems, and everyday life for billions of people."
}
]
Configuration
Required Parameters
- name: A unique name for the operation.
- type: Must be set to "filter".
- prompt: The prompt template to use for the filtering condition. Access input variables with input.keyname.
- output: Schema definition for the output from the LLM. It must include only one field, a boolean field.
Optional Parameters
See map optional parameters for additional configuration options, including batch_prompt and max_batch_size.
Tool-equipped filter agents
In Python, filter supports agent=docetl.Agent(...) just like map. The
filter agent can call tools over multiple turns, then returns the filter's
single boolean output field. DocETL drops that boolean field from kept rows.
import docetl
@docetl.tool
def has_active_legal_hold(account_id: str) -> bool:
    """Return whether an account is currently under legal hold."""
    return account_id in {"acct_1042", "acct_7788", "acct_9910"}
agent = docetl.Agent(tools=[has_active_legal_hold], max_turns=5, max_tool_calls=3)
frame = frame.filter(
prompt=(
"Keep only records that require compliance review. Use "
"has_active_legal_hold for account {{ input.account_id }} and consider "
"the event text: {{ input.event_text }}"
),
output={"schema": {"keep": "bool"}},
model="azure/gpt-4o-mini",
agent=agent,
)
Agent configs are Python-only and cannot be exported to YAML. Filters with
agent= cannot currently be combined with cascade, because cascades use a
separate proxy/oracle execution path.
See the Python API reference for the full API and the Tool-Equipped Agents tutorial for an end-to-end tool-equipped pipeline.
Model Cascade (cost reduction)
A cascade block runs a cheap proxy model on all items first and only escalates uncertain cases to the expensive oracle model, with a statistical quality guarantee.
- name: is_relevant
  type: filter
  model: gpt-4o
  prompt: "Is this document about climate policy? {{ input.text }}"
  output: { schema: { keep: "bool" } }
  cascade:
    proxy_model: gpt-4o-mini
    target: 0.95
pipeline = pipeline.filter(
name="is_relevant",
model="gpt-4o",
prompt="Is this document about climate policy? {{ input.text }}",
output={"schema": {"keep": "bool"}},
cascade={"proxy_model": "gpt-4o-mini", "target": 0.95},
)
| Parameter | Description | Default |
|---|---|---|
| proxy_model | The cheap model for the proxy pass (required) | — |
| guarantee | accuracy, precision, recall, or precision+recall | recall |
| target | Target value for the guarantee metric, in (0, 1) (required) | — |
| delta | Failure probability; the guarantee holds with probability 1 - delta | 0.05 |
| label_budget | Max oracle calls spent learning the threshold | 400 |
proxy_model can be a chat model (scored by logprobs) or an embedding model
like text-embedding-3-small. With an embedding model, a logistic head is
fitted on an oracle-labeled slice of the budget, which is far cheaper per
item for high-volume topical filters.
See Model Cascades with BARGAIN for full details, guarantee explanations, and examples.
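The routing idea behind a cascade can be sketched as follows. This is a deliberately simplified illustration, not the BARGAIN procedure: the `low` and `high` thresholds are hand-picked here, whereas the real cascade learns its threshold from oracle-labeled items within `label_budget` and attaches a statistical guarantee. `cascade_filter`, `toy_proxy`, and `toy_oracle` are hypothetical names.

```python
def cascade_filter(items, proxy_score, oracle, low, high):
    """Decide with the proxy when it is confident; otherwise ask the oracle."""
    kept, oracle_calls = [], 0
    for item in items:
        score = proxy_score(item)
        if score >= high:        # proxy is confident the item should be kept
            keep = True
        elif score <= low:       # proxy is confident the item should be dropped
            keep = False
        else:                    # uncertain: escalate to the expensive model
            keep = oracle(item)
            oracle_calls += 1
        if keep:
            kept.append(item)
    return kept, oracle_calls


KEYWORDS = {"climate", "policy", "carbon"}  # hypothetical keyword list


def toy_proxy(text):
    """Cheap score: fraction of words that appear in the keyword list."""
    words = text.split()
    return sum(word in KEYWORDS for word in words) / len(words)


def toy_oracle(text):
    """Stand-in for the expensive model's judgment."""
    return "policy" in text


docs = [
    "climate policy bill",        # score 2/3: uncertain, goes to the oracle
    "carbon tax climate policy",  # score 3/4: kept by the proxy alone
    "croissant award",            # score 0: dropped by the proxy alone
    "emissions report",           # score 0: dropped by the proxy alone
]
kept, oracle_calls = cascade_filter(docs, toy_proxy, toy_oracle, low=0.1, high=0.7)
print(kept, oracle_calls)
```

In this toy run the oracle is consulted for one of the four items; the other three are decided by the proxy score alone.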
Limiting filtered outputs
For filter, limit counts only retained documents (boolean output true). DocETL evaluates inputs until it has collected limit passing documents, then stops scheduling LLM calls, so you can request "the first N matches" without scoring the entire dataset.
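A minimal sequential sketch of this early-stopping behavior, assuming one judgment per input and no batching or concurrency (real scheduling may differ, so the exact number of calls can vary). `filter_with_limit` is a hypothetical helper, and the lambda stands in for the LLM judgment.

```python
def filter_with_limit(docs, judge, limit):
    """Collect passing documents until `limit` is reached, then stop judging."""
    kept, calls = [], 0
    for doc in docs:
        if len(kept) >= limit:   # enough matches: stop issuing further calls
            break
        calls += 1               # each judgment stands in for one LLM call
        if judge(doc):
            kept.append(doc)
    return kept, calls


numbers = list(range(1, 101))
kept, calls = filter_with_limit(numbers, lambda n: n % 10 == 0, limit=3)
print(kept, calls)
```

Here the third match is found at the 30th input, so the remaining 70 inputs are never judged.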
Validation
For more details on validation techniques and implementation, see operators.
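To make the validate expression from the example concrete, the sketch below checks it against two candidate outputs. How DocETL evaluates these expressions internally is not described on this page; the use of `eval` with a restricted namespace and the `passes_validation` helper are assumptions made for illustration.

```python
def passes_validation(output: dict, expressions: list[str]) -> bool:
    """Return True only if every expression holds for the given output."""
    allowed = {"isinstance": isinstance, "bool": bool}  # names the expressions may use
    return all(
        eval(expr, {"__builtins__": allowed}, {"output": output})
        for expr in expressions
    )


rules = ['isinstance(output["is_high_impact"], bool)']

ok = passes_validation({"is_high_impact": True}, rules)
bad = passes_validation({"is_high_impact": "true"}, rules)
print(ok, bad)
```

The string "true" fails the check because it is not a Python `bool`, which is the kind of malformed model output this rule is meant to catch.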
Best Practices
- Boolean Output: The output schema must have exactly one boolean field; write the prompt so the LLM produces a clear true/false judgment.
- Data Flow Awareness: Unlike Map, Filter reduces the size of your dataset.