Skip to content

Semantic Operations

The pandas integration provides several semantic operations through the .semantic accessor. Each operation is designed to handle specific types of transformations and analyses using LLMs.

All semantic operations return a new DataFrame that preserves the original columns and adds new columns based on the output schema. For example, if your original DataFrame has a column text and you use map with an output={"schema": {"sentiment": "str", "keywords": "list[str]"}}, the resulting DataFrame will have three columns: text, sentiment, and keywords. This makes it easy to chain operations and maintain data lineage.

Map Operation

Apply semantic mapping to each row using a language model.

Documentation: https://ucbepic.github.io/docetl/operators/map/

Parameters:

Name Type Description Default
prompt str

Jinja template string for generating prompts. Use {{input.column_name}} to reference input columns.

required
output dict[str, Any]

Dictionary containing output configuration with keys: - "schema": Dictionary defining the expected output structure and types. Example: {"entities": "list[str]", "sentiment": "str"} - "mode": Optional output mode. Either "tools" (default) or "structured_output". "structured_output" uses native JSON schema mode for supported models. - "n": Optional number of outputs to generate for each input (synthetic data generation)

None
output_schema dict[str, Any]

DEPRECATED. Use 'output' parameter instead. Dictionary defining the expected output structure for backward compatibility.

None
**kwargs

Additional configuration options: - model: LLM model to use (default: from config) - batch_prompt: Template for processing multiple documents in a single prompt - max_batch_size: Maximum number of documents to process in a single batch - optimize: Flag to enable operation optimization (default: True) - recursively_optimize: Flag to enable recursive optimization (default: False) - sample: Number of samples to use for the operation - tools: List of tool definitions for LLM use - validate: List of Python expressions to validate output - num_retries_on_validate_failure: Number of retry attempts (default: 0) - gleaning: Configuration for LLM-based refinement - drop_keys: List of keys to drop from input - timeout: Timeout for each LLM call in seconds (default: 120) - max_retries_per_timeout: Maximum retries per timeout (default: 2) - litellm_completion_kwargs: Additional parameters for LiteLLM - skip_on_error: Skip operation if LLM returns error (default: False) - bypass_cache: Bypass cache for this operation (default: False) - n: Number of outputs to generate for each input (synthetic data generation)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing the transformed data with columns matching the output schema.

Examples:

>>> # Extract entities and sentiment
>>> df.semantic.map(
...     prompt="Analyze this text: {{input.text}}",
...     output={
...         "schema": {
...             "entities": "list[str]",
...             "sentiment": "str"
...         }
...     },
...     validate=["len(output['entities']) <= 5"],
...     num_retries_on_validate_failure=2
... )
>>> # Generate synthetic data with multiple variations per input
>>> df.semantic.map(
...     prompt="Create a headline for: {{input.topic}}",
...     output={"schema": {"headline": "str"}, "n": 5}
... )
>>> # Use structured output mode for better JSON schema support
>>> df.semantic.map(
...     prompt="Extract structured data: {{input.text}}",
...     output={
...         "schema": {"name": "str", "age": "int", "tags": "list[str]"},
...         "mode": "structured_output"
...     }
... )
Source code in docetl/apis/pd_accessors.py
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
def map(
    self,
    prompt: str,
    output: dict[str, Any] = None,
    *,
    output_schema: dict[str, Any] = None,
    **kwargs,
) -> pd.DataFrame:
    """
    Apply semantic mapping to each row using a language model.

    Documentation: https://ucbepic.github.io/docetl/operators/map/

    Args:
        prompt: Jinja template string for generating prompts. Use {{input.column_name}}
               to reference input columns.
        output: Dictionary containing output configuration with keys:
               - "schema": Dictionary defining the expected output structure and types.
                          Example: {"entities": "list[str]", "sentiment": "str"}
               - "mode": Optional output mode. Either "tools" (default) or "structured_output".
                        "structured_output" uses native JSON schema mode for supported models.
               - "n": Optional number of outputs to generate for each input (synthetic data generation)
        output_schema: DEPRECATED. Use 'output' parameter instead.
                      Dictionary defining the expected output structure for backward compatibility.
        **kwargs: Additional configuration options:
            - model: LLM model to use (default: from config)
            - batch_prompt: Template for processing multiple documents in a single prompt
            - max_batch_size: Maximum number of documents to process in a single batch
            - optimize: Flag to enable operation optimization (default: True)
            - recursively_optimize: Flag to enable recursive optimization (default: False)
            - sample: Number of samples to use for the operation
            - tools: List of tool definitions for LLM use
            - validate: List of Python expressions to validate output
            - num_retries_on_validate_failure: Number of retry attempts (default: 0)
            - gleaning: Configuration for LLM-based refinement
            - drop_keys: List of keys to drop from input
            - timeout: Timeout for each LLM call in seconds (default: 120)
            - max_retries_per_timeout: Maximum retries per timeout (default: 2)
            - litellm_completion_kwargs: Additional parameters for LiteLLM
            - skip_on_error: Skip operation if LLM returns error (default: False)
            - bypass_cache: Bypass cache for this operation (default: False)
            - n: Number of outputs to generate for each input (synthetic data generation)

    Returns:
        pd.DataFrame: A new DataFrame containing the transformed data with columns
                     matching the output schema.

    Examples:
        >>> # Extract entities and sentiment
        >>> df.semantic.map(
        ...     prompt="Analyze this text: {{input.text}}",
        ...     output={
        ...         "schema": {
        ...             "entities": "list[str]",
        ...             "sentiment": "str"
        ...         }
        ...     },
        ...     validate=["len(output['entities']) <= 5"],
        ...     num_retries_on_validate_failure=2
        ... )

        >>> # Generate synthetic data with multiple variations per input
        >>> df.semantic.map(
        ...     prompt="Create a headline for: {{input.topic}}",
        ...     output={"schema": {"headline": "str"}, "n": 5}
        ... )

        >>> # Use structured output mode for better JSON schema support
        >>> df.semantic.map(
        ...     prompt="Extract structured data: {{input.text}}",
        ...     output={
        ...         "schema": {"name": "str", "age": "int", "tags": "list[str]"},
        ...         "mode": "structured_output"
        ...     }
        ... )
    """
    # Convert DataFrame to list of dicts for DocETL
    input_data = self._df.to_dict("records")

    # Handle backward compatibility: if output_schema is provided, convert it to output format
    if output_schema is not None and output is None:
        output = {"schema": output_schema}
        if "n" in kwargs:
            output["n"] = kwargs.pop("n")
    elif output is None and output_schema is None:
        raise ValueError("Either 'output' or 'output_schema' must be provided")
    elif output is not None and output_schema is not None:
        raise ValueError(
            "Cannot provide both 'output' and 'output_schema' parameters"
        )

    # Validate output parameter
    if not isinstance(output, dict):
        raise ValueError("output must be a dictionary")

    if "schema" not in output:
        raise ValueError("output must contain a 'schema' key")

    # Validate output mode if provided
    output_mode = output.get("mode", "tools")
    if output_mode not in ["tools", "structured_output"]:
        raise ValueError(
            f"output mode must be 'tools' or 'structured_output', got '{output_mode}'"
        )

    # Create map operation config
    map_config = {
        "type": "map",
        "name": f"semantic_map_{len(self._history)}",
        "prompt": prompt,
        "output": output,
        **kwargs,
    }

    # Create and execute map operation
    map_op = MapOperation(
        runner=self.runner,
        config=map_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = map_op.execute(input_data)

    return self._record_operation(results, "map", map_config, cost)

Example usage:

# Basic map operation
df.semantic.map(
    prompt="Extract sentiment and key points from: {{input.text}}",
    output={
        "schema": {
            "sentiment": "str",
            "key_points": "list[str]"
        }
    },
    validate=["len(output['key_points']) <= 5"],
    num_retries_on_validate_failure=2
)

# Using structured output mode for better JSON schema support
df.semantic.map(
    prompt="Extract detailed information from: {{input.text}}",
    output={
        "schema": {
            "company": "str",
            "product": "str",
            "features": "list[str]"
        },
        "mode": "structured_output"
    }
)

# Backward compatible syntax (still supported)
df.semantic.map(
    prompt="Extract sentiment from: {{input.text}}",
    output_schema={"sentiment": "str"}
)

Filter Operation

Filter DataFrame rows based on semantic conditions.

Documentation: https://ucbepic.github.io/docetl/operators/filter/

Parameters:

Name Type Description Default
prompt str

Jinja template string for generating prompts

required
output_schema dict[str, Any] | None

Optional custom output schema. If None, defaults to

None
**kwargs

Additional configuration options: - model: LLM model to use - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2) - skip_on_error: Skip rows on LLM error (default: False) - bypass_cache: Bypass cache for this operation (default: False)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame containing only rows where the model returned True

Examples:

>>> # Simple filtering
>>> df.semantic.filter(
...     prompt="Is this about technology? {{input.text}}"
... )
>>> # Custom output schema
>>> df.semantic.filter(
...     prompt="Analyze if this is relevant: {{input.text}}",
...     output_schema={
...         "keep": "bool",
...         "reason": "str"
...     }
... )
Source code in docetl/apis/pd_accessors.py
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
def filter(
    self, prompt: str, *, output_schema: dict[str, Any] | None = None, **kwargs
) -> pd.DataFrame:
    """
    Filter DataFrame rows based on semantic conditions.

    Documentation: https://ucbepic.github.io/docetl/operators/filter/

    Args:
        prompt: Jinja template string for generating prompts
        output_schema: Optional custom output schema. If None, defaults to
                      {"keep": "bool"}
        **kwargs: Additional configuration options:
            - model: LLM model to use
            - validate: List of validation expressions
            - num_retries_on_validate_failure: Number of retries
            - timeout: Timeout in seconds (default: 120)
            - max_retries_per_timeout: Max retries per timeout (default: 2)
            - skip_on_error: Skip rows on LLM error (default: False)
            - bypass_cache: Bypass cache for this operation (default: False)

    Returns:
        pd.DataFrame: Filtered DataFrame containing only rows where the model
                     returned True

    Examples:
        >>> # Simple filtering
        >>> df.semantic.filter(
        ...     prompt="Is this about technology? {{input.text}}"
        ... )

        >>> # Custom output schema
        >>> df.semantic.filter(
        ...     prompt="Analyze if this is relevant: {{input.text}}",
        ...     output_schema={
        ...         "keep": "bool",
        ...         "reason": "str"
        ...     }
        ... )
    """
    # Convert DataFrame to list of dicts
    input_data = self._df.to_dict("records")

    # Create map operation config for filtering
    filter_config = {
        "type": "map",
        "name": f"semantic_filter_{len(self._history)}",
        "prompt": prompt,
        "output": (
            {"schema": {"keep": "bool"}} if output_schema is None else output_schema
        ),
        **kwargs,
    }

    # Create and execute filter operation
    filter_op = FilterOperation(
        runner=self.runner,
        config=filter_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = filter_op.execute(input_data)

    return self._record_operation(results, "filter", filter_config, cost)

Example usage:

# Simple filtering
df.semantic.filter(
    prompt="Is this text about technology? {{input.text}}"
)

# Custom output schema with reasons
df.semantic.filter(
    prompt="Analyze if this is relevant: {{input.text}}",
    output={
        "schema": {
            "keep": "bool",
            "reason": "str"
        }
    }
)

Merge Operation (Experimental)

Note: The merge operation is an experimental feature based on our equijoin operator. It provides a pandas-like interface for semantic record matching and deduplication. When fuzzy=True, it automatically invokes optimization to improve performance while maintaining accuracy.

Semantically merge two DataFrames based on flexible matching criteria.

Documentation: https://ucbepic.github.io/docetl/operators/equijoin/

When fuzzy=True and no blocking parameters are provided, this method automatically invokes the JoinOptimizer to generate efficient blocking conditions. The optimizer will suggest blocking thresholds and conditions to improve performance while maintaining the target recall. The optimized configuration will be displayed for future reuse.

Parameters:

Name Type Description Default
right DataFrame

Right DataFrame to merge with

required
comparison_prompt str

Prompt template for comparing records

required
fuzzy bool

Whether to use fuzzy matching with optimization (default: False)

False
**kwargs

Additional configuration options: - model: LLM model to use - blocking_threshold: Threshold for blocking optimization - blocking_conditions: Custom blocking conditions - target_recall: Target recall for optimization (default: 0.95) - estimated_selectivity: Estimated match rate - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: Merged DataFrame containing matched records

Examples:

>>> # Simple merge
>>> merged_df = df1.semantic.merge(
...     df2,
...     comparison_prompt="Are these records about the same entity? {{input1}} vs {{input2}}"
... )
>>> # Fuzzy merge with automatic optimization
>>> merged_df = df1.semantic.merge(
...     df2,
...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
...     fuzzy=True,
...     target_recall=0.9
... )
>>> # Fuzzy merge with manual blocking parameters
>>> merged_df = df1.semantic.merge(
...     df2,
...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
...     fuzzy=False,
...     blocking_threshold=0.8,
...     blocking_conditions=["input1.category == input2.category"]
... )
Source code in docetl/apis/pd_accessors.py
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
def merge(
    self,
    right: pd.DataFrame,
    comparison_prompt: str,
    *,
    fuzzy: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Semantically merge two DataFrames based on flexible matching criteria.

    Documentation: https://ucbepic.github.io/docetl/operators/equijoin/

    When fuzzy=True and no blocking parameters are provided, this method automatically
    invokes the JoinOptimizer to generate efficient blocking conditions. The optimizer
    will suggest blocking thresholds and conditions to improve performance while
    maintaining the target recall. The optimized configuration will be displayed
    for future reuse.

    Args:
        right: Right DataFrame to merge with
        comparison_prompt: Prompt template for comparing records
        fuzzy: Whether to use fuzzy matching with optimization (default: False)
        **kwargs: Additional configuration options:
            - model: LLM model to use
            - blocking_threshold: Threshold for blocking optimization
            - blocking_conditions: Custom blocking conditions
            - target_recall: Target recall for optimization (default: 0.95)
            - estimated_selectivity: Estimated match rate
            - validate: List of validation expressions
            - num_retries_on_validate_failure: Number of retries
            - timeout: Timeout in seconds (default: 120)
            - max_retries_per_timeout: Max retries per timeout (default: 2)

    Returns:
        pd.DataFrame: Merged DataFrame containing matched records

    Examples:
        >>> # Simple merge
        >>> merged_df = df1.semantic.merge(
        ...     df2,
        ...     comparison_prompt="Are these records about the same entity? {{input1}} vs {{input2}}"
        ... )

        >>> # Fuzzy merge with automatic optimization
        >>> merged_df = df1.semantic.merge(
        ...     df2,
        ...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
        ...     fuzzy=True,
        ...     target_recall=0.9
        ... )

        >>> # Fuzzy merge with manual blocking parameters
        >>> merged_df = df1.semantic.merge(
        ...     df2,
        ...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
        ...     fuzzy=False,
        ...     blocking_threshold=0.8,
        ...     blocking_conditions=["input1.category == input2.category"]
        ... )
    """
    # Convert DataFrames to lists of dicts
    left_data = self._df.to_dict("records")
    right_data = right.to_dict("records")

    # Create equijoin operation config
    join_config = {
        "type": "equijoin",
        "name": f"semantic_merge_{len(self._history)}",
        "comparison_prompt": comparison_prompt,
        **kwargs,
    }

    # If fuzzy matching and no blocking params provided, use JoinOptimizer
    if (
        fuzzy
        and not kwargs.get("blocking_threshold")
        and not kwargs.get("blocking_conditions")
    ):
        join_optimizer = JoinOptimizer(
            self.runner,
            join_config,
            target_recall=kwargs.get("target_recall", 0.95),
            estimated_selectivity=kwargs.get("estimated_selectivity", None),
        )
        optimized_config, optimizer_cost, _ = join_optimizer.optimize_equijoin(
            left_data, right_data, skip_map_gen=True, skip_containment_gen=True
        )

        # Print optimized config for reuse
        self.runner.console.log(
            Panel.fit(
                "[bold cyan]Optimized Configuration for Merge[/bold cyan]\n"
                "[yellow]Consider adding these parameters to avoid re-running optimization:[/yellow]\n\n"
                f"blocking_threshold: {optimized_config.get('blocking_threshold')}\n"
                f"blocking_keys: {optimized_config.get('blocking_keys')}\n"
                f"blocking_conditions: {optimized_config.get('blocking_conditions', [])}\n",
                title="Optimization Results",
            )
        )
        join_config = optimized_config
        optimizer_cost_value = optimizer_cost
    else:
        optimizer_cost_value = 0.0

    # Create and execute equijoin operation
    join_op = EquijoinOperation(
        runner=self.runner,
        config=join_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = join_op.execute(left_data, right_data)

    return self._record_operation(
        results, "equijoin", join_config, cost + optimizer_cost_value
    )

Example usage:

# Simple merge
merged_df = df1.semantic.merge(
    df2,
    comparison_prompt="Are these records about the same entity? {{input1}} vs {{input2}}"
)

# Fuzzy merge with optimization
merged_df = df1.semantic.merge(
    df2,
    comparison_prompt="Compare: {{input1}} vs {{input2}}",
    fuzzy=True,
    target_recall=0.9
)

Aggregate Operation

Semantically aggregate data with optional fuzzy matching.

Documentation: - Resolve Operation: https://ucbepic.github.io/docetl/operators/resolve/ - Reduce Operation: https://ucbepic.github.io/docetl/operators/reduce/

When fuzzy=True and no blocking parameters are provided in resolve_kwargs, this method automatically invokes the JoinOptimizer to generate efficient blocking conditions for the resolve phase. The optimizer will suggest blocking thresholds and conditions to improve performance while maintaining the target recall. The optimized configuration will be displayed for future reuse.

The resolve phase is skipped if: - fuzzy=False - reduce_keys=["_all"] - input data has 5 or fewer rows

Parameters:

Name Type Description Default
reduce_prompt str

Prompt template for the reduction phase

required
output_schema dict[str, Any]

Schema for the final output

required
fuzzy bool

Whether to use fuzzy matching for resolution (default: False)

False
comparison_prompt str | None

Prompt template for comparing records during resolution

None
resolution_prompt str | None

Prompt template for resolving conflicts

None
resolution_output_schema dict[str, Any] | None

Schema for resolution output

None
reduce_keys str | list[str]

Keys to group by for reduction (default: ["_all"])

['_all']
resolve_kwargs dict[str, Any]

Additional kwargs for resolve operation: - model: LLM model to use - blocking_threshold: Threshold for blocking optimization - blocking_conditions: Custom blocking conditions - target_recall: Target recall for optimization (default: 0.95) - estimated_selectivity: Estimated match rate - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2)

{}
reduce_kwargs dict[str, Any]

Additional kwargs for reduce operation: - model: LLM model to use - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: Aggregated DataFrame with columns matching output_schema

Examples:

>>> # Simple aggregation
>>> df.semantic.agg(
...     reduce_prompt="Summarize these items: {{input.text}}",
...     output_schema={"summary": "str"}
... )
>>> # Fuzzy matching with automatic optimization
>>> df.semantic.agg(
...     reduce_prompt="Combine these items: {{input.text}}",
...     output_schema={"combined": "str"},
...     fuzzy=True,
...     comparison_prompt="Are these items similar: {{input1.text}} vs {{input2.text}}",
...     resolution_prompt="Resolve conflicts between: {{items}}",
...     resolution_output_schema={"resolved": "str"}
... )
>>> # Fuzzy matching with manual blocking parameters
>>> df.semantic.agg(
...     reduce_prompt="Combine these items: {{input.text}}",
...     output_schema={"combined": "str"},
...     fuzzy=False,
...     comparison_prompt="Compare items: {{input1.text}} vs {{input2.text}}",
...     resolve_kwargs={
...         "blocking_threshold": 0.8,
...         "blocking_conditions": ["input1.category == input2.category"]
...     }
... )
Source code in docetl/apis/pd_accessors.py
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
    def agg(
        self,
        *,
        # Reduction phase params (required)
        reduce_prompt: str,
        output_schema: dict[str, Any],
        # Resolution and reduce phase params (optional)
        fuzzy: bool = False,
        comparison_prompt: str | None = None,
        resolution_prompt: str | None = None,
        resolution_output_schema: dict[str, Any] | None = None,
        reduce_keys: str | list[str] = ["_all"],
        resolve_kwargs: dict[str, Any] = {},
        reduce_kwargs: dict[str, Any] = {},
    ) -> pd.DataFrame:
        """
        Semantically aggregate data with optional fuzzy matching.

        Documentation:
        - Resolve Operation: https://ucbepic.github.io/docetl/operators/resolve/
        - Reduce Operation: https://ucbepic.github.io/docetl/operators/reduce/

        When fuzzy=True and no blocking parameters are provided in resolve_kwargs,
        this method automatically invokes the JoinOptimizer to generate efficient
        blocking conditions for the resolve phase. The optimizer will suggest
        blocking thresholds and conditions to improve performance while maintaining
        the target recall. The optimized configuration will be displayed for future reuse.

        The resolve phase is skipped if:
        - fuzzy=False
        - reduce_keys=["_all"]
        - input data has 5 or fewer rows

        Args:
            reduce_prompt: Prompt template for the reduction phase
            output_schema: Schema for the final output
            fuzzy: Whether to use fuzzy matching for resolution (default: False)
            comparison_prompt: Prompt template for comparing records during resolution
            resolution_prompt: Prompt template for resolving conflicts
            resolution_output_schema: Schema for resolution output
            reduce_keys: Keys to group by for reduction (default: ["_all"])
            resolve_kwargs: Additional kwargs for resolve operation:
                - model: LLM model to use
                - blocking_threshold: Threshold for blocking optimization
                - blocking_conditions: Custom blocking conditions
                - target_recall: Target recall for optimization (default: 0.95)
                - estimated_selectivity: Estimated match rate
                - validate: List of validation expressions
                - num_retries_on_validate_failure: Number of retries
                - timeout: Timeout in seconds (default: 120)
                - max_retries_per_timeout: Max retries per timeout (default: 2)
            reduce_kwargs: Additional kwargs for reduce operation:
                - model: LLM model to use
                - validate: List of validation expressions
                - num_retries_on_validate_failure: Number of retries
                - timeout: Timeout in seconds (default: 120)
                - max_retries_per_timeout: Max retries per timeout (default: 2)

        Returns:
            pd.DataFrame: Aggregated DataFrame with columns matching output_schema

        Examples:
            >>> # Simple aggregation
            >>> df.semantic.agg(
            ...     reduce_prompt="Summarize these items: {{input.text}}",
            ...     output_schema={"summary": "str"}
            ... )

            >>> # Fuzzy matching with automatic optimization
            >>> df.semantic.agg(
            ...     reduce_prompt="Combine these items: {{input.text}}",
            ...     output_schema={"combined": "str"},
            ...     fuzzy=True,
            ...     comparison_prompt="Are these items similar: {{input1.text}} vs {{input2.text}}",
            ...     resolution_prompt="Resolve conflicts between: {{items}}",
            ...     resolution_output_schema={"resolved": "str"}
            ... )

            >>> # Fuzzy matching with manual blocking parameters
            >>> df.semantic.agg(
            ...     reduce_prompt="Combine these items: {{input.text}}",
            ...     output_schema={"combined": "str"},
            ...     fuzzy=False,
            ...     comparison_prompt="Compare items: {{input1.text}} vs {{input2.text}}",
            ...     resolve_kwargs={
            ...         "blocking_threshold": 0.8,
            ...         "blocking_conditions": ["input1.category == input2.category"]
            ...     }
            ... )
        """
        input_data = self._df.to_dict("records")

        # Change keys to list
        if isinstance(reduce_keys, str):
            reduce_keys = [reduce_keys]

        # Skip resolution if using _all or fuzzy is False
        if reduce_keys == ["_all"] or not fuzzy or len(input_data) <= 5:
            resolved_data = input_data
        else:
            # Synthesize comparison prompt if not provided
            if comparison_prompt is None:
                # Build record template from reduce_keys
                record_template = ", ".join(
                    f"{key}: {{{{ input{0}.{key} }}}}" for key in reduce_keys
                )

                # Add context about how fields were created
                context = self._synthesize_comparison_context(reduce_keys)

                comparison_prompt = f"""Do the following two records represent the same concept? Your answer should be true or false.{context}

Record 1: {record_template.replace('input0', 'input1')}
Record 2: {record_template.replace('input0', 'input2')}"""

            # Configure resolution
            resolve_config = {
                "type": "resolve",
                "name": f"semantic_resolve_{len(self._history)}",
                "comparison_prompt": comparison_prompt,
                **resolve_kwargs,
            }

            # Add resolution prompt and schema if provided
            if resolution_prompt:
                resolve_config["resolution_prompt"] = resolution_prompt
                resolve_config["output"] = {
                    "schema": resolution_output_schema,
                    "keys": list(resolution_output_schema.keys()),
                }
            else:
                resolve_config["output"] = {"keys": reduce_keys}

            # If blocking params not provided, use JoinOptimizer to synthesize them
            if not resolve_kwargs or (
                "blocking_threshold" not in resolve_kwargs
                and "blocking_conditions" not in resolve_kwargs
            ):
                join_optimizer = JoinOptimizer(
                    self.runner,
                    resolve_config,
                    target_recall=(
                        resolve_kwargs.get("target_recall", 0.95)
                        if resolve_kwargs
                        else 0.95
                    ),
                    estimated_selectivity=(
                        resolve_kwargs.get("estimated_selectivity", None)
                        if resolve_kwargs
                        else None
                    ),
                )
                optimized_config, optimizer_cost = join_optimizer.optimize_resolve(
                    input_data
                )

                # Print optimized config for reuse
                self.runner.console.log(
                    Panel.fit(
                        "[bold cyan]Optimized Configuration for Resolve[/bold cyan]\n"
                        "[yellow]Consider adding these parameters to avoid re-running optimization:[/yellow]\n\n"
                        f"blocking_threshold: {optimized_config.get('blocking_threshold')}\n"
                        f"blocking_keys: {optimized_config.get('blocking_keys')}\n"
                        f"blocking_conditions: {optimized_config.get('blocking_conditions', [])}\n",
                        title="Optimization Results",
                    )
                )
            else:
                # Use provided blocking params
                optimized_config = resolve_config.copy()
                optimizer_cost = 0.0

            # Execute resolution with optimized config
            resolve_op = ResolveOperation(
                runner=self.runner,
                config=optimized_config,
                default_model=self.runner.config["default_model"],
                max_threads=self.runner.max_threads,
                console=self.runner.console,
                status=self.runner.status,
            )
            resolved_data, resolve_cost = resolve_op.execute(input_data)
            _ = self._record_operation(
                resolved_data,
                "resolve",
                optimized_config,
                resolve_cost + optimizer_cost,
            )

        # Configure reduction
        reduce_config = {
            "type": "reduce",
            "name": f"semantic_reduce_{len(self._history)}",
            "reduce_key": reduce_keys,
            "prompt": reduce_prompt,
            "output": {"schema": output_schema},
            **reduce_kwargs,
        }

        # Execute reduction
        reduce_op = ReduceOperation(
            runner=self.runner,
            config=reduce_config,
            default_model=self.runner.config["default_model"],
            max_threads=self.runner.max_threads,
            console=self.runner.console,
            status=self.runner.status,
        )
        results, reduce_cost = reduce_op.execute(resolved_data)

        return self._record_operation(results, "reduce", reduce_config, reduce_cost)

Example usage:

# Simple aggregation
df.semantic.agg(
    reduce_prompt="Summarize these items: {{input.text}}",
    output_schema={"summary": "str"}
)

# Fuzzy matching with custom resolution
df.semantic.agg(
    reduce_prompt="Combine these items: {{input.text}}",
    output_schema={"combined": "str"},
    fuzzy=True,
    comparison_prompt="Are these items similar: {{input1.text}} vs {{input2.text}}",
    resolution_prompt="Resolve conflicts between: {{items}}",
    resolution_output_schema={"resolved": "str"}
)

Split Operation

    Split DataFrame rows into multiple chunks based on content.

    Documentation: https://ucbepic.github.io/docetl/operators/split/

    Args:
        split_key: The column containing content to split
        method: Splitting method, either "token_count" or "delimiter"
        method_kwargs: Dictionary containing method-specific parameters:
            - For "token_count": {"num_tokens": int, "model": str (optional)}
            - For "delimiter": {"delimiter": str, "num_splits_to_group": int (optional)}
        **kwargs: Additional configuration options:
            - model: LLM model to use for tokenization (default: from config)

    Returns:
        pd.DataFrame: DataFrame with split content, including:
            - {split_key}_chunk: The content of each chunk
            - {operation_name}_id: Unique identifier for the original document
            - {operation_name}_chunk_num: Sequential chunk number within the document

    Examples:
        >>> # Split by token count
        >>> df.semantic.split(
        ...     split_key="content",
        ...     method="token_count",
        ...     method_kwargs={"num_tokens": 100}
        ... )

        >>> # Split by delimiter
        >>> df.semantic.split(
        ...     split_key="text",
        ...     method="delimiter",
        ...     method_kwargs={"delimiter": "

", "num_splits_to_group": 2} ... )

Source code in docetl/apis/pd_accessors.py
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
def split(
    self, split_key: str, method: str, method_kwargs: dict[str, Any], **kwargs
) -> pd.DataFrame:
    """
    Split DataFrame rows into multiple chunks based on content.

    Documentation: https://ucbepic.github.io/docetl/operators/split/

    Args:
        split_key: The column containing content to split
        method: Splitting method, either "token_count" or "delimiter"
        method_kwargs: Dictionary containing method-specific parameters:
            - For "token_count": {"num_tokens": int, "model": str (optional)}
            - For "delimiter": {"delimiter": str, "num_splits_to_group": int (optional)}
        **kwargs: Additional configuration options:
            - model: LLM model to use for tokenization (default: from config)

    Returns:
        pd.DataFrame: DataFrame with split content, including:
            - {split_key}_chunk: The content of each chunk
            - {operation_name}_id: Unique identifier for the original document
            - {operation_name}_chunk_num: Sequential chunk number within the document

    Examples:
        >>> # Split by token count
        >>> df.semantic.split(
        ...     split_key="content",
        ...     method="token_count",
        ...     method_kwargs={"num_tokens": 100}
        ... )

        >>> # Split by delimiter
        >>> df.semantic.split(
        ...     split_key="text",
        ...     method="delimiter",
        ...     method_kwargs={"delimiter": "\n\n", "num_splits_to_group": 2}
        ... )
    """
    # Convert DataFrame to list of dicts
    input_data = self._df.to_dict("records")

    # Create split operation config
    split_config = {
        "type": "split",
        "name": f"semantic_split_{len(self._history)}",
        "split_key": split_key,
        "method": method,
        "method_kwargs": method_kwargs,
        **kwargs,
    }

    # Create and execute split operation
    split_op = SplitOperation(
        runner=self.runner,
        config=split_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = split_op.execute(input_data)

    return self._record_operation(results, "split", split_config, cost)

Example usage:

# Split by token count
df.semantic.split(
    split_key="content",
    method="token_count",
    method_kwargs={"num_tokens": 100}
)

# Split by delimiter
df.semantic.split(
    split_key="text",
    method="delimiter",
    method_kwargs={"delimiter": "\n\n", "num_splits_to_group": 2}
)

Gather Operation

Gather contextual information from surrounding chunks to enhance each chunk.

Documentation: https://ucbepic.github.io/docetl/operators/gather/

Parameters:

Name Type Description Default
content_key str

The column containing the main content to be enhanced

required
doc_id_key str

The column containing document identifiers to group chunks

required
order_key str

The column containing chunk order numbers within documents

required
peripheral_chunks dict[str, Any] | None

Configuration for surrounding context: - previous: {"head": {"count": int}, "tail": {"count": int}, "middle": {}} - next: {"head": {"count": int}, "tail": {"count": int}, "middle": {}}

None
**kwargs

Additional configuration options: - main_chunk_start: Start marker for main chunk (default: "--- Begin Main Chunk ---") - main_chunk_end: End marker for main chunk (default: "--- End Main Chunk ---") - doc_header_key: Column containing document headers (optional)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with enhanced content including: - {content_key}_rendered: The main content with surrounding context

Examples:

>>> # Basic gathering with surrounding context
>>> df.semantic.gather(
...     content_key="chunk_content",
...     doc_id_key="document_id",
...     order_key="chunk_number",
...     peripheral_chunks={
...         "previous": {"head": {"count": 2}, "tail": {"count": 1}},
...         "next": {"head": {"count": 1}, "tail": {"count": 2}}
...     }
... )
>>> # Simple gathering without peripheral chunks
>>> df.semantic.gather(
...     content_key="content",
...     doc_id_key="doc_id",
...     order_key="order"
... )
Source code in docetl/apis/pd_accessors.py
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
def gather(
    self,
    content_key: str,
    doc_id_key: str,
    order_key: str,
    peripheral_chunks: dict[str, Any] | None = None,
    **kwargs,
) -> pd.DataFrame:
    """
    Gather contextual information from surrounding chunks to enhance each chunk.

    Documentation: https://ucbepic.github.io/docetl/operators/gather/

    Args:
        content_key: The column containing the main content to be enhanced
        doc_id_key: The column containing document identifiers to group chunks
        order_key: The column containing chunk order numbers within documents
        peripheral_chunks: Configuration for surrounding context:
            - previous: {"head": {"count": int}, "tail": {"count": int}, "middle": {}}
            - next: {"head": {"count": int}, "tail": {"count": int}, "middle": {}}
        **kwargs: Additional configuration options:
            - main_chunk_start: Start marker for main chunk (default: "--- Begin Main Chunk ---")
            - main_chunk_end: End marker for main chunk (default: "--- End Main Chunk ---")
            - doc_header_key: Column containing document headers (optional)

    Returns:
        pd.DataFrame: DataFrame with enhanced content including:
            - {content_key}_rendered: The main content with surrounding context

    Examples:
        >>> # Basic gathering with surrounding context
        >>> df.semantic.gather(
        ...     content_key="chunk_content",
        ...     doc_id_key="document_id",
        ...     order_key="chunk_number",
        ...     peripheral_chunks={
        ...         "previous": {"head": {"count": 2}, "tail": {"count": 1}},
        ...         "next": {"head": {"count": 1}, "tail": {"count": 2}}
        ...     }
        ... )

        >>> # Simple gathering without peripheral chunks
        >>> df.semantic.gather(
        ...     content_key="content",
        ...     doc_id_key="doc_id",
        ...     order_key="order"
        ... )
    """
    # Convert DataFrame to list of dicts
    input_data = self._df.to_dict("records")

    # Create gather operation config
    gather_config = {
        "type": "gather",
        "name": f"semantic_gather_{len(self._history)}",
        "content_key": content_key,
        "doc_id_key": doc_id_key,
        "order_key": order_key,
        **kwargs,
    }

    # Add peripheral_chunks config if provided
    if peripheral_chunks is not None:
        gather_config["peripheral_chunks"] = peripheral_chunks

    # Create and execute gather operation
    gather_op = GatherOperation(
        runner=self.runner,
        config=gather_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = gather_op.execute(input_data)

    return self._record_operation(results, "gather", gather_config, cost)

Example usage:

# Basic gathering with surrounding context
df.semantic.gather(
    content_key="chunk_content",
    doc_id_key="document_id",
    order_key="chunk_number",
    peripheral_chunks={
        "previous": {"head": {"count": 2}, "tail": {"count": 1}},
        "next": {"head": {"count": 1}, "tail": {"count": 2}}
    }
)

# Simple gathering without peripheral chunks
df.semantic.gather(
    content_key="content",
    doc_id_key="doc_id",
    order_key="order"
)

Unnest Operation

Unnest list-like or dictionary values into multiple rows.

Documentation: https://ucbepic.github.io/docetl/operators/unnest/

Parameters:

Name Type Description Default
unnest_key str

The column containing list-like or dictionary values to unnest

required
keep_empty bool

Whether to keep rows with empty/null values (default: False)

False
expand_fields list[str] | None

For dictionary values, which fields to expand (default: all)

None
recursive bool

Whether to recursively unnest nested structures (default: False)

False
depth int | None

Maximum depth for recursive unnesting (default: 1, or unlimited if recursive=True)

None
**kwargs

Additional configuration options

{}

Returns:

Type Description
DataFrame

pd.DataFrame: DataFrame with unnested values, where: - For lists: Each list element becomes a separate row - For dicts: Specified fields are expanded into the parent row

Examples:

>>> # Unnest a list column
>>> df.semantic.unnest(
...     unnest_key="tags"
... )
# Input:  [{"id": 1, "tags": ["a", "b"]}]
# Output: [{"id": 1, "tags": "a"}, {"id": 1, "tags": "b"}]
>>> # Unnest a dictionary column with specific fields
>>> df.semantic.unnest(
...     unnest_key="user_info",
...     expand_fields=["name", "age"]
... )
# Input:  [{"id": 1, "user_info": {"name": "Alice", "age": 30, "email": "alice@example.com"}}]
# Output: [{"id": 1, "user_info": {...}, "name": "Alice", "age": 30}]
>>> # Recursive unnesting
>>> df.semantic.unnest(
...     unnest_key="nested_lists",
...     recursive=True,
...     depth=2
... )
Source code in docetl/apis/pd_accessors.py
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
def unnest(
    self,
    unnest_key: str,
    keep_empty: bool = False,
    expand_fields: list[str] | None = None,
    recursive: bool = False,
    depth: int | None = None,
    **kwargs,
) -> pd.DataFrame:
    """
    Unnest list-like or dictionary values into multiple rows.

    Documentation: https://ucbepic.github.io/docetl/operators/unnest/

    Args:
        unnest_key: The column containing list-like or dictionary values to unnest
        keep_empty: Whether to keep rows with empty/null values (default: False)
        expand_fields: For dictionary values, which fields to expand (default: all)
        recursive: Whether to recursively unnest nested structures (default: False)
        depth: Maximum depth for recursive unnesting (default: 1, or unlimited if recursive=True)
        **kwargs: Additional configuration options

    Returns:
        pd.DataFrame: DataFrame with unnested values, where:
            - For lists: Each list element becomes a separate row
            - For dicts: Specified fields are expanded into the parent row

    Examples:
        >>> # Unnest a list column
        >>> df.semantic.unnest(
        ...     unnest_key="tags"
        ... )
        # Input:  [{"id": 1, "tags": ["a", "b"]}]
        # Output: [{"id": 1, "tags": "a"}, {"id": 1, "tags": "b"}]

        >>> # Unnest a dictionary column with specific fields
        >>> df.semantic.unnest(
        ...     unnest_key="user_info",
        ...     expand_fields=["name", "age"]
        ... )
        # Input:  [{"id": 1, "user_info": {"name": "Alice", "age": 30, "email": "alice@example.com"}}]
        # Output: [{"id": 1, "user_info": {...}, "name": "Alice", "age": 30}]

        >>> # Recursive unnesting
        >>> df.semantic.unnest(
        ...     unnest_key="nested_lists",
        ...     recursive=True,
        ...     depth=2
        ... )
    """
    # Convert DataFrame to list of dicts
    input_data = self._df.to_dict("records")

    # Create unnest operation config
    unnest_config = {
        "type": "unnest",
        "name": f"semantic_unnest_{len(self._history)}",
        "unnest_key": unnest_key,
        "keep_empty": keep_empty,
        "recursive": recursive,
        **kwargs,
    }

    # Add optional parameters if provided
    if expand_fields is not None:
        unnest_config["expand_fields"] = expand_fields
    if depth is not None:
        unnest_config["depth"] = depth

    # Create and execute unnest operation
    unnest_op = UnnestOperation(
        runner=self.runner,
        config=unnest_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = unnest_op.execute(input_data)

    return self._record_operation(results, "unnest", unnest_config, cost)

Example usage:

# Unnest a list column
df.semantic.unnest(unnest_key="tags")

# Unnest a dictionary column with specific fields
df.semantic.unnest(
    unnest_key="user_info",
    expand_fields=["name", "age"]
)

# Recursive unnesting with depth control
df.semantic.unnest(
    unnest_key="nested_lists",
    recursive=True,
    depth=2
)

Common Features

All operations support:

  1. Cost Tracking

    # After any operation
    print(f"Operation cost: ${df.semantic.total_cost}")
    

  2. Operation History

    # View operation history
    for op in df.semantic.history:
        print(f"{op.op_type}: {op.output_columns}")
    

  3. Validation Rules

    # Add validation rules to any  map or filter operation
    validate=["len(output['tags']) <= 5", "output['score'] >= 0"]
    

For more details on configuration options and best practices, refer to: - DocETL Best Practices - Pipeline Configuration - Output Schemas