Skip to content

Semantic Operations

The pandas integration provides several semantic operations through the .semantic accessor. Each operation is designed to handle specific types of transformations and analyses using LLMs.

All semantic operations return a new DataFrame that preserves the original columns and adds new columns based on the output_schema. For example, if your original DataFrame has a column text and you use map with an output_schema={"sentiment": "str", "keywords": "list[str]"}, the resulting DataFrame will have three columns: text, sentiment, and keywords. This makes it easy to chain operations and maintain data lineage.

Map Operation

Apply semantic mapping to each row using a language model.

Documentation: https://ucbepic.github.io/docetl/operators/map/

Parameters:

Name Type Description Default
prompt str

Jinja template string for generating prompts. Use {{input.column_name}} to reference input columns.

required
output_schema Dict[str, Any]

Dictionary defining the expected output structure and types. Example: {"entities": "list[str]", "sentiment": "str"}

required
**kwargs

Additional configuration options: - model: LLM model to use (default: from config) - batch_prompt: Template for processing multiple documents in a single prompt - max_batch_size: Maximum number of documents to process in a single batch - optimize: Flag to enable operation optimization (default: True) - recursively_optimize: Flag to enable recursive optimization (default: False) - sample: Number of samples to use for the operation - tools: List of tool definitions for LLM use - validate: List of Python expressions to validate output - num_retries_on_validate_failure: Number of retry attempts (default: 0) - gleaning: Configuration for LLM-based refinement - drop_keys: List of keys to drop from input - timeout: Timeout for each LLM call in seconds (default: 120) - max_retries_per_timeout: Maximum retries per timeout (default: 2) - litellm_completion_kwargs: Additional parameters for LiteLLM - skip_on_error: Skip operation if LLM returns error (default: False) - bypass_cache: Bypass cache for this operation (default: False)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: A new DataFrame containing the transformed data with columns matching the output_schema.

Examples:

>>> # Extract entities and sentiment
>>> df.semantic.map(
...     prompt="Analyze this text: {{input.text}}",
...     output_schema={
...         "entities": "list[str]",
...         "sentiment": "str"
...     },
...     validate=["len(output['entities']) <= 5"],
...     num_retries_on_validate_failure=2
... )
Source code in docetl/apis/pd_accessors.py
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
def map(self, prompt: str, output_schema: Dict[str, Any], **kwargs) -> pd.DataFrame:
    """
    Apply semantic mapping to each row using a language model.

    Documentation: https://ucbepic.github.io/docetl/operators/map/

    Args:
        prompt: Jinja template string for generating prompts. Use {{input.column_name}}
               to reference input columns.
        output_schema: Dictionary defining the expected output structure and types.
                      Example: {"entities": "list[str]", "sentiment": "str"}
        **kwargs: Additional configuration options:
            - model: LLM model to use (default: from config)
            - batch_prompt: Template for processing multiple documents in a single prompt
            - max_batch_size: Maximum number of documents to process in a single batch
            - optimize: Flag to enable operation optimization (default: True)
            - recursively_optimize: Flag to enable recursive optimization (default: False)
            - sample: Number of samples to use for the operation
            - tools: List of tool definitions for LLM use
            - validate: List of Python expressions to validate output
            - num_retries_on_validate_failure: Number of retry attempts (default: 0)
            - gleaning: Configuration for LLM-based refinement
            - drop_keys: List of keys to drop from input
            - timeout: Timeout for each LLM call in seconds (default: 120)
            - max_retries_per_timeout: Maximum retries per timeout (default: 2)
            - litellm_completion_kwargs: Additional parameters for LiteLLM
            - skip_on_error: Skip operation if LLM returns error (default: False)
            - bypass_cache: Bypass cache for this operation (default: False)

    Returns:
        pd.DataFrame: A new DataFrame containing the transformed data with columns
                     matching the output_schema.

    Examples:
        >>> # Extract entities and sentiment
        >>> df.semantic.map(
        ...     prompt="Analyze this text: {{input.text}}",
        ...     output_schema={
        ...         "entities": "list[str]",
        ...         "sentiment": "str"
        ...     },
        ...     validate=["len(output['entities']) <= 5"],
        ...     num_retries_on_validate_failure=2
        ... )
    """
    # Convert DataFrame to list of dicts for DocETL
    input_data = self._df.to_dict("records")

    # Create map operation config
    map_config = {
        "type": "map",
        "name": f"semantic_map_{len(self._history)}",
        "prompt": prompt,
        "output": {"schema": output_schema},
        **kwargs,
    }

    # Create and execute map operation
    map_op = MapOperation(
        runner=self.runner,
        config=map_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = map_op.execute(input_data)

    return self._record_operation(results, "map", map_config, cost)

Example usage:

df.semantic.map(
    prompt="Extract sentiment and key points from: {{input.text}}",
    output_schema={
        "sentiment": "str",
        "key_points": "list[str]"
    },
    validate=["len(output['key_points']) <= 5"],
    num_retries_on_validate_failure=2
)

Filter Operation

Filter DataFrame rows based on semantic conditions.

Documentation: https://ucbepic.github.io/docetl/operators/filter/

Parameters:

Name Type Description Default
prompt str

Jinja template string for generating prompts

required
output_schema Optional[Dict[str, Any]]

Optional custom output schema. If None, defaults to

None
**kwargs

Additional configuration options: - model: LLM model to use - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2) - skip_on_error: Skip rows on LLM error (default: False) - bypass_cache: Bypass cache for this operation (default: False)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: Filtered DataFrame containing only rows where the model returned True

Examples:

>>> # Simple filtering
>>> df.semantic.filter(
...     prompt="Is this about technology? {{input.text}}"
... )
>>> # Custom output schema
>>> df.semantic.filter(
...     prompt="Analyze if this is relevant: {{input.text}}",
...     output_schema={
...         "keep": "bool",
...         "reason": "str"
...     }
... )
Source code in docetl/apis/pd_accessors.py
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
def filter(
    self, prompt: str, *, output_schema: Optional[Dict[str, Any]] = None, **kwargs
) -> pd.DataFrame:
    """
    Filter DataFrame rows based on semantic conditions.

    Documentation: https://ucbepic.github.io/docetl/operators/filter/

    Args:
        prompt: Jinja template string for generating prompts
        output_schema: Optional custom output schema. If None, defaults to
                      {"keep": "bool"}
        **kwargs: Additional configuration options:
            - model: LLM model to use
            - validate: List of validation expressions
            - num_retries_on_validate_failure: Number of retries
            - timeout: Timeout in seconds (default: 120)
            - max_retries_per_timeout: Max retries per timeout (default: 2)
            - skip_on_error: Skip rows on LLM error (default: False)
            - bypass_cache: Bypass cache for this operation (default: False)

    Returns:
        pd.DataFrame: Filtered DataFrame containing only rows where the model
                     returned True

    Examples:
        >>> # Simple filtering
        >>> df.semantic.filter(
        ...     prompt="Is this about technology? {{input.text}}"
        ... )

        >>> # Custom output schema
        >>> df.semantic.filter(
        ...     prompt="Analyze if this is relevant: {{input.text}}",
        ...     output_schema={
        ...         "keep": "bool",
        ...         "reason": "str"
        ...     }
        ... )
    """
    # Convert DataFrame to list of dicts
    input_data = self._df.to_dict("records")

    # Create map operation config for filtering
    filter_config = {
        "type": "map",
        "name": f"semantic_filter_{len(self._history)}",
        "prompt": prompt,
        "output": (
            {"schema": {"keep": "bool"}} if output_schema is None else output_schema
        ),
        **kwargs,
    }

    # Create and execute filter operation
    filter_op = FilterOperation(
        runner=self.runner,
        config=filter_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = filter_op.execute(input_data)

    return self._record_operation(results, "filter", filter_config, cost)

Example usage:

# Simple filtering
df.semantic.filter(
    prompt="Is this text about technology? {{input.text}}"
)

# Custom output schema with reasons
df.semantic.filter(
    prompt="Analyze if this is relevant: {{input.text}}",
    output_schema={
        "keep": "bool",
        "reason": "str"
    }
)

Merge Operation (Experimental)

Note: The merge operation is an experimental feature based on our equijoin operator. It provides a pandas-like interface for semantic record matching and deduplication. When fuzzy=True, it automatically invokes optimization to improve performance while maintaining accuracy.

Semantically merge two DataFrames based on flexible matching criteria.

Documentation: https://ucbepic.github.io/docetl/operators/equijoin/

When fuzzy=True and no blocking parameters are provided, this method automatically invokes the JoinOptimizer to generate efficient blocking conditions. The optimizer will suggest blocking thresholds and conditions to improve performance while maintaining the target recall. The optimized configuration will be displayed for future reuse.

Parameters:

Name Type Description Default
right DataFrame

Right DataFrame to merge with

required
comparison_prompt str

Prompt template for comparing records

required
fuzzy bool

Whether to use fuzzy matching with optimization (default: False)

False
**kwargs

Additional configuration options: - model: LLM model to use - blocking_threshold: Threshold for blocking optimization - blocking_conditions: Custom blocking conditions - target_recall: Target recall for optimization (default: 0.95) - estimated_selectivity: Estimated match rate - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: Merged DataFrame containing matched records

Examples:

>>> # Simple merge
>>> merged_df = df1.semantic.merge(
...     df2,
...     comparison_prompt="Are these records about the same entity? {{input1}} vs {{input2}}"
... )
>>> # Fuzzy merge with automatic optimization
>>> merged_df = df1.semantic.merge(
...     df2,
...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
...     fuzzy=True,
...     target_recall=0.9
... )
>>> # Fuzzy merge with manual blocking parameters
>>> merged_df = df1.semantic.merge(
...     df2,
...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
...     fuzzy=False,
...     blocking_threshold=0.8,
...     blocking_conditions=["input1.category == input2.category"]
... )
Source code in docetl/apis/pd_accessors.py
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
def merge(
    self,
    right: pd.DataFrame,
    comparison_prompt: str,
    *,
    fuzzy: bool = False,
    **kwargs,
) -> pd.DataFrame:
    """
    Semantically merge two DataFrames based on flexible matching criteria.

    Documentation: https://ucbepic.github.io/docetl/operators/equijoin/

    When fuzzy=True and no blocking parameters are provided, this method automatically
    invokes the JoinOptimizer to generate efficient blocking conditions. The optimizer
    will suggest blocking thresholds and conditions to improve performance while
    maintaining the target recall. The optimized configuration will be displayed
    for future reuse.

    Args:
        right: Right DataFrame to merge with
        comparison_prompt: Prompt template for comparing records
        fuzzy: Whether to use fuzzy matching with optimization (default: False)
        **kwargs: Additional configuration options:
            - model: LLM model to use
            - blocking_threshold: Threshold for blocking optimization
            - blocking_conditions: Custom blocking conditions
            - target_recall: Target recall for optimization (default: 0.95)
            - estimated_selectivity: Estimated match rate
            - validate: List of validation expressions
            - num_retries_on_validate_failure: Number of retries
            - timeout: Timeout in seconds (default: 120)
            - max_retries_per_timeout: Max retries per timeout (default: 2)

    Returns:
        pd.DataFrame: Merged DataFrame containing matched records

    Examples:
        >>> # Simple merge
        >>> merged_df = df1.semantic.merge(
        ...     df2,
        ...     comparison_prompt="Are these records about the same entity? {{input1}} vs {{input2}}"
        ... )

        >>> # Fuzzy merge with automatic optimization
        >>> merged_df = df1.semantic.merge(
        ...     df2,
        ...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
        ...     fuzzy=True,
        ...     target_recall=0.9
        ... )

        >>> # Fuzzy merge with manual blocking parameters
        >>> merged_df = df1.semantic.merge(
        ...     df2,
        ...     comparison_prompt="Compare: {{input1}} vs {{input2}}",
        ...     fuzzy=False,
        ...     blocking_threshold=0.8,
        ...     blocking_conditions=["input1.category == input2.category"]
        ... )
    """
    # Convert DataFrames to lists of dicts
    left_data = self._df.to_dict("records")
    right_data = right.to_dict("records")

    # Create equijoin operation config
    join_config = {
        "type": "equijoin",
        "name": f"semantic_merge_{len(self._history)}",
        "comparison_prompt": comparison_prompt,
        **kwargs,
    }

    # If fuzzy matching and no blocking params provided, use JoinOptimizer
    if (
        fuzzy
        and not kwargs.get("blocking_threshold")
        and not kwargs.get("blocking_conditions")
    ):
        join_optimizer = JoinOptimizer(
            self.runner,
            join_config,
            target_recall=kwargs.get("target_recall", 0.95),
            estimated_selectivity=kwargs.get("estimated_selectivity", None),
        )
        optimized_config, optimizer_cost, _ = join_optimizer.optimize_equijoin(
            left_data, right_data, skip_map_gen=True, skip_containment_gen=True
        )

        # Print optimized config for reuse
        self.runner.console.log(
            Panel.fit(
                "[bold cyan]Optimized Configuration for Merge[/bold cyan]\n"
                "[yellow]Consider adding these parameters to avoid re-running optimization:[/yellow]\n\n"
                f"blocking_threshold: {optimized_config.get('blocking_threshold')}\n"
                f"blocking_keys: {optimized_config.get('blocking_keys')}\n"
                f"blocking_conditions: {optimized_config.get('blocking_conditions', [])}\n",
                title="Optimization Results",
            )
        )
        join_config = optimized_config
        optimizer_cost_value = optimizer_cost
    else:
        optimizer_cost_value = 0.0

    # Create and execute equijoin operation
    join_op = EquijoinOperation(
        runner=self.runner,
        config=join_config,
        default_model=self.runner.config["default_model"],
        max_threads=self.runner.max_threads,
        console=self.runner.console,
        status=self.runner.status,
    )
    results, cost = join_op.execute(left_data, right_data)

    return self._record_operation(
        results, "equijoin", join_config, cost + optimizer_cost_value
    )

Example usage:

# Simple merge
merged_df = df1.semantic.merge(
    df2,
    comparison_prompt="Are these records about the same entity? {{input1}} vs {{input2}}"
)

# Fuzzy merge with optimization
merged_df = df1.semantic.merge(
    df2,
    comparison_prompt="Compare: {{input1}} vs {{input2}}",
    fuzzy=True,
    target_recall=0.9
)

Aggregate Operation

Semantically aggregate data with optional fuzzy matching.

Documentation: - Resolve Operation: https://ucbepic.github.io/docetl/operators/resolve/ - Reduce Operation: https://ucbepic.github.io/docetl/operators/reduce/

When fuzzy=True and no blocking parameters are provided in resolve_kwargs, this method automatically invokes the JoinOptimizer to generate efficient blocking conditions for the resolve phase. The optimizer will suggest blocking thresholds and conditions to improve performance while maintaining the target recall. The optimized configuration will be displayed for future reuse.

The resolve phase is skipped if: - fuzzy=False - reduce_keys=["_all"] - input data has 5 or fewer rows

Parameters:

Name Type Description Default
reduce_prompt str

Prompt template for the reduction phase

required
output_schema Dict[str, Any]

Schema for the final output

required
fuzzy bool

Whether to use fuzzy matching for resolution (default: False)

False
comparison_prompt Optional[str]

Prompt template for comparing records during resolution

None
resolution_prompt Optional[str]

Prompt template for resolving conflicts

None
resolution_output_schema Optional[Dict[str, Any]]

Schema for resolution output

None
reduce_keys Optional[Union[str, List[str]]]

Keys to group by for reduction (default: ["_all"])

['_all']
resolve_kwargs Dict[str, Any]

Additional kwargs for resolve operation: - model: LLM model to use - blocking_threshold: Threshold for blocking optimization - blocking_conditions: Custom blocking conditions - target_recall: Target recall for optimization (default: 0.95) - estimated_selectivity: Estimated match rate - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2)

{}
reduce_kwargs Dict[str, Any]

Additional kwargs for reduce operation: - model: LLM model to use - validate: List of validation expressions - num_retries_on_validate_failure: Number of retries - timeout: Timeout in seconds (default: 120) - max_retries_per_timeout: Max retries per timeout (default: 2)

{}

Returns:

Type Description
DataFrame

pd.DataFrame: Aggregated DataFrame with columns matching output_schema

Examples:

>>> # Simple aggregation
>>> df.semantic.agg(
...     reduce_prompt="Summarize these items: {{input.text}}",
...     output_schema={"summary": "str"}
... )
>>> # Fuzzy matching with automatic optimization
>>> df.semantic.agg(
...     reduce_prompt="Combine these items: {{input.text}}",
...     output_schema={"combined": "str"},
...     fuzzy=True,
...     comparison_prompt="Are these items similar: {{input1.text}} vs {{input2.text}}",
...     resolution_prompt="Resolve conflicts between: {{items}}",
...     resolution_output_schema={"resolved": "str"}
... )
>>> # Fuzzy matching with manual blocking parameters
>>> df.semantic.agg(
...     reduce_prompt="Combine these items: {{input.text}}",
...     output_schema={"combined": "str"},
...     fuzzy=False,
...     comparison_prompt="Compare items: {{input1.text}} vs {{input2.text}}",
...     resolve_kwargs={
...         "blocking_threshold": 0.8,
...         "blocking_conditions": ["input1.category == input2.category"]
...     }
... )
Source code in docetl/apis/pd_accessors.py
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
    def agg(
        self,
        *,
        # Reduction phase params (required)
        reduce_prompt: str,
        output_schema: Dict[str, Any],
        # Resolution and reduce phase params (optional)
        fuzzy: bool = False,
        comparison_prompt: Optional[str] = None,
        resolution_prompt: Optional[str] = None,
        resolution_output_schema: Optional[Dict[str, Any]] = None,
        reduce_keys: Optional[Union[str, List[str]]] = ["_all"],
        resolve_kwargs: Dict[str, Any] = {},
        reduce_kwargs: Dict[str, Any] = {},
    ) -> pd.DataFrame:
        """
        Semantically aggregate data with optional fuzzy matching.

        Documentation:
        - Resolve Operation: https://ucbepic.github.io/docetl/operators/resolve/
        - Reduce Operation: https://ucbepic.github.io/docetl/operators/reduce/

        When fuzzy=True and no blocking parameters are provided in resolve_kwargs,
        this method automatically invokes the JoinOptimizer to generate efficient
        blocking conditions for the resolve phase. The optimizer will suggest
        blocking thresholds and conditions to improve performance while maintaining
        the target recall. The optimized configuration will be displayed for future reuse.

        The resolve phase is skipped if:
        - fuzzy=False
        - reduce_keys=["_all"]
        - input data has 5 or fewer rows

        Args:
            reduce_prompt: Prompt template for the reduction phase
            output_schema: Schema for the final output
            fuzzy: Whether to use fuzzy matching for resolution (default: False)
            comparison_prompt: Prompt template for comparing records during resolution
            resolution_prompt: Prompt template for resolving conflicts
            resolution_output_schema: Schema for resolution output
            reduce_keys: Keys to group by for reduction (default: ["_all"])
            resolve_kwargs: Additional kwargs for resolve operation:
                - model: LLM model to use
                - blocking_threshold: Threshold for blocking optimization
                - blocking_conditions: Custom blocking conditions
                - target_recall: Target recall for optimization (default: 0.95)
                - estimated_selectivity: Estimated match rate
                - validate: List of validation expressions
                - num_retries_on_validate_failure: Number of retries
                - timeout: Timeout in seconds (default: 120)
                - max_retries_per_timeout: Max retries per timeout (default: 2)
            reduce_kwargs: Additional kwargs for reduce operation:
                - model: LLM model to use
                - validate: List of validation expressions
                - num_retries_on_validate_failure: Number of retries
                - timeout: Timeout in seconds (default: 120)
                - max_retries_per_timeout: Max retries per timeout (default: 2)

        Returns:
            pd.DataFrame: Aggregated DataFrame with columns matching output_schema

        Examples:
            >>> # Simple aggregation
            >>> df.semantic.agg(
            ...     reduce_prompt="Summarize these items: {{input.text}}",
            ...     output_schema={"summary": "str"}
            ... )

            >>> # Fuzzy matching with automatic optimization
            >>> df.semantic.agg(
            ...     reduce_prompt="Combine these items: {{input.text}}",
            ...     output_schema={"combined": "str"},
            ...     fuzzy=True,
            ...     comparison_prompt="Are these items similar: {{input1.text}} vs {{input2.text}}",
            ...     resolution_prompt="Resolve conflicts between: {{items}}",
            ...     resolution_output_schema={"resolved": "str"}
            ... )

            >>> # Fuzzy matching with manual blocking parameters
            >>> df.semantic.agg(
            ...     reduce_prompt="Combine these items: {{input.text}}",
            ...     output_schema={"combined": "str"},
            ...     fuzzy=False,
            ...     comparison_prompt="Compare items: {{input1.text}} vs {{input2.text}}",
            ...     resolve_kwargs={
            ...         "blocking_threshold": 0.8,
            ...         "blocking_conditions": ["input1.category == input2.category"]
            ...     }
            ... )
        """
        input_data = self._df.to_dict("records")

        # Change keys to list
        if isinstance(reduce_keys, str):
            reduce_keys = [reduce_keys]

        # Skip resolution if using _all or fuzzy is False
        if reduce_keys == ["_all"] or not fuzzy or len(input_data) <= 5:
            resolved_data = input_data
        else:
            # Synthesize comparison prompt if not provided
            if comparison_prompt is None:
                # Build record template from reduce_keys
                record_template = ", ".join(
                    f"{key}: {{{{ input{0}.{key} }}}}" for key in reduce_keys
                )

                # Add context about how fields were created
                context = self._synthesize_comparison_context(reduce_keys)

                comparison_prompt = f"""Do the following two records represent the same concept? Your answer should be true or false.{context}

Record 1: {record_template.replace('input0', 'input1')}
Record 2: {record_template.replace('input0', 'input2')}"""

            # Configure resolution
            resolve_config = {
                "type": "resolve",
                "name": f"semantic_resolve_{len(self._history)}",
                "comparison_prompt": comparison_prompt,
                **resolve_kwargs,
            }

            # Add resolution prompt and schema if provided
            if resolution_prompt:
                resolve_config["resolution_prompt"] = resolution_prompt
                resolve_config["output"] = {
                    "schema": resolution_output_schema,
                    "keys": resolution_output_schema.keys(),
                }
            else:
                resolve_config["output"] = {"keys": reduce_keys}

            # If blocking params not provided, use JoinOptimizer to synthesize them
            if not resolve_kwargs or (
                "blocking_threshold" not in resolve_kwargs
                and "blocking_conditions" not in resolve_kwargs
            ):
                join_optimizer = JoinOptimizer(
                    self.runner,
                    resolve_config,
                    target_recall=(
                        resolve_kwargs.get("target_recall", 0.95)
                        if resolve_kwargs
                        else 0.95
                    ),
                    estimated_selectivity=(
                        resolve_kwargs.get("estimated_selectivity", None)
                        if resolve_kwargs
                        else None
                    ),
                )
                optimized_config, optimizer_cost = join_optimizer.optimize_resolve(
                    input_data
                )

                # Print optimized config for reuse
                self.runner.console.log(
                    Panel.fit(
                        "[bold cyan]Optimized Configuration for Resolve[/bold cyan]\n"
                        "[yellow]Consider adding these parameters to avoid re-running optimization:[/yellow]\n\n"
                        f"blocking_threshold: {optimized_config.get('blocking_threshold')}\n"
                        f"blocking_keys: {optimized_config.get('blocking_keys')}\n"
                        f"blocking_conditions: {optimized_config.get('blocking_conditions', [])}\n",
                        title="Optimization Results",
                    )
                )
            else:
                # Use provided blocking params
                optimized_config = resolve_config.copy()
                optimizer_cost = 0.0

            # Execute resolution with optimized config
            resolve_op = ResolveOperation(
                runner=self.runner,
                config=optimized_config,
                default_model=self.runner.config["default_model"],
                max_threads=self.runner.max_threads,
                console=self.runner.console,
                status=self.runner.status,
            )
            resolved_data, resolve_cost = resolve_op.execute(input_data)
            _ = self._record_operation(
                resolved_data,
                "resolve",
                optimized_config,
                resolve_cost + optimizer_cost,
            )

        # Configure reduction
        reduce_config = {
            "type": "reduce",
            "name": f"semantic_reduce_{len(self._history)}",
            "reduce_key": reduce_keys,
            "prompt": reduce_prompt,
            "output": {"schema": output_schema},
            **reduce_kwargs,
        }

        # Execute reduction
        reduce_op = ReduceOperation(
            runner=self.runner,
            config=reduce_config,
            default_model=self.runner.config["default_model"],
            max_threads=self.runner.max_threads,
            console=self.runner.console,
            status=self.runner.status,
        )
        results, reduce_cost = reduce_op.execute(resolved_data)

        return self._record_operation(results, "reduce", reduce_config, reduce_cost)

Example usage:

# Simple aggregation
df.semantic.agg(
    reduce_prompt="Summarize these items: {{input.text}}",
    output_schema={"summary": "str"}
)

# Fuzzy matching with custom resolution
df.semantic.agg(
    reduce_prompt="Combine these items: {{input.text}}",
    output_schema={"combined": "str"},
    fuzzy=True,
    comparison_prompt="Are these items similar: {{input1.text}} vs {{input2.text}}",
    resolution_prompt="Resolve conflicts between: {{items}}",
    resolution_output_schema={"resolved": "str"}
)

Common Features

All operations support:

  1. Cost Tracking

    # After any operation
    print(f"Operation cost: ${df.semantic.total_cost}")
    

  2. Operation History

    # View operation history
    for op in df.semantic.history:
        print(f"{op.op_type}: {op.output_columns}")
    

  3. Validation Rules

    # Add validation rules to any  map or filter operation
    validate=["len(output['tags']) <= 5", "output['score'] >= 0"]
    

For more details on configuration options and best practices, refer to: - DocETL Best Practices - Pipeline Configuration - Output Schemas