Resolve Operation

The Resolve operation identifies and canonicalizes duplicate entities in your data. LLM-generated fields and multi-source data often refer to the same entity inconsistently (e.g., "Mrs. Smith" vs. "Jane Smith"); resolving the field standardizes it before further analysis.

flowchart LR
    a["doc: 'Jon Smith'"] --> a2["doc: 'John Smith'"]
    b["doc: 'John Smith'"] --> b2["doc: 'John Smith'"]
    c["doc: 'Alice Wong'"] --> c2["doc: 'Alice Wong'"]
    d["..."] --> d2["..."]

Example: Standardizing Patient Names

YAML, followed by the Python equivalent:

- name: standardize_patient_names
  type: resolve
  optimize: true
  comparison_prompt: |
    Compare the following two patient name entries:

    Patient 1: {{ input1.patient_name }}
    Date of Birth 1: {{ input1.date_of_birth }}

    Patient 2: {{ input2.patient_name }}
    Date of Birth 2: {{ input2.date_of_birth }}

    Are these entries likely referring to the same patient? Consider name similarity and date of birth. Respond with "True" if they are likely the same patient, or "False" if they are likely different patients.
  resolution_prompt: |
    Standardize the following patient name entries into a single, consistent format:

    {% for entry in inputs %}
    Patient Name {{ loop.index }}: {{ entry.patient_name }}
    {% endfor %}

    Provide a single, standardized patient name that represents all the matched entries. Use the format "LastName, FirstName MiddleInitial" if available.
  output:
    schema:
      patient_name: string

import docetl

docetl.default_model = "gpt-4o-mini"

frame = docetl.read_json("patients.json")
frame = frame.resolve(
    optimize=True,
    comparison_prompt="""Compare the following two patient name entries:

Patient 1: {{ input1.patient_name }}
Date of Birth 1: {{ input1.date_of_birth }}

Patient 2: {{ input2.patient_name }}
Date of Birth 2: {{ input2.date_of_birth }}

Are these entries likely referring to the same patient? Consider name similarity and date of birth. Respond with "True" if they are likely the same patient, or "False" if they are likely different patients.""",
    resolution_prompt="""Standardize the following patient name entries into a single, consistent format:

{% for entry in inputs %}
Patient Name {{ loop.index }}: {{ entry.patient_name }}
{% endfor %}

Provide a single, standardized patient name that represents all the matched entries. Use the format "LastName, FirstName MiddleInitial" if available.""",
    output={"schema": {"patient_name": "string"}},
)
rows = frame.collect()

The comparison_prompt compares pairs of entries; reference the two documents via input1 and input2.
For identified duplicates, the resolution_prompt generates a standardized value; reference all matched entries via the inputs variable.
Prompts use Jinja2 template syntax; reference a field with an expression such as {{ input1.patient_name }}.

Automatic Blocking

If you don't specify any blocking configuration (blocking_threshold, blocking_conditions, or limit_comparisons), the Resolve operation will automatically compute an optimal embedding-based blocking threshold at runtime. It samples pairs from your data, runs LLM comparisons on the sample, and finds a threshold that achieves 95% recall by default. You can adjust this with the blocking_target_recall parameter.
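The threshold search described above can be sketched in a few lines. This is a minimal illustration, not DocETL's implementation: it assumes the sampled pairs have already been scored with an embedding similarity and labeled by the LLM, and the function name threshold_for_recall and the sample values are hypothetical.

```python
import math


def threshold_for_recall(scored_pairs, target_recall=0.95):
    """Pick the highest similarity threshold that still meets the target recall.

    scored_pairs: list of (similarity, is_match) tuples from a labeled sample.
    Pairs with similarity >= the returned threshold are kept for comparison.
    Returns None when the sample contains no matches to calibrate against.
    """
    match_sims = sorted((sim for sim, is_match in scored_pairs if is_match), reverse=True)
    if not match_sims:
        return None
    # Number of true matches that must stay above the threshold.
    # The small epsilon guards against floating-point noise in the product.
    needed = max(1, math.ceil(target_recall * len(match_sims) - 1e-9))
    return match_sims[needed - 1]


# Hypothetical labeled sample: (embedding similarity, LLM said "same entity")
sample = [(0.9, True), (0.8, True), (0.7, True), (0.6, True),
          (0.5, False), (0.3, True), (0.2, False)]

print(threshold_for_recall(sample, target_recall=0.8))   # keeps 4 of 5 matches
print(threshold_for_recall(sample, target_recall=0.95))  # must keep all 5 matches
```

A higher target recall pushes the threshold down, which admits more candidate pairs and therefore more LLM comparisons.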

Blocking

Blocking reduces the number of comparisons by only comparing entries that are likely to be matches. Two types work together:

Code-based blocking: Apply custom Python expressions to determine if a pair should be compared.
Embedding-based blocking: Compare embeddings of specified fields and only process pairs above a certain similarity threshold.

How Blocking Works

The Resolve operation creates a union of pairs that pass either blocking method:
- First, pairs that satisfy any of the blocking_conditions are selected.
- Then, pairs that meet the blocking_threshold for embedding similarity are added (if not already included).
- When sampling is needed (via limit_comparisons), code-based pairs are prioritized over embedding-based pairs.
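As a rough sketch of this selection logic, the function below builds the union of both pair sets and keeps code-based pairs at the front when a limit applies. It is an assumption-laden illustration rather than DocETL's code: candidate_pairs, the callable condition, and the toy two-dimensional embeddings are all hypothetical, and the sampling step is simplified to truncation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(items, embeddings, condition, threshold, limit=None):
    """Return index pairs that pass code-based OR embedding-based blocking."""
    code_pairs, embedding_pairs = [], []
    for i, j in combinations(range(len(items)), 2):
        if condition(items[i], items[j]):
            code_pairs.append((i, j))
        elif cosine(embeddings[i], embeddings[j]) >= threshold:
            # Only reached when the code-based check failed, so no duplicates.
            embedding_pairs.append((i, j))
    pairs = code_pairs + embedding_pairs  # code-based pairs come first
    return pairs if limit is None else pairs[:limit]


# Hypothetical records and made-up embeddings, for illustration only
items = [{"last_name": "Smith"}, {"last_name": "Smyth"},
         {"last_name": "Wong"}, {"last_name": "Smith"}]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]


def same_last_name(a, b):
    return a["last_name"] == b["last_name"]


print(candidate_pairs(items, embeddings, same_last_name, threshold=0.8))
print(candidate_pairs(items, embeddings, same_last_name, threshold=0.8, limit=2))
```

With a limit of 2, the exact-match pair survives ahead of the embedding-only pairs, which mirrors the prioritization described above.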

Example with both blocking methods:

YAML, followed by the Python equivalent:

- name: standardize_patient_names
  type: resolve
  comparison_prompt: |
    # (Same as previous example)
  resolution_prompt: |
    # (Same as previous example)
  output:
    schema:
      patient_name: string
  blocking_keys:
    - last_name
    - date_of_birth
  blocking_threshold: 0.8
  blocking_conditions:
    - "input1['last_name'][:2].lower() == input2['last_name'][:2].lower()"
    - "input1['first_name'][:2].lower() == input2['first_name'][:2].lower()"
    - "input1['date_of_birth'] == input2['date_of_birth']"
    - "input1['ssn'][-4:] == input2['ssn'][-4:]"

frame = frame.resolve(
    name="standardize_patient_names",
    comparison_prompt="...",  # (Same as previous example)
    resolution_prompt="...",  # (Same as previous example)
    output={"schema": {"patient_name": "string"}},
    blocking_keys=["last_name", "date_of_birth"],
    blocking_threshold=0.8,
    blocking_conditions=[
        "input1['last_name'][:2].lower() == input2['last_name'][:2].lower()",
        "input1['first_name'][:2].lower() == input2['first_name'][:2].lower()",
        "input1['date_of_birth'] == input2['date_of_birth']",
        "input1['ssn'][-4:] == input2['ssn'][-4:]",
    ],
)

In this example, pairs will be considered for comparison if they satisfy any of the following:

Code-based conditions:
- The last_name fields start with the same two characters (case-insensitive), OR
- The first_name fields start with the same two characters (case-insensitive), OR
- The date_of_birth fields match exactly, OR
- The last four characters of the ssn fields match

OR

Embedding-based condition:
- The embedding similarity of their last_name and date_of_birth fields is above 0.8
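To see how the code-based conditions behave on concrete records, the sketch below evaluates the four expressions from the example against hypothetical patients. Evaluating the strings with eval is an assumption made for illustration (how DocETL evaluates them internally is not stated here), and the helper passes_code_blocking and the sample records are invented.

```python
conditions = [
    "input1['last_name'][:2].lower() == input2['last_name'][:2].lower()",
    "input1['first_name'][:2].lower() == input2['first_name'][:2].lower()",
    "input1['date_of_birth'] == input2['date_of_birth']",
    "input1['ssn'][-4:] == input2['ssn'][-4:]",
]


def passes_code_blocking(input1, input2, conditions):
    """True if at least one condition expression holds for the pair."""
    scope = {"input1": input1, "input2": input2}
    # Builtins are emptied because the expressions only need indexing,
    # slicing, string methods, and comparison.
    return any(eval(expr, {"__builtins__": {}}, scope) for expr in conditions)


# Hypothetical patient records
jon = {"first_name": "Jon", "last_name": "Smith",
       "date_of_birth": "1980-02-14", "ssn": "123-45-6789"}
john = {"first_name": "John", "last_name": "Smyth",
        "date_of_birth": "1980-02-14", "ssn": "987-65-4321"}
alice = {"first_name": "Alice", "last_name": "Wong",
         "date_of_birth": "1975-07-01", "ssn": "555-12-0000"}

print(passes_code_blocking(jon, john, conditions))   # shares name prefixes and birth date
print(passes_code_blocking(jon, alice, conditions))  # no condition holds
```

Because the conditions are OR-ed, a single loose rule (such as a shared two-letter prefix) can admit many pairs, so it is worth checking how selective each expression is on your data.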

How the Comparison Algorithm Works

After determining eligible pairs, the Resolve operation groups similar items with a Union-Find (Disjoint Set Union) algorithm:

1. Initialization: Each item starts in its own cluster.
2. Pair Generation: All possible pairs of items are generated for comparison.
3. Batch Processing: Pairs are processed in batches (controlled by compare_batch_size).
4. Comparison: For each batch:
   a. An LLM performs pairwise comparisons to determine if items match.
   b. Matching pairs trigger a merge_clusters operation to combine their clusters.
5. Iteration: Steps 3-4 repeat until all pairs are compared.
6. Result Collection: All non-empty clusters are collected as the final result.
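The steps above can be sketched with a small Union-Find structure. This is a minimal illustration under stated assumptions, not the library's code: resolve_clusters is a hypothetical name, the is_match callable stands in for the LLM comparison, and the thread pool is one possible way to run a batch in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


class UnionFind:
    """Disjoint Set Union with path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # each item starts in its own cluster

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # shorten the path
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def resolve_clusters(items, pairs, is_match, compare_batch_size=500):
    """Group items by comparing pairs batch by batch and merging matches."""
    uf = UnionFind(len(items))
    for start in range(0, len(pairs), compare_batch_size):
        batch = pairs[start:start + compare_batch_size]
        # Comparisons within a batch run in parallel, one worker per pair.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results = list(pool.map(lambda p: is_match(items[p[0]], items[p[1]]), batch))
        for (i, j), matched in zip(batch, results):
            if matched:
                uf.union(i, j)
    clusters = {}
    for idx, item in enumerate(items):
        clusters.setdefault(uf.find(idx), []).append(item)
    return list(clusters.values())


def same_person(a, b):
    """Hypothetical stand-in for the LLM: same initial and same last word."""
    return a[0] == b[0] and a.split()[-1] == b.split()[-1]


names = ["Jon Smith", "John Smith", "Alice Wong", "J. Smith"]
pairs = list(combinations(range(len(names)), 2))
print(resolve_clusters(names, pairs, same_person, compare_batch_size=2))
# [['Jon Smith', 'John Smith', 'J. Smith'], ['Alice Wong']]
```

Each resulting cluster would then be passed to the resolution_prompt to produce one canonical value.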

Efficiency

Batched comparisons let clusters update incrementally as matches are found, with LLM calls running in parallel. Parallelism is capped at the batch size, so set compare_batch_size based on your dataset size and rate limits.

Required Parameters

type: Must be set to "resolve".
comparison_prompt: The prompt template to use for comparing potential matches.
resolution_prompt: The prompt template to use for reducing matched entries.
output: Schema definition for the output from the LLM.

Optional Parameters

Parameter	Description	Default
`embedding_model`	The model to use for creating embeddings	Falls back to `default_model`
`resolution_model`	The language model to use for reducing matched entries	Falls back to `default_model`
`comparison_model`	The language model to use for comparing potential matches	Falls back to `default_model`
`blocking_keys`	List of keys to use for initial blocking	All keys in the input data
`blocking_threshold`	Embedding similarity threshold for considering entries as potential matches	Auto-computed if not set
`blocking_target_recall`	Target recall when auto-computing blocking threshold (0.0 to 1.0)	0.95
`blocking_conditions`	List of conditions for initial blocking	[]
`input`	Specifies the schema or keys to subselect from each item to pass into the prompts	All keys from input items
`embedding_batch_size`	The number of entries to send to the embedding model at a time	1000
`compare_batch_size`	The number of entity pairs processed in each batch during the comparison phase	500
`limit_comparisons`	Maximum number of comparisons to perform	None
`timeout`	Timeout for each LLM call in seconds	120
`max_retries_per_timeout`	Maximum number of retries per timeout	2
`sample`	Number of samples to use for the operation	None
`litellm_completion_kwargs`	Additional parameters to pass to LiteLLM completion calls.	{}
`bypass_cache`	If true, bypass the cache for this operation.	False
`cascade`	Model cascade config for cost reduction on candidate pair comparisons (see below)	None

Model Cascade (cost reduction)

A cascade block runs a cheap proxy model on all candidate pairs first and only escalates uncertain comparisons to the expensive oracle, with a statistical quality guarantee. Most effective when blocking produces many candidate pairs.

YAML, followed by the Python equivalent:

cascade:
  proxy_model: gpt-4o-mini
  target: 0.9

# Pass via the cascade= kwarg on a resolve call
frame = frame.resolve(
    ...,
    cascade={"proxy_model": "gpt-4o-mini", "target": 0.9},
)

Parameter	Description	Default
`proxy_model`	The cheap model for the proxy pass (required)	—
`guarantee`	`accuracy`, `precision`, `recall`, or `precision+recall`	`precision`
`target`	Target value for the guarantee metric, in `(0, 1)` (required)	—
`delta`	Failure probability; guarantee holds w.p. `1 - delta`	`0.05`
`label_budget`	Max oracle calls spent learning the threshold	`400`

See Model Cascades with BARGAIN for full details, guarantee explanations, and examples.
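The escalation idea behind a cascade can be illustrated with a short sketch. It deliberately omits the statistical part: the low and high cutoffs are hand-picked assumptions here, whereas the cascade described above learns its threshold to meet the configured guarantee. The names cascade_compare, proxy, and oracle are hypothetical stand-ins for the two model calls.

```python
def cascade_compare(pairs, proxy, oracle, low=0.2, high=0.8):
    """Decide each pair with the proxy when it is confident, else ask the oracle.

    proxy(pair)  -> confidence in [0, 1] that the pair is a match (cheap model)
    oracle(pair) -> True/False decision (expensive model)
    Returns the decisions and the number of oracle calls made.
    """
    decisions, oracle_calls = {}, 0
    for pair in pairs:
        score = proxy(pair)
        if score >= high:
            decisions[pair] = True       # proxy is confident it is a match
        elif score <= low:
            decisions[pair] = False      # proxy is confident it is not
        else:
            decisions[pair] = oracle(pair)  # uncertain: escalate
            oracle_calls += 1
    return decisions, oracle_calls


# Made-up proxy confidences and oracle answers, for illustration only
proxy_scores = {("a", "b"): 0.95, ("a", "c"): 0.05, ("b", "c"): 0.5}
oracle_answers = {("b", "c"): True}

decisions, oracle_calls = cascade_compare(
    list(proxy_scores), proxy_scores.get, oracle_answers.get
)
print(decisions)
print(oracle_calls)  # only the uncertain pair reached the oracle
```

The savings grow with the share of pairs the proxy can settle on its own, which is why a cascade pays off most when blocking leaves many candidates.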

Best Practices

Anticipate Resolve Needs: If you anticipate needing a Resolve operation and want to control the prompts, create it in your pipeline and let the optimizer find the appropriate blocking rules and thresholds.
Let the Optimizer Help: The optimizer can detect if you need a Resolve operation (e.g., because there's a downstream reduce operation you're optimizing) and can create one with suitable prompts and blocking rules.
Optimize Batch Size: compare_batch_size caps parallelism, so increase it when comparing many pairs, but note that very large values can exceed memory or API rate limits.