Resolve Operation
The Resolve operation identifies and canonicalizes duplicate entities in your data. LLM-generated fields and multi-source data often refer to the same entity inconsistently (e.g., "Mrs. Smith" vs. "Jane Smith"); resolving the field standardizes it before further analysis.
flowchart LR
a["doc: 'Jon Smith'"] --> a2["doc: 'John Smith'"]
b["doc: 'John Smith'"] --> b2["doc: 'John Smith'"]
c["doc: 'Alice Wong'"] --> c2["doc: 'Alice Wong'"]
d["..."] --> d2["..."]
Example: Standardizing Patient Names
- name: standardize_patient_names
  type: resolve
  optimize: true
  comparison_prompt: |
    Compare the following two patient name entries:
    Patient 1: {{ input1.patient_name }}
    Date of Birth 1: {{ input1.date_of_birth }}
    Patient 2: {{ input2.patient_name }}
    Date of Birth 2: {{ input2.date_of_birth }}
    Are these entries likely referring to the same patient? Consider name similarity and date of birth. Respond with "True" if they are likely the same patient, or "False" if they are likely different patients.
  resolution_prompt: |
    Standardize the following patient name entries into a single, consistent format:
    {% for entry in inputs %}
    Patient Name {{ loop.index }}: {{ entry.patient_name }}
    {% endfor %}
    Provide a single, standardized patient name that represents all the matched entries. Use the format "LastName, FirstName MiddleInitial" if available.
  output:
    schema:
      patient_name: string
import docetl
docetl.default_model = "gpt-4o-mini"
frame = docetl.read_json("patients.json")
frame = frame.resolve(
    optimize=True,
    comparison_prompt="""Compare the following two patient name entries:
Patient 1: {{ input1.patient_name }}
Date of Birth 1: {{ input1.date_of_birth }}
Patient 2: {{ input2.patient_name }}
Date of Birth 2: {{ input2.date_of_birth }}
Are these entries likely referring to the same patient? Consider name similarity and date of birth. Respond with "True" if they are likely the same patient, or "False" if they are likely different patients.""",
    resolution_prompt="""Standardize the following patient name entries into a single, consistent format:
{% for entry in inputs %}
Patient Name {{ loop.index }}: {{ entry.patient_name }}
{% endfor %}
Provide a single, standardized patient name that represents all the matched entries. Use the format "LastName, FirstName MiddleInitial" if available.""",
    output={"schema": {"patient_name": "string"}},
)
rows = frame.collect()
- The comparison_prompt compares pairs of entries; reference the two documents via input1 and input2.
- For identified duplicates, the resolution_prompt generates a standardized value; reference all matched entries via the inputs variable.
- Prompts use Jinja2 syntax (e.g., input1.patient_name).
Automatic Blocking
If you don't specify any blocking configuration (blocking_threshold, blocking_conditions, or limit_comparisons), the Resolve operation will automatically compute an optimal embedding-based blocking threshold at runtime. It samples pairs from your data, runs LLM comparisons on the sample, and finds a threshold that achieves 95% recall by default. You can adjust this with the blocking_target_recall parameter.
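The threshold search described above can be pictured with a minimal sketch. It assumes each sampled pair already carries an embedding similarity and an LLM match label. The function name threshold_for_recall and the tuple layout are hypothetical, not part of DocETL's API, and the library's actual procedure may differ.

```python
import math


def threshold_for_recall(scored_pairs, target_recall=0.95):
    """Return the highest similarity threshold that still keeps at least
    `target_recall` of the sampled true matches.

    scored_pairs: list of (similarity, is_match) tuples, where is_match is
    the label an LLM comparison assigned to the sampled pair (assumed input).
    """
    # Similarities of the pairs the LLM judged to be matches, highest first.
    match_sims = sorted((s for s, m in scored_pairs if m), reverse=True)
    if not match_sims:
        return None  # no matches in the sample, so nothing to calibrate on
    # How many matches must sit at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(match_sims)))
    return match_sims[needed - 1]


# Hypothetical sample: four true matches and two non-matches.
sample = [
    (0.95, True), (0.90, True), (0.85, True), (0.60, True),
    (0.70, False), (0.30, False),
]
print(threshold_for_recall(sample, target_recall=0.75))  # 0.85
print(threshold_for_recall(sample, target_recall=0.95))  # 0.6
```

A higher target recall pushes the threshold down, which admits more candidate pairs and therefore more LLM comparisons.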
Blocking
Blocking reduces the number of comparisons by only comparing entries that are likely to be matches. Two types work together:
- Code-based blocking: Apply custom Python expressions to determine if a pair should be compared.
- Embedding-based blocking: Compare embeddings of specified fields and only process pairs above a certain similarity threshold.
How Blocking Works
The Resolve operation creates a union of pairs that pass either blocking method:
- First, pairs that satisfy any of the blocking_conditions are selected
- Then, pairs that meet the blocking_threshold for embedding similarity are added (if not already included)
- When sampling is needed (via limit_comparisons), code-based pairs are prioritized over embedding-based pairs
Example with both blocking methods:
- name: standardize_patient_names
  type: resolve
  comparison_prompt: |
    # (Same as previous example)
  resolution_prompt: |
    # (Same as previous example)
  output:
    schema:
      patient_name: string
  blocking_keys:
    - last_name
    - date_of_birth
  blocking_threshold: 0.8
  blocking_conditions:
    - "input1['last_name'][:2].lower() == input2['last_name'][:2].lower()"
    - "input1['first_name'][:2].lower() == input2['first_name'][:2].lower()"
    - "input1['date_of_birth'] == input2['date_of_birth']"
    - "input1['ssn'][-4:] == input2['ssn'][-4:]"
frame = frame.resolve(
    name="standardize_patient_names",
    comparison_prompt="...",  # (Same as previous example)
    resolution_prompt="...",  # (Same as previous example)
    output={"schema": {"patient_name": "string"}},
    blocking_keys=["last_name", "date_of_birth"],
    blocking_threshold=0.8,
    blocking_conditions=[
        "input1['last_name'][:2].lower() == input2['last_name'][:2].lower()",
        "input1['first_name'][:2].lower() == input2['first_name'][:2].lower()",
        "input1['date_of_birth'] == input2['date_of_birth']",
        "input1['ssn'][-4:] == input2['ssn'][-4:]",
    ],
)
In this example, pairs will be considered for comparison if they satisfy any of the following:
Code-based conditions:
- The last_name fields start with the same two characters, OR
- The first_name fields start with the same two characters, OR
- The date_of_birth fields match exactly, OR
- The last four digits of the ssn fields match
OR
Embedding-based condition:
- The embedding similarity of their last_name and date_of_birth fields is above 0.8
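The selection logic above can be sketched in plain Python to make the union of the two blocking methods concrete. This is an illustration under stated assumptions, not DocETL's implementation: candidate_pairs and cosine are hypothetical helpers, the embeddings are toy vectors rather than model output, "meets the threshold" is taken to mean greater than or equal, and limit_comparisons is modeled as simple truncation with code-based pairs first.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(items, embeddings, conditions, threshold, limit=None):
    """Union of pairs passing any code condition or the embedding threshold."""
    code_pairs, embedding_pairs = [], []
    for i, j in combinations(range(len(items)), 2):
        scope = {"input1": items[i], "input2": items[j]}
        # Conditions are trusted expressions from the pipeline config,
        # so evaluating them with eval is acceptable in this sketch.
        if any(eval(cond, {}, scope) for cond in conditions):
            code_pairs.append((i, j))
        elif cosine(embeddings[i], embeddings[j]) >= threshold:
            embedding_pairs.append((i, j))  # added only if not already included
    pairs = code_pairs + embedding_pairs  # code-based pairs take priority
    return pairs if limit is None else pairs[:limit]


items = [
    {"last_name": "Smith", "date_of_birth": "1980-01-01"},
    {"last_name": "Smith", "date_of_birth": "1980-01-01"},
    {"last_name": "Wong", "date_of_birth": "1975-06-30"},
    {"last_name": "Wang", "date_of_birth": "1975-06-03"},
]
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
conditions = [
    "input1['last_name'][:2].lower() == input2['last_name'][:2].lower()",
    "input1['date_of_birth'] == input2['date_of_birth']",
]
print(candidate_pairs(items, toy_embeddings, conditions, threshold=0.8))
# [(0, 1), (2, 3)]
```

Here the two Smith records pass a code-based condition, while the Wong/Wang records fail every condition and are admitted only through embedding similarity.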
How the Comparison Algorithm Works
After determining eligible pairs, the Resolve operation groups similar items with a Union-Find (Disjoint Set Union) algorithm:
1. Initialization: Each item starts in its own cluster.
2. Pair Generation: All eligible pairs (those that passed blocking) are collected for comparison.
3. Batch Processing: Pairs are processed in batches (controlled by compare_batch_size).
4. Comparison: For each batch:
   a. An LLM performs pairwise comparisons to determine if items match.
   b. Matching pairs trigger a merge_clusters operation to combine their clusters.
5. Iteration: Steps 3-4 repeat until all pairs are compared.
6. Result Collection: All non-empty clusters are collected as the final result.
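The steps above can be sketched with a small Union-Find structure. This is a simplified illustration, not the library's code: UnionFind and resolve_clusters are hypothetical names, is_match stands in for the LLM comparison, and pairs within a batch are compared sequentially here rather than in parallel.

```python
class UnionFind:
    """Minimal disjoint-set structure over item indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))  # each item starts in its own cluster

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a  # merge b's cluster into a's


def resolve_clusters(n_items, pairs, is_match, compare_batch_size=500):
    """Compare pairs batch by batch and merge the clusters of matching pairs."""
    uf = UnionFind(n_items)
    for start in range(0, len(pairs), compare_batch_size):
        batch = pairs[start:start + compare_batch_size]
        for a, b in batch:
            if is_match(a, b):  # stand-in for the LLM comparison call
                uf.union(a, b)
    clusters = {}
    for i in range(n_items):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())


names = ["Jon Smith", "John Smith", "Alice Wong", "J. Smith"]
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


def same_surname(a, b):
    # Toy matcher used in place of an LLM: equal last word means a match.
    return names[a].split()[-1] == names[b].split()[-1]


print(resolve_clusters(len(names), pairs, same_surname, compare_batch_size=2))
# [[0, 1, 3], [2]]
```

Because merges are transitive, items 1 and 3 end up in the same cluster as soon as each has matched item 0, even before their own pair is compared.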
Efficiency
Batched comparisons let clusters update incrementally as matches are found, with LLM calls running in parallel. Parallelism is capped at the batch size, so set compare_batch_size based on your dataset size and rate limits.
Required Parameters
- type: Must be set to "resolve".
- comparison_prompt: The prompt template to use for comparing potential matches.
- resolution_prompt: The prompt template to use for reducing matched entries.
- output: Schema definition for the output from the LLM.
Optional Parameters
| Parameter | Description | Default |
|---|---|---|
| embedding_model | The model to use for creating embeddings | Falls back to default_model |
| resolution_model | The language model to use for reducing matched entries | Falls back to default_model |
| comparison_model | The language model to use for comparing potential matches | Falls back to default_model |
| blocking_keys | List of keys to use for initial blocking | All keys in the input data |
| blocking_threshold | Embedding similarity threshold for considering entries as potential matches | Auto-computed if not set |
| blocking_target_recall | Target recall when auto-computing blocking threshold (0.0 to 1.0) | 0.95 |
| blocking_conditions | List of conditions for initial blocking | [] |
| input | Specifies the schema or keys to subselect from each item to pass into the prompts | All keys from input items |
| embedding_batch_size | The number of entries to send to the embedding model at a time | 1000 |
| compare_batch_size | The number of entity pairs processed in each batch during the comparison phase | 500 |
| limit_comparisons | Maximum number of comparisons to perform | None |
| timeout | Timeout for each LLM call in seconds | 120 |
| max_retries_per_timeout | Maximum number of retries per timeout | 2 |
| sample | Number of samples to use for the operation | None |
| litellm_completion_kwargs | Additional parameters to pass to LiteLLM completion calls. | {} |
| bypass_cache | If true, bypass the cache for this operation. | False |
| cascade | Model cascade config for cost reduction on candidate pair comparisons (see below) | None |
Model Cascade (cost reduction)
A cascade block runs a cheap proxy model on all candidate pairs first and only escalates uncertain comparisons to the expensive oracle, with a statistical quality guarantee. Most effective when blocking produces many candidate pairs.
cascade:
  proxy_model: gpt-4o-mini
  target: 0.9
# Pass via the cascade= kwarg on a resolve call
frame = frame.resolve(
    ...,
    cascade={"proxy_model": "gpt-4o-mini", "target": 0.9},
)
| Parameter | Description | Default |
|---|---|---|
| proxy_model | The cheap model for the proxy pass (required) | — |
| guarantee | accuracy, precision, recall, or precision+recall | precision |
| target | Target value for the guarantee metric, in (0, 1) (required) | — |
| delta | Failure probability; the guarantee holds with probability 1 - delta | 0.05 |
| label_budget | Max oracle calls spent learning the threshold | 400 |
See Model Cascades with BARGAIN for full details, guarantee explanations, and examples.
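The routing idea behind a cascade can be shown with a short sketch. It makes strong simplifying assumptions: the proxy is assumed to return a decision plus a confidence score, the confidence threshold is fixed by hand, and the calibration step that spends up to label_budget oracle calls to obtain the statistical guarantee is not reproduced. The name cascade_compare and both callables are hypothetical.

```python
def cascade_compare(pairs, proxy, oracle, threshold):
    """Route each pair through the proxy, escalating low-confidence ones.

    proxy(pair)  -> (decision, confidence)   # cheap model (assumed interface)
    oracle(pair) -> decision                 # expensive model (assumed interface)
    """
    decisions, oracle_calls = {}, 0
    for pair in pairs:
        decision, confidence = proxy(pair)
        if confidence < threshold:
            # The proxy is unsure, so pay for the expensive comparison.
            decision = oracle(pair)
            oracle_calls += 1
        decisions[pair] = decision
    return decisions, oracle_calls


pairs = [
    ("Jon Smith", "John Smith"),
    ("Alice Wong", "Bob Lee"),
    ("J. Smith", "John Smith"),
]
# Canned proxy outputs standing in for a cheap model's answers.
proxy_outputs = {
    pairs[0]: (True, 0.95),
    pairs[1]: (False, 0.99),
    pairs[2]: (True, 0.55),
}


def toy_proxy(pair):
    return proxy_outputs[pair]


def toy_oracle(pair):
    # Toy stand-in for the expensive model: match on identical surname.
    return pair[0].split()[-1] == pair[1].split()[-1]


decisions, oracle_calls = cascade_compare(pairs, toy_proxy, toy_oracle, threshold=0.9)
print(oracle_calls)  # 1
```

Only the third pair falls below the confidence threshold, so a single oracle call is made for three comparisons.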
Best Practices
- Anticipate Resolve Needs: If you anticipate needing a Resolve operation and want to control the prompts, create it in your pipeline and let the optimizer find the appropriate blocking rules and thresholds.
- Let the Optimizer Help: The optimizer can detect if you need a Resolve operation (e.g., because there's a downstream reduce operation you're optimizing) and can create one with suitable prompts and blocking rules.
- Optimize Batch Size: compare_batch_size caps parallelism, so increase it when comparing many pairs, but very large values can exceed memory or API rate limits.