# Resolve Operation
The Resolve operation in DocETL identifies and canonicalizes duplicate entities in your data. It's particularly useful when dealing with inconsistencies that can arise from LLM-generated content, or data from multiple sources.
## Motivation
Map operations executed by LLMs may sometimes yield inconsistent results, even when referring to the same entity. For example, when extracting patient names from medical transcripts, you might end up with variations like "Mrs. Smith" and "Jane Smith" for the same person. In such cases, a Resolve operation on the `patient_name` field can help standardize patient names before conducting further analysis.
## 🚀 Example: Standardizing Patient Names
Let's see a practical example of using the Resolve operation to standardize patient names extracted from medical transcripts.
```yaml
- name: standardize_patient_names
  type: resolve
  optimize: true
  comparison_prompt: |
    Compare the following two patient name entries:
    Patient 1: {{ input1.patient_name }}
    Date of Birth 1: {{ input1.date_of_birth }}
    Patient 2: {{ input2.patient_name }}
    Date of Birth 2: {{ input2.date_of_birth }}
    Are these entries likely referring to the same patient? Consider name similarity and date of birth. Respond with "True" if they are likely the same patient, or "False" if they are likely different patients.
  resolution_prompt: |
    Standardize the following patient name entries into a single, consistent format:
    {% for entry in inputs %}
    Patient Name {{ loop.index }}: {{ entry.patient_name }}
    {% endfor %}
    Provide a single, standardized patient name that represents all the matched entries. Use the format "LastName, FirstName MiddleInitial" if available.
  output:
    schema:
      patient_name: string
```
This Resolve operation processes patient names to identify and standardize duplicates:

- It compares all pairs of patient names using the `comparison_prompt`. In the prompt, you can reference the two documents via `input1` and `input2`.
- For identified duplicates, it applies the `resolution_prompt` to generate a standardized name. You can reference all matched entries via the `inputs` variable.

Note: The prompt templates use Jinja2 syntax, allowing you to reference input fields directly (e.g., `input1.patient_name`).
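To make the template variables concrete, here is a minimal sketch of how these prompts might be rendered, using the `jinja2` package directly rather than DocETL itself. The two records and their field values are hypothetical:

```python
from jinja2 import Template

# Hypothetical records, as they might look after a map operation.
record_a = {"patient_name": "Mrs. Smith", "date_of_birth": "1985-03-02"}
record_b = {"patient_name": "Jane Smith", "date_of_birth": "1985-03-02"}

comparison_template = Template(
    "Compare the following two patient name entries:\n"
    "Patient 1: {{ input1.patient_name }}\n"
    "Date of Birth 1: {{ input1.date_of_birth }}\n"
    "Patient 2: {{ input2.patient_name }}\n"
    "Date of Birth 2: {{ input2.date_of_birth }}\n"
    "Are these entries likely referring to the same patient?"
)

# input1 and input2 are the two records being compared.
print(comparison_template.render(input1=record_a, input2=record_b))

resolution_template = Template(
    "Standardize the following patient name entries:\n"
    "{% for entry in inputs %}"
    "Patient Name {{ loop.index }}: {{ entry.patient_name }}\n"
    "{% endfor %}"
)

# inputs is the list of all records in a matched group.
print(resolution_template.render(inputs=[record_a, record_b]))
```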
### Performance Consideration
You should not run this operation as-is unless your dataset is small! Running O(n^2) comparisons with an LLM can be extremely time-consuming for large datasets. Instead, optimize your pipeline first using `docetl build pipeline.yaml` and run the optimized version, which will generate efficient blocking rules for the operation. Make sure you've set `optimize: true` in your resolve operation config.
## Blocking
To improve efficiency, the Resolve operation supports "blocking" - a technique to reduce the number of comparisons by only comparing entries that are likely to be matches. DocETL supports two types of blocking:
- Embedding similarity: Compare embeddings of specified fields and only process pairs above a certain similarity threshold.
- Python conditions: Apply custom Python expressions to determine if a pair should be compared.
Here's an example of a Resolve operation with blocking:
```yaml
- name: standardize_patient_names
  type: resolve
  comparison_prompt: |
    # (Same as previous example)
  resolution_prompt: |
    # (Same as previous example)
  output:
    schema:
      patient_name: string
  blocking_keys:
    - last_name
    - date_of_birth
  blocking_threshold: 0.8
  blocking_conditions:
    - "left['last_name'][:2].lower() == right['last_name'][:2].lower()"
    - "left['first_name'][:2].lower() == right['first_name'][:2].lower()"
    - "left['date_of_birth'] == right['date_of_birth']"
    - "left['ssn'][-4:] == right['ssn'][-4:]"
```
In this example, pairs will be considered for comparison if:

- The embedding similarity of their `last_name` and `date_of_birth` fields is above 0.8, OR
- The `last_name` fields start with the same two characters, OR
- The `first_name` fields start with the same two characters, OR
- The `date_of_birth` fields match exactly, OR
- The last four digits of the `ssn` fields match.
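For intuition, here is a minimal sketch of how this kind of blocking might be evaluated for a single pair. It is not DocETL's implementation; the cosine-similarity helper and the precomputed embeddings passed in as arguments are assumptions made for illustration:

```python
import math

def cosine_similarity(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

BLOCKING_THRESHOLD = 0.8
BLOCKING_CONDITIONS = [
    "left['last_name'][:2].lower() == right['last_name'][:2].lower()",
    "left['first_name'][:2].lower() == right['first_name'][:2].lower()",
    "left['date_of_birth'] == right['date_of_birth']",
    "left['ssn'][-4:] == right['ssn'][-4:]",
]

def should_compare(left, right, left_emb, right_emb):
    """Return True if the pair passes any blocking rule (embedding OR conditions)."""
    if cosine_similarity(left_emb, right_emb) >= BLOCKING_THRESHOLD:
        return True
    # Each condition is a Python expression over the `left` and `right` dicts.
    return any(
        eval(cond, {}, {"left": left, "right": right})  # eval used only for illustration
        for cond in BLOCKING_CONDITIONS
    )
```

Only pairs for which `should_compare` returns True would go on to the LLM comparison step, which is what keeps the number of comparisons well below O(n^2).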
## How the Comparison Algorithm Works
After determining eligible pairs for comparison, the Resolve operation uses a Union-Find (Disjoint Set Union) algorithm to efficiently group similar items. Here's a breakdown of the process:
1. Initialization: Each item starts in its own cluster.
2. Pair Generation: All possible pairs of items are generated for comparison.
3. Batch Processing: Pairs are processed in batches (controlled by `compare_batch_size`).
4. Comparison: For each batch:
   a. An LLM performs pairwise comparisons to determine if items match.
   b. Matching pairs trigger a `merge_clusters` operation to combine their clusters.
5. Iteration: Steps 3-4 repeat until all pairs are compared.
6. Result Collection: All non-empty clusters are collected as the final result.
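The following is a simplified sketch of this clustering logic, with a placeholder `llm_says_match` callable standing in for the LLM comparison; it illustrates the Union-Find approach rather than reproducing DocETL's actual code:

```python
from itertools import combinations

def resolve_clusters(items, llm_says_match, batch_size=500):
    """Group items into clusters with Union-Find, merging when the LLM reports a match."""
    parent = list(range(len(items)))  # each item starts in its own cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def merge_clusters(i, j):
        parent[find(i)] = find(j)

    pairs = list(combinations(range(len(items)), 2))  # all eligible pairs
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        for i, j in batch:
            # In DocETL this is an LLM call driven by comparison_prompt.
            if llm_says_match(items[i], items[j]):
                merge_clusters(i, j)

    # Collect the final clusters, keyed by each item's root.
    clusters = {}
    for idx in range(len(items)):
        clusters.setdefault(find(idx), []).append(items[idx])
    return list(clusters.values())
```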
### Efficiency
The batch processing of comparisons allows for efficient, incremental clustering as matches are found, without needing to rebuild the entire cluster structure after each match. It also enables parallelization of LLM calls within each batch, improving overall performance. However, it limits parallelism to the batch size, so choose an appropriate value for `compare_batch_size` based on your dataset size and system capabilities.
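As a rough illustration of why the batch size caps parallelism, one batch of comparisons could be fanned out with a thread pool as below. This is a hedged sketch, not DocETL's actual concurrency code; `llm_says_match` is again a placeholder for the LLM comparison call:

```python
from concurrent.futures import ThreadPoolExecutor

def compare_batch(batch, llm_says_match):
    """Run one batch of pairwise comparisons concurrently.

    Only pairs within the current batch can overlap, so compare_batch_size
    caps how many LLM calls are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(batch))) as pool:
        results = list(pool.map(lambda pair: llm_says_match(*pair), batch))
    # Keep only the pairs the LLM judged to be the same entity.
    return [pair for pair, matched in zip(batch, results) if matched]
```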
## Required Parameters
- `type`: Must be set to "resolve".
- `comparison_prompt`: The prompt template to use for comparing potential matches.
- `resolution_prompt`: The prompt template to use for reducing matched entries.
- `output`: Schema definition for the output from the LLM.
## Optional Parameters
| Parameter | Description | Default |
|---|---|---|
| `embedding_model` | The model to use for creating embeddings | Falls back to `default_model` |
| `resolution_model` | The language model to use for reducing matched entries | Falls back to `default_model` |
| `comparison_model` | The language model to use for comparing potential matches | Falls back to `default_model` |
| `blocking_keys` | List of keys to use for initial blocking | All keys in the input data |
| `blocking_threshold` | Embedding similarity threshold for considering entries as potential matches | None |
| `blocking_conditions` | List of conditions for initial blocking | `[]` |
| `input` | Specifies the schema or keys to subselect from each item to pass into the prompts | All keys from input items |
| `embedding_batch_size` | The number of entries to send to the embedding model at a time | 1000 |
| `compare_batch_size` | The number of entity pairs processed in each batch during the comparison phase | 500 |
| `limit_comparisons` | Maximum number of comparisons to perform | None |
| `timeout` | Timeout for each LLM call in seconds | 120 |
| `max_retries_per_timeout` | Maximum number of retries per timeout | 2 |
| `sample` | Number of samples to use for the operation | None |
| `litellm_completion_kwargs` | Additional parameters to pass to LiteLLM completion calls | `{}` |
## Best Practices
- Anticipate Resolve Needs: If you anticipate needing a Resolve operation and want to control the prompts, create it in your pipeline and let the optimizer find the appropriate blocking rules and thresholds.
- Let the Optimizer Help: The optimizer can detect if you need a Resolve operation (e.g., because there's a downstream reduce operation you're optimizing) and can create a Resolve operation with suitable prompts and blocking rules.
- Effective Comparison Prompts: Design comparison prompts that consider all relevant factors for determining matches.
- Detailed Resolution Prompts: Create resolution prompts that effectively standardize or combine information from matched records.
- Appropriate Model Selection: Choose suitable models for embedding (if used) and language tasks.
- Optimize Batch Size: If you expect to compare a large number of pairs, consider increasing `compare_batch_size`. This parameter effectively limits parallelism, so a larger value can improve performance for large datasets.
### Balancing Batch Size
While increasing `compare_batch_size` can improve parallelism, be cautious not to set it too high. Extremely large batch sizes might overwhelm system memory or exceed API rate limits. Consider your system's capabilities and the characteristics of your dataset when adjusting this parameter.
The Resolve operation is particularly useful for data cleaning, deduplication, and creating standardized records from multiple data sources. It can significantly improve data quality and consistency in your dataset.