Equijoin Operation (Experimental)
The Equijoin operation in DocETL is an experimental feature designed for joining two datasets based on flexible, LLM-powered criteria. It leverages many of the same techniques as the Resolve operation, but applies them to the task of joining datasets rather than deduplicating within a single dataset.
Motivation
While traditional database joins rely on exact matches, real-world data often requires more nuanced joining criteria. Equijoin allows for joins based on semantic similarity or complex conditions, making it ideal for scenarios where exact matches are impossible or undesirable.
🚀 Example: Matching Job Candidates to Job Postings
Let's explore a practical example of using the Equijoin operation to match job candidates with suitable job postings based on skills and experience.
- name: match_candidates_to_jobs
  type: equijoin
  comparison_prompt: |
    Compare the following job candidate and job posting:
    Candidate Skills: {{ left.skills }}
    Candidate Experience: {{ left.years_experience }}
    Job Required Skills: {{ right.required_skills }}
    Job Desired Experience: {{ right.desired_experience }}
    Is this candidate a good match for the job? Consider both the overlap in skills and the candidate's experience level. Respond with "True" if it's a good match, or "False" if it's not a suitable match.
  output:
    schema:
      match_score: float
      match_rationale: string
This Equijoin operation matches job candidates to job postings:
- It uses the `comparison_prompt` to determine if a candidate is a good match for a job.
- The operation can be optimized to use efficient blocking rules, reducing the number of comparisons.
Jinja2 Syntax with left and right
The prompt template uses Jinja2 syntax, allowing you to reference input fields directly (e.g., `left.skills`). You can reference the left and right documents using `left` and `right`, respectively.
Performance Consideration
For large datasets, running comparisons with an LLM can be time-consuming. It's recommended to optimize your pipeline using `docetl build pipeline.yaml` to generate efficient blocking rules for the operation.
Blocking
Like the Resolve operation, Equijoin supports blocking techniques to improve efficiency. For details on how blocking works and how to implement it, please refer to the Blocking section in the Resolve operation documentation.
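The effect of blocking can be illustrated with a small sketch. This is not DocETL's implementation: the records, the `passes_block` rule, and the skill-overlap criterion are all hypothetical, chosen only to show how a cheap pre-filter shrinks the set of pairs that would otherwise go to the LLM.

```python
from itertools import product

# Hypothetical records; field names mirror the example prompt above.
candidates = [
    {"candidate_id": "c1", "skills": ["python", "sql"]},
    {"candidate_id": "c2", "skills": ["java", "spring"]},
    {"candidate_id": "c3", "skills": ["python", "pytorch"]},
]
job_postings = [
    {"job_id": "j1", "required_skills": ["python", "airflow"]},
    {"job_id": "j2", "required_skills": ["java", "kotlin"]},
]


def passes_block(left: dict, right: dict) -> bool:
    """Cheap blocking rule (assumed): keep a pair only if the skill sets overlap."""
    return bool(set(left["skills"]) & set(right["required_skills"]))


# Without blocking, every left record is compared with every right record.
all_pairs = list(product(candidates, job_postings))

# With blocking, only pairs that pass the cheap rule reach the LLM comparison.
blocked_pairs = [(l, r) for l, r in all_pairs if passes_block(l, r)]

print(f"pairs without blocking: {len(all_pairs)}")
print(f"pairs after blocking: {len(blocked_pairs)}")
```

A stricter rule removes more comparisons but risks dropping true matches, which is the precision and recall trade-off mentioned under Best Practices.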
Parameters
Equijoin shares many parameters with the Resolve operation. For a detailed list of required and optional parameters, please see the Parameters section in the Resolve operation documentation.
Key differences for Equijoin include:
- `resolution_prompt` is not used in Equijoin.
- The `limits` parameter is specific to Equijoin, allowing you to set maximum matches for each left and right item.
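To make the idea of per-item match limits concrete, here is a minimal sketch of one way such caps could be enforced. It is an assumption for illustration only: the `apply_limits` helper, its greedy first-come ordering, and the sample IDs are hypothetical, and this page does not say how DocETL itself applies `limits`.

```python
from collections import Counter


def apply_limits(matches, left_limit, right_limit):
    """Keep matches in order, skipping any that would exceed a per-item cap.

    `matches` is a list of (left_id, right_id) pairs already judged a match.
    """
    left_counts, right_counts = Counter(), Counter()
    kept = []
    for left_id, right_id in matches:
        # Skip the pair if either side has already reached its cap.
        if left_counts[left_id] >= left_limit or right_counts[right_id] >= right_limit:
            continue
        kept.append((left_id, right_id))
        left_counts[left_id] += 1
        right_counts[right_id] += 1
    return kept


# Hypothetical matches: each candidate may keep 1 job, each job up to 2 candidates.
matches = [("c1", "j1"), ("c1", "j2"), ("c2", "j1"), ("c3", "j1"), ("c3", "j2")]
kept = apply_limits(matches, left_limit=1, right_limit=2)
print(kept)
```

Because this sketch is greedy, the result depends on the order in which matches arrive; a real system might rank or reorder pairs first.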
Incorporating Into a Pipeline
Here's an example of how to incorporate the Equijoin operation into a pipeline using the job candidate matching scenario:
model: gpt-4o-mini

datasets:
  candidates:
    type: file
    path: /path/to/candidates.json
  job_postings:
    type: file
    path: /path/to/job_postings.json

operations:
  match_candidates_to_jobs:
    type: equijoin
    join_key:
      left:
        name: candidate_id
      right:
        name: job_id
    comparison_prompt: |
      Compare the following job candidate and job posting:
      Candidate Skills: {{ left.skills }}
      Candidate Experience: {{ left.years_experience }}
      Job Required Skills: {{ right.required_skills }}
      Job Desired Experience: {{ right.desired_experience }}
      Is this candidate a good match for the job? Consider both the overlap in skills and the candidate's experience level. Respond with "True" if it's a good match, or "False" if it's not a suitable match.
    output:
      schema:
        match_score: float
        match_rationale: string

pipeline:
  steps:
    - name: match_candidates_to_jobs
      operations:
        - match_candidates_to_jobs:
            left: candidates
            right: job_postings

  output:
    type: file
    path: "/path/to/matched_candidates_jobs.json"
This pipeline configuration demonstrates how to use the Equijoin operation to match job candidates with job postings. The pipeline reads candidate and job posting data from JSON files, performs the matching using the defined comparison prompt, and outputs the results to a new JSON file.
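Conceptually, the join step behaves like a nested loop in which a comparison function decides whether each pair is kept. The sketch below is not how DocETL is implemented: `is_match` is a hypothetical rule standing in for the LLM call, the sample records are invented, and merging the two records into one dict is an assumption about the output shape.

```python
def is_match(left: dict, right: dict) -> bool:
    """Stand-in for the LLM comparison driven by comparison_prompt (hypothetical rule)."""
    shared = set(left["skills"]) & set(right["required_skills"])
    return bool(shared) and left["years_experience"] >= right["desired_experience"]


def semantic_join(left_rows, right_rows, compare):
    """Nested-loop join: emit one merged record per pair the comparator accepts."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            if compare(left, right):
                # Assumed output shape: fields of both records in one dict.
                joined.append({**left, **right})
    return joined


# Hypothetical inputs shaped like the fields referenced in the prompt.
candidates = [
    {"candidate_id": "c1", "skills": ["python", "sql"], "years_experience": 5},
    {"candidate_id": "c2", "skills": ["java"], "years_experience": 1},
]
job_postings = [
    {"job_id": "j1", "required_skills": ["python"], "desired_experience": 3},
    {"job_id": "j2", "required_skills": ["java"], "desired_experience": 4},
]

results = semantic_join(candidates, job_postings, is_match)
for row in results:
    print(row["candidate_id"], "->", row["job_id"])
```

Swapping `is_match` for a call to a language model is what turns this exact-rule join into a semantic one; the loop structure stays the same.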
Best Practices
- Leverage the Optimizer: Use `docetl build pipeline.yaml` to automatically generate efficient blocking rules for your Equijoin operation.
- Craft Thoughtful Comparison Prompts: Design prompts that effectively determine whether two records should be joined based on your specific use case.
- Balance Precision and Recall: When optimizing, consider the trade-off between catching all potential matches and reducing unnecessary comparisons.
- Mind Resource Constraints: Use `limit_comparisons` if you need to cap the total number of comparisons for large datasets.
- Iterate and Refine: Start with a small sample of your data to test and refine your join criteria before running on the full dataset.
For additional best practices that apply to both Resolve and Equijoin operations, see the Best Practices section in the Resolve operation documentation.
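For the "Iterate and Refine" practice, one simple way to build a small pilot dataset is to draw a reproducible random sample before running the full join. This is a sketch under assumptions: `sample_records` is a hypothetical helper, not part of DocETL, and the generated records stand in for data loaded from a file such as candidates.json.

```python
import random


def sample_records(records, k, seed=0):
    """Return up to k records, chosen reproducibly with a fixed seed."""
    if len(records) <= k:
        return list(records)
    return random.Random(seed).sample(records, k)


# Hypothetical stand-in for records loaded from candidates.json.
candidates = [{"candidate_id": f"c{i}"} for i in range(100)]
pilot = sample_records(candidates, k=10)
print(len(pilot))
```

Fixing the seed keeps the pilot set stable across runs, so changes in results reflect changes to the prompt or blocking rules rather than to the sample.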