Running the Optimizer
Optimizer Stability
The optimization process can be unstable, as well as resource-intensive (we've seen it take up to 10 minutes to optimize a single operation, spending up to ~$50 in API costs for end-to-end pipelines). We recommend optimizing one operation at a time and retrying if necessary, as results may vary between runs. This approach also allows you to confidently verify that each optimized operation is performing as expected before moving on to the next.
See the Optimizer API section below for details on resuming the optimizer from a failed run by rerunning docetl build pipeline.yaml --resume.
You can also use gpt-4o-mini for cheaper optimizations (rather than the default gpt-4o) by running docetl build pipeline.yaml --model=gpt-4o-mini.
To optimize your pipeline, start with your initial configuration and follow these steps:
1. Set optimize: True for the operation you want to optimize (start with the first operation, if you're not sure which one); see the sketch just after this list.
2. Run the optimizer using the command docetl build pipeline.yaml. This will generate an optimized version in pipeline_opt.yaml.
3. Review the optimized operation in pipeline_opt.yaml. If you're satisfied with the changes, copy the optimized operation back into your original pipeline.yaml.
4. Move on to the next LLM-powered operation and repeat steps 1-3.
5. Once all operations are optimized, your pipeline.yaml will contain the fully optimized pipeline.
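For example, step 1 for a single map operation might look like the following sketch (the operation name, schema, and prompt are illustrative, mirroring the medical-transcripts example later on this page):

operations:
  - name: extract_medications
    type: map
    optimize: true   # flag this operation; docetl build pipeline.yaml will rewrite it
    output:
      schema:
        medication: list[str]
    prompt: |
      Analyze the transcript: {{ input.src }}
      List all medications mentioned.

Running docetl build pipeline.yaml then writes the rewritten operation to pipeline_opt.yaml for you to review.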
When optimizing a resolve operation, the optimizer will also set blocking configurations and thresholds, saving you from manual configuration.
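For reference, the blocking settings the optimizer emits on a resolve operation look roughly like this (a sketch; the field names match the resolve operation in the optimized example later on this page, and the threshold value is illustrative):

- name: resolve_medications
  type: resolve
  blocking_keys:            # blocking narrows which pairs of records are compared
    - medication
  blocking_threshold: 0.7   # similarity cutoff chosen by the optimizer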
Feeling Ambitious?
You can run the optimizer on your entire pipeline by setting optimize: True for each operation you want to optimize. But sometimes the agent fails to find a better plan, and you'll need to manually intervene. We are exploring human-in-the-loop optimization, where the optimizer can ask for human feedback to improve its plans.
Example: Optimizing a Medical Transcripts Pipeline
Let's walk through optimizing a pipeline for extracting medication information from medical transcripts. We'll start with an initial pipeline and optimize it step by step.
Initial Pipeline
datasets:
  transcripts:
    path: medical_transcripts.json
    type: file
default_model: gpt-4o-mini
operations:
  - name: extract_medications
    type: map
    optimize: true
    output:
      schema:
        medication: list[str]
    prompt: |
      Analyze the transcript: {{ input.src }}
      List all medications mentioned.
  - name: unnest_medications
    type: unnest
    unnest_key: medication
  - name: summarize_prescriptions
    type: reduce
    optimize: true
    reduce_key:
      - medication
    output:
      schema:
        side_effects: str
        uses: str
    prompt: |
      Summarize side effects and uses of {{ reduce_key }} from:
      {% for value in inputs %}
      Transcript {{ loop.index }}: {{ value.src }}
      {% endfor %}
pipeline:
  output:
    path: medication_summaries.json
    type: file
  steps:
    - input: transcripts
      name: medical_info_extraction
      operations:
        - extract_medications
        - unnest_medications
        - summarize_prescriptions
Optimization Steps
First, we'll optimize the extract_medications operation. Set optimize: True for this operation and run the optimizer. Review the changes and integrate them into your pipeline.
Then, optimize the summarize_prescriptions operation by setting optimize: True and running docetl build pipeline.yaml again. The optimizer may suggest adding a resolve operation at this point, and will automatically configure blocking and thresholds. After completing all steps, your optimized pipeline might look like this:
Optimized Pipeline
datasets:
  transcripts:
    path: medical_transcripts.json
    type: file
default_model: gpt-4o-mini
operations:
  - name: extract_medications
    type: map
    output:
      schema:
        medication: list[str]
    prompt: |
      Analyze the transcript: {{ input.src }}
      List all medications mentioned.
    # Added by the optimizer: a validation round to catch missed or incorrect extractions
    gleaning:
      num_rounds: 1
      validation_prompt: |
        Evaluate the extraction for completeness and accuracy:
        1. Are all medications, dosages, and symptoms from the transcript included?
        2. Is the extracted information correct and relevant?
  - name: unnest_medications
    type: unnest
    unnest_key: medication
  # Added by the optimizer: standardize medication names before summarizing
  - name: resolve_medications
    type: resolve
    blocking_keys:
      - medication
    blocking_threshold: 0.7
    comparison_prompt: |
      Compare medications:
      1: {{ input1.medication }}
      2: {{ input2.medication }}
      Are these the same or closely related?
    resolution_prompt: |
      Standardize the name for:
      {% for entry in inputs %}
      - {{ entry.medication }}
      {% endfor %}
  - name: summarize_prescriptions
    type: reduce
    reduce_key:
      - medication
    output:
      schema:
        side_effects: str
        uses: str
    prompt: |
      Summarize side effects and uses of {{ reduce_key }} from:
      {% for value in inputs %}
      Transcript {{ loop.index }}: {{ value.src }}
      {% endfor %}
    # Added by the optimizer: process large groups of transcripts in batches of 10
    fold_batch_size: 10
    fold_prompt: |
      Update the existing summary of side effects and uses for {{ reduce_key }} based on the following additional transcripts:
      {% for value in inputs %}
      Transcript {{ loop.index }}: {{ value.src }}
      {% endfor %}
      Existing summary:
      Side effects: {{ output.side_effects }}
      Uses: {{ output.uses }}
      Provide an updated and comprehensive summary, incorporating both the existing information and any new insights from the additional transcripts.
pipeline:
  output:
    path: medication_summaries.json
    type: file
  steps:
    - input: transcripts
      name: medical_info_extraction
      operations:
        - extract_medications
        - unnest_medications
        - resolve_medications
        - summarize_prescriptions
This optimized pipeline now includes gleaning-based validation of the extraction step, a resolve operation that standardizes medication names (with automatically configured blocking), and a fold configuration that lets the reduce operation incrementally summarize large groups of transcripts.
Feedback Welcome
We're continually improving the optimizer. Your feedback on its performance and usability is invaluable. Please share your experiences and suggestions!
Optimizer API
docetl.cli.build(
    yaml_file=typer.Argument(..., help='Path to the YAML file containing the pipeline configuration'),
    optimizer=typer.Option('moar', '--optimizer', '-o', help="Optimizer to use: 'moar' (default) or 'v1' (deprecated)"),
    max_threads=typer.Option(None, help='Maximum number of threads to use for running operations'),
    resume=typer.Option(False, help='Resume optimization from a previous build that may have failed'),
    save_path=typer.Option(None, help='Path to save the optimized pipeline configuration'),
)
Build and optimize the configuration specified in the YAML file. Any arguments passed here will override the values in the YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| yaml_file | Path | Path to the YAML file containing the pipeline configuration. | Argument(..., help='Path to the YAML file containing the pipeline configuration') |
| optimizer | str | Optimizer to use: 'moar' (default) or 'v1' (deprecated). | Option('moar', '--optimizer', '-o', help="Optimizer to use: 'moar' (default) or 'v1' (deprecated)") |
| max_threads | int or None | Maximum number of threads to use for running operations. | Option(None, help='Maximum number of threads to use for running operations') |
| resume | bool | Whether to resume optimization from a previous run. Defaults to False. | Option(False, help='Resume optimization from a previous build that may have failed') |
| save_path | Path | Path to save the optimized pipeline configuration. | Option(None, help='Path to save the optimized pipeline configuration') |
Source code in docetl/cli.py