Best Practices for DocETL

This guide outlines key best practices for using DocETL effectively, focusing on the most important aspects of pipeline creation, execution, and optimization.

Pipeline Design

Start Simple: Begin with a basic pipeline and gradually add complexity as needed.
Modular Design: Break down complex tasks into smaller, manageable operations.
Optimize Incrementally: Optimize one operation at a time to ensure stability and verify improvements.

Schema and Prompt Design

Keep Schemas Simple: Use simple output schemas whenever possible. Complex nested structures can be difficult for LLMs to produce consistently.
Clear and Concise Prompts: Write clear, concise prompts for LLM operations, providing relevant context from input data. Instruct quantities (e.g., 2-3 insights, one summary) to guide the LLM.
Take advantage of Jinja Templating: Use Jinja templating to dynamically generate prompts and provide context to the LLM. Feel free to use if statements, loops, and other Jinja features to customize prompts.
Validate Outputs: Use the validate field to ensure the quality and correctness of processed data. This consists of Python statements that validate the output and optionally retry the LLM if one or more statements fail.

Handling Large Documents and Entity Resolution

Chunk Large Inputs: For documents exceeding token limits, consider using the optimizer to automatically chunk inputs.
Use Resolve Operations: Implement resolve operations before reduce operations when dealing with similar entities. Take care to write the compare prompts well to guide the LLM--often the optimizer-synthesized prompts are too generic.

Optimization and Execution

Use the Optimizer: Leverage DocETL's optimizer for complex pipelines or when dealing with large documents.
Leverage Caching: Take advantage of DocETL's caching mechanism to avoid redundant computations.
Monitor Resource Usage: Keep an eye on API costs and processing time, especially when optimizing.