# Schemas
In DocETL, schemas play an important role in defining the structure of output from LLM operations. Every LLM call in DocETL is associated with an output schema, which specifies the expected format and types of the output data.
## Overview
- Schemas define the structure and types of output data from LLM operations.
- They help ensure consistency and facilitate downstream processing.
- DocETL uses structured outputs or the tools (function calling) API to enforce these schemas.
## Schema Simplicity
We've observed that the more complex the output schema is, the worse the quality of the output tends to be. Keep your schemas as simple as possible for better results.
## Defining Schemas
Schemas are defined in the `output` section of an operator. They support various data types:
| Type | Aliases | Description |
|------|---------|-------------|
| `string` | `str`, `text`, `varchar` | For text data |
| `integer` | `int` | For whole numbers |
| `number` | `float`, `decimal` | For decimal numbers |
| `boolean` | `bool` | For true/false values |
| `enum` | - | For a set of possible values |
| `list` | - | For arrays or sequences of items (must specify element type) |
| `object` | - | Using the notation `{field: type}` |
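For reference, here is an illustrative schema that exercises each of these types; the field names are made up:

```yaml
output:
  schema:
    title: string                                  # text data
    page_count: integer                            # whole number
    relevance_score: number                        # decimal number
    is_relevant: boolean                           # true/false value
    category: "enum[news, blog, paper]"            # one of a fixed set of values
    tags: "list[string]"                           # list with an element type
    author: "{name: string, affiliation: string}"  # object with typed fields
```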
### Filter Operation Schemas
Filter operation schemas must include a boolean output field, which determines whether each item is kept or dropped based on the filter criteria.
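For example, a minimal filter operation might look like the following; the operation name, prompt, and field name are illustrative:

```yaml
operations:
  - name: filter_relevant_docs
    type: filter
    prompt: "Determine whether the following document discusses customer complaints..."
    output:
      schema:
        keep_document: boolean  # decides whether each item is kept or dropped
```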
## Examples
### Simple Schema
```yaml
output:
  schema:
    summary: string
    sentiment: string
    include_item: boolean # For filter operations
```
### Complex Schema
```yaml
output:
  schema:
    insights: "list[{insight: string, confidence: number}]"
    metadata: "{timestamp: string, source: string}"
```
## Lists and Objects
Lists in schemas must specify their element type:
- `list[string]`: A list of strings
- `list[int]`: A list of integers
- `list[{name: string, age: integer}]`: A list of objects
Objects are defined using curly braces and must have typed fields:
`{name: string, age: integer, is_active: boolean}`
### Complex List Example
```yaml
output:
  schema:
    users: "list[{name: string, age: integer, hobbies: list[string]}]"
```
Note: make sure to put the type in quotation marks if it references an object type (i.e., contains curly braces); otherwise the YAML won't parse.
## Enum Types
You can also specify enum types, which will be validated against a set of possible values. Suppose we have an operation to extract sentiments from a document, and we want to ensure that the sentiment is one of the three possible values. Our schema would look like this:
```yaml
output:
  schema:
    sentiment: "enum[positive, negative, neutral]"
```
You can also specify a list of enum types (say, if we wanted to extract multiple sentiments from a document):
```yaml
output:
  schema:
    possible_sentiments: "list[enum[positive, negative, neutral]]"
```
## How We Enforce Schemas
DocETL uses structured outputs or the tools (function calling) API to enforce schema typing. This ensures that LLM outputs adhere to the specified schema, making the results more consistent and easier to process in subsequent operations.
DocETL supports two output modes that determine how the LLM generates structured outputs:
### Tools Mode (Default)
Uses the OpenAI tools/function calling API to enforce schema structure. This is the default mode and provides robust schema validation.
```yaml
output:
  schema:
    summary: string
    sentiment: string
  mode: "tools" # Optional - this is the default
```
### Structured Output Mode
Uses LiteLLM's structured output feature with JSON schema validation. This mode can provide more reliable schema adherence for complex outputs.
```yaml
output:
  schema:
    insights: "list[{insight: string, confidence: number}]"
  mode: "structured_output"
```
### When to Use Structured Output Mode
Consider using `structured_output` mode when:
- You have complex nested schemas with lists and objects
- You need more consistent schema adherence
- You're experiencing schema validation issues with tools mode
The tools mode remains the default and works well for most use cases.
### Mode Configuration
The output mode can be configured in the `output` section of any operation:
```yaml
operations:
  - name: analyze_text
    type: map
    prompt: "Analyze the following text..."
    output:
      schema:
        topics: "list[{topic: string, relevance: number}]"
      mode: "structured_output" # or "tools"
    model: gpt-4o-mini
```
If no mode is specified, DocETL defaults to `"tools"` mode for backward compatibility.
## Best Practices
- Keep output fields simple and use string types whenever possible.
- Only use structured fields (like lists and objects) when necessary for downstream analysis or reduce operations.
- If you need to reference structured fields in downstream operations, consider breaking complex structures into multiple simpler operations.
## Schema Optimization
If you find your schema becoming too complex, consider breaking it down into multiple operations. This can improve both the quality of LLM outputs and the manageability of your pipeline.
### Breaking Down Complex Schemas
Instead of:
```yaml
output:
  schema:
    summary: string
    key_points: "list[{point: string, sentiment: string}]"
```
Consider:
```yaml
output:
  schema:
    summary: string
    key_points: "string"
```
In the prompt, you can then say something like: "In your key points, please include the sentiment of each point."
The main reason to use the complex schema is if you need to operate at the level of individual points downstream, for example to resolve them and then reduce over them, as sketched below.
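As a rough sketch of that case, the pipeline below extracts structured points and then aggregates per point. The operation names and prompts are made up, and the `unnest`/`reduce` configuration fields are assumptions about DocETL's operators, so consult the operator documentation for the exact syntax:

```yaml
operations:
  - name: extract_key_points
    type: map
    prompt: "Extract the key points from the document, along with the sentiment of each..."
    output:
      schema:
        key_points: "list[{point: string, sentiment: string}]"

  - name: flatten_points
    type: unnest            # assumed: expands each list element into its own item
    unnest_key: key_points

  - name: summarize_per_point
    type: reduce
    reduce_key: point       # assumed: groups items that share the same point
    prompt: "Summarize the overall sentiment for this point across all documents..."
    output:
      schema:
        point_summary: string
```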
By following these guidelines and best practices, you can create effective schemas that enhance the performance and reliability of your DocETL operations.