Pointing to External Data and Custom Parsing
In DocETL, you have full control over your dataset JSONs. These JSONs typically contain objects with key-value pairs, where you can reference external files that you want to process in your pipeline. This referencing mechanism, which we call "pointing", allows DocETL to locate and process external files that require special handling before they can be used in your main pipeline.
Why Use Custom Parsing?
Consider these scenarios where custom parsing of referenced files is beneficial:
- Your dataset JSON references Excel spreadsheets containing sales data.
- You have entries pointing to scanned receipts in PDF format that need OCR processing.
- You want to extract text from Word documents or PowerPoint presentations by referencing their file locations.
In these cases, custom parsing transforms your raw external data into a format that DocETL can process within your pipeline. Pointing means including references or paths to external files in your dataset JSON instead of embedding their full content. DocETL follows these references to locate each file and applies the custom parsing you configured during pipeline execution.
Dataset JSON Example
Let's look at a typical dataset JSON file that you might create:
[
{ "id": 1, "excel_path": "sales_data/january_sales.xlsx" },
{ "id": 2, "excel_path": "sales_data/february_sales.xlsx" }
]
In this example, you've specified paths to Excel files. DocETL will use these paths to locate and process the external files. However, without custom parsing, DocETL wouldn't know how to handle the contents of these files. This is where parsing tools come in handy.
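A dataset JSON like the one above does not have to be written by hand. The following is a minimal sketch that builds it from a directory of spreadsheets. The helper names build_dataset and write_dataset are illustrative and not part of DocETL, and the directory layout is an assumption.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_dataset(data_dir: str, pattern: str = "*.xlsx", key: str = "excel_path") -> List[Dict]:
    """Return one record per matching file, each pointing at the file by path."""
    paths = sorted(Path(data_dir).glob(pattern))
    # ids start at 1 to match the example dataset above
    return [{"id": i, key: p.as_posix()} for i, p in enumerate(paths, start=1)]


def write_dataset(records: List[Dict], out_path: str) -> None:
    """Write the records as a JSON array of objects."""
    Path(out_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
```

Calling write_dataset(build_dataset("sales_data"), "sales_data/sales_paths.json") would produce a file in the same shape as the example, with one entry per spreadsheet found.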
Custom Parsing in Action
1. Configuration
To use custom parsing, you need to define parsing tools in your DocETL configuration file. Here's an example:
parsing_tools:
  - name: top_products_report
    function_code: |
      def top_products_report(document: Dict) -> List[Dict]:
          import pandas as pd
          # Read the Excel file
          filename = document["excel_path"]
          df = pd.read_excel(filename)
          # Calculate total sales
          total_sales = df['Sales'].sum()
          # Find top 500 products by sales
          top_products = df.groupby('Product')['Sales'].sum().nlargest(500)
          # Calculate month-over-month growth
          df['Date'] = pd.to_datetime(df['Date'])
          monthly_sales = df.groupby(df['Date'].dt.to_period('M'))['Sales'].sum()
          mom_growth = monthly_sales.pct_change().fillna(0)
          # Prepare the analysis report
          report = [
              f"Total Sales: ${total_sales:,.2f}",
              "\nTop 500 Products by Sales:",
              top_products.to_string(),
              "\nMonth-over-Month Growth:",
              mom_growth.to_string()
          ]
          # Return a list of dicts representing the output
          # The input document will be merged into each output doc,
          # so we can access all original fields from the input doc.
          return [{"sales_analysis": "\n".join(report)}]

datasets:
  sales_reports:
    type: file
    source: local
    path: "sales_data/sales_paths.json"
    parsing:
      - function: top_products_report
  receipts:
    type: file
    source: local
    path: "receipts/receipt_paths.json"
    parsing:
      - input_key: pdf_path
        function: paddleocr_pdf_to_string
        output_key: receipt_text
        ocr_enabled: true
        lang: "en"
In this configuration:
- We define a custom top_products_report function for Excel files.
- We use the built-in paddleocr_pdf_to_string parser for PDF files.
- We apply these parsing tools to the external files referenced in the respective datasets.
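The month-over-month growth step in top_products_report relies on pandas' pct_change followed by fillna(0). To make the arithmetic concrete, here is a dependency-free sketch of the same calculation. The function name mom_growth and the sample figures are made up for illustration, and the sketch assumes no month has zero sales.

```python
from typing import List


def mom_growth(monthly_sales: List[float]) -> List[float]:
    """Fractional change from each month to the next.

    The first month has no predecessor, so it is reported as 0.0,
    which mirrors what fillna(0) does to the leading missing value.
    """
    if not monthly_sales:
        return []
    growth = [0.0]
    for prev, curr in zip(monthly_sales, monthly_sales[1:]):
        # Assumes prev != 0; a zero month would need special handling.
        growth.append((curr - prev) / prev)
    return growth


# Sales of 100, then 150, then 120: +50% followed by -20%.
example = mom_growth([100.0, 150.0, 120.0])
```

Values are fractions rather than percentages, so 0.5 means a 50% increase over the previous month.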
2. Pipeline Integration
Once you've defined your parsing tools and datasets, you can use the processed data in your pipeline:
pipeline:
  steps:
    - name: process_sales
      input: sales_reports
      operations:
        - summarize_sales
    - name: process_receipts
      input: receipts
      operations:
        - extract_receipt_info
This pipeline will use the parsed data from both Excel files and PDFs for further processing.
How Data Gets Parsed and Formatted
When you run your DocETL pipeline, the parsing tools you've specified in your configuration file are applied to the external files referenced in your dataset JSONs. Here's what happens:
- DocETL reads your dataset JSON file.
- For each entry in the dataset, it looks at the parsing configuration you've specified.
- It applies the appropriate parsing function to the file path provided in the dataset JSON.
- The parsing function processes the file and returns the data in a format DocETL can work with (a list of strings for the built-in tools, or a list of dicts for custom functions).
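The steps above can be pictured with a small sketch. This is a simplified model and not DocETL's actual implementation: the helper names load_records, apply_parsing and stub_parser are hypothetical, and letting output fields override input fields on a name clash is an assumption.

```python
import json
from typing import Callable, Dict, List

ParseFn = Callable[[Dict], List[Dict]]


def load_records(dataset_path: str) -> List[Dict]:
    """Read the dataset JSON file (a JSON array of objects)."""
    with open(dataset_path, encoding="utf-8") as f:
        return json.load(f)


def apply_parsing(records: List[Dict], parse_fn: ParseFn) -> List[Dict]:
    """Run parse_fn on every record and merge each output into a copy of its input."""
    parsed = []
    for record in records:
        for output in parse_fn(record):
            # Original fields stay available next to the parsed ones.
            parsed.append({**record, **output})
    return parsed


def stub_parser(document: Dict) -> List[Dict]:
    # Stand-in for a real parsing tool; it does not open the file.
    return [{"sales_analysis": f"report for {document['excel_path']}"}]


rows = apply_parsing(
    [{"id": 1, "excel_path": "sales_data/january_sales.xlsx"}], stub_parser
)
# rows[0] keeps "id" and "excel_path" and gains "sales_analysis"
```

Because a parsing function returns a list, one input entry can expand into several documents, for example one per sheet or per page.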
Let's look at how this works for our earlier examples:
Excel Files (using top_products_report)
For an Excel file like "sales_data/january_sales.xlsx":
- The top_products_report function reads the Excel file.
- It processes the sales data and generates a report of top-selling products.
- The output might look like this:
Top Products Report - January 2023
1. Widget A - 1500 units sold
2. Gadget B - 1200 units sold
3. Gizmo C - 950 units sold
4. Doohickey D - 800 units sold
5. Thingamajig E - 650 units sold
...
Total Revenue: $245,000
Best Selling Category: Electronics
PDF Files (using paddleocr_pdf_to_string)
For a PDF file like "receipts/receipt001.pdf":
- The paddleocr_pdf_to_string function reads the PDF file.
- It uses PaddleOCR to perform optical character recognition on each page.
- The function combines the extracted text from all pages into a single string. The output might look like this:
RECEIPT
Store: Example Store
Date: 2023-05-15
Items:
1. Product A - $10.99
2. Product B - $15.50
3. Product C - $7.25
4. Product D - $22.00
Subtotal: $55.74
Tax (8%): $4.46
Total: $60.20
Payment Method: Credit Card
Card Number: **** **** **** 1234
Thank you for your purchase!
This parsed and formatted data is then passed to the respective operations in your pipeline for further processing.
Running the Pipeline
Once you've set up your pipeline configuration file with the appropriate parsing tools and dataset definitions, you can run your DocETL pipeline. Here's how:
- Ensure you have DocETL installed in your environment.
- Open a terminal or command prompt.
- Navigate to the directory containing your pipeline configuration file.
- Run the following command:
docetl run pipeline.yaml
Replace pipeline.yaml with the name of your pipeline file if it's different.
When you run this command:
- DocETL reads your pipeline file.
- It processes each dataset using the specified parsing tools.
- The pipeline steps are executed in the order you defined.
- Any operations you've specified (like summarize_sales or extract_receipt_info) are applied to the parsed data.
- The results are saved according to your output configuration.
Built-in Parsing Tools
DocETL provides several built-in parsing tools to handle common file formats and data processing tasks. These tools can be used directly in your configuration by specifying their names in the function field of your parsing tools configuration. Here's an overview of the available built-in parsing tools:
xlsx_to_string
Convert an Excel file to a string representation or a list of string representations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the xlsx file. | required |
orientation | str | Either "row" or "col" for cell arrangement. | 'col' |
col_order | Optional[List[str]] | List of column names to specify the order. | None |
doc_per_sheet | bool | If True, return a list of strings, one per sheet. | False |
Returns:
Type | Description |
---|---|
List[str] | String representation(s) of the Excel file content. |
Read the content of a text file and return it as a list of strings (only one element).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the txt or md file. | required |
Returns:
Type | Description |
---|---|
List[str] | Content of the file as a list of strings. |
Extract text from a Word document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the docx file. | required |
Returns:
Type | Description |
---|---|
List[str] | Extracted text from the document. |
Transcribe speech from an audio file to text using Whisper model via litellm. If the file is larger than 25 MB, it's split into 10-minute chunks with 30-second overlap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the mp3 or mp4 file. | required |
Returns:
Type | Description |
---|---|
List[str] | Transcribed text. |
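To illustrate the chunking described for large audio files, here is a minimal sketch that computes chunk boundaries for 10-minute chunks with a 30-second overlap. The function name chunk_bounds is hypothetical, and advancing by the chunk length minus the overlap is an assumption about how the overlap is realized, not a statement about the library's code.

```python
from typing import List, Tuple


def chunk_bounds(
    duration_s: float, chunk_s: float = 600.0, overlap_s: float = 30.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks covering the audio."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    step = chunk_s - overlap_s  # each chunk starts 30 s before the previous one ends
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += step
    return bounds


# A 25-minute recording (1500 s) yields three chunks:
# [(0.0, 600.0), (570.0, 1170.0), (1140.0, 1500.0)]
bounds_25_min = chunk_bounds(1500.0)
```

The overlap means a sentence that straddles a boundary appears in full in at least one chunk, at the cost of some duplicated text between neighboring transcripts.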
Extract text from a PowerPoint presentation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the pptx file. | required |
doc_per_slide | bool | If True, return each slide as a separate document. If False, return the entire presentation as one document. | False |
Returns:
Type | Description |
---|---|
List[str] | Extracted text from the presentation. If doc_per_slide is True, each string in the list represents a single slide. Otherwise, the list contains a single string with all slides' content. |
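The effect of a flag like doc_per_slide on the return value can be sketched as follows. The helper collect_documents is hypothetical, and the blank-line separator used when joining is an assumption.

```python
from typing import List


def collect_documents(parts: List[str], one_per_part: bool = False) -> List[str]:
    """Return one string per part, or a single-element list joining all parts."""
    if one_per_part:
        return list(parts)
    # Separator is an illustrative choice, not necessarily what the tool uses.
    return ["\n\n".join(parts)]


slides = ["Title slide", "Agenda", "Results"]
per_slide = collect_documents(slides, one_per_part=True)  # three documents
combined = collect_documents(slides)                      # one document
```

Either way the result is a list of strings, so downstream operations see the same type whether the file yields one document or many.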
Note to developers: We used this documentation as a reference.
This function uses Azure Document Intelligence to extract text from documents. To use this function, you need to set up an Azure Document Intelligence resource:
- Create an Azure account if you don't have one
- Set up a Document Intelligence resource in the Azure portal
- Once created, find the resource's endpoint and key in the Azure portal
- Set these as environment variables:
  - DOCUMENTINTELLIGENCE_API_KEY: Your Azure Document Intelligence API key
  - DOCUMENTINTELLIGENCE_ENDPOINT: Your Azure Document Intelligence endpoint URL
The function will use these credentials to authenticate with the Azure service. If the environment variables are not set, the function will raise a ValueError.
The Azure Document Intelligence client is then initialized with these credentials. It sends the document (either as a file or URL) to Azure for analysis. The service processes the document and returns structured information about its content.
This function then extracts the text content from the returned data, applying any specified formatting options (like including line numbers or font styles). The extracted text is returned as a list of strings, with each string representing either a page (if doc_per_page is True) or the entire document.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the file to be analyzed or URL of the document if use_url is True. | required |
use_url | bool | If True, treat filename as a URL. Defaults to False. | False |
include_line_numbers | bool | If True, include line numbers in the output. Defaults to False. | False |
include_handwritten | bool | If True, include handwritten text in the output. Defaults to False. | False |
include_font_styles | bool | If True, include font style information in the output. Defaults to False. | False |
include_selection_marks | bool | If True, include selection marks in the output. Defaults to False. | False |
doc_per_page | bool | If True, return each page as a separate document. Defaults to False. | False |
Returns:
Type | Description |
---|---|
List[str] | Extracted text from the document. If doc_per_page is True, each string in the list represents a single page. Otherwise, the list contains a single string with all pages' content. |
Raises:
Type | Description |
---|---|
ValueError | If DOCUMENTINTELLIGENCE_API_KEY or DOCUMENTINTELLIGENCE_ENDPOINT environment variables are not set. |
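The credential check described above can be sketched like this. It is an illustration of the documented behavior (raise ValueError when either variable is missing), not the function's actual code. The helper name load_azure_di_credentials and the error message wording are hypothetical.

```python
import os
from typing import Mapping, Tuple

REQUIRED_VARS = ("DOCUMENTINTELLIGENCE_ENDPOINT", "DOCUMENTINTELLIGENCE_API_KEY")


def load_azure_di_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, key), or raise ValueError if either variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variable(s): " + ", ".join(missing))
    return env["DOCUMENTINTELLIGENCE_ENDPOINT"], env["DOCUMENTINTELLIGENCE_API_KEY"]
```

Running a check like this before starting a long pipeline surfaces a missing key immediately instead of partway through the run.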
paddleocr_pdf_to_string
Extract text and image information from a PDF file using PaddleOCR for image-based PDFs.
Note: this is very slow!!
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | str | Path to the input PDF file. | required |
doc_per_page | bool | If True, return a list of strings, one per page. If False, return a single string. | False |
ocr_enabled | bool | Whether to enable OCR for image-based PDFs. | True |
lang | str | Language of the PDF file. | 'en' |
Returns:
Type | Description |
---|---|
List[str] | Extracted content as a list of formatted strings. |
Using Function Arguments with Parsing Tools
When using parsing tools in your DocETL configuration, you can pass additional arguments to the parsing functions.
For example, when using the xlsx_to_string parsing tool, you can specify options like the orientation of the data, the order of columns, or whether to process each sheet separately. Here's an example of how to use such kwargs in your configuration:
datasets:
  my_sales:
    type: file
    source: local
    path: "sales_data/sales_paths.json"
    parsing:
      - input_key: excel_path
        function: xlsx_to_string
        output_key: sales_text  # illustrative name for the field that receives the parsed text
        orientation: row
        col_order: ["Date", "Product", "Quantity", "Price"]
        doc_per_sheet: true
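One way to picture how such arguments reach the parsing function: keys that identify the function and its input and output fields are set aside, and the remaining keys are passed along as keyword arguments. The sketch below models that idea only. The helper split_parsing_entry and the reserved-key list are assumptions, not DocETL's code.

```python
from typing import Any, Dict, Tuple

# Keys assumed to describe the entry itself rather than the function's options.
RESERVED_KEYS = ("function", "input_key", "output_key")


def split_parsing_entry(entry: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Separate the function name from the keyword arguments in one parsing entry."""
    kwargs = {k: v for k, v in entry.items() if k not in RESERVED_KEYS}
    return entry["function"], kwargs


name, kwargs = split_parsing_entry({
    "input_key": "excel_path",
    "function": "xlsx_to_string",
    "orientation": "row",
    "doc_per_sheet": True,
})
# name is "xlsx_to_string"; kwargs is {"orientation": "row", "doc_per_sheet": True}
```

Under this model, a misspelled option name would be forwarded as an unexpected keyword argument, so option names should match the parameter tables above exactly.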
Contributing Built-in Parsing Tools
While DocETL provides several built-in parsing tools, the community can always benefit from additional utilities. If you've developed a parsing tool that you think could be useful for others, consider contributing it to the DocETL repository. Here's how you can add new built-in parsing utilities:
- Fork the DocETL repository on GitHub.
- Clone your forked repository to your local machine.
- Navigate to the docetl/parsing_tools.py file.
- Add your new parsing function to this file. The function should also be added to the PARSING_TOOLS dictionary.
- Update the documentation in the function's docstring.
- Create a pull request to merge your changes into the main DocETL repository.
Guidelines for Contributing Parsing Tools
When contributing a new parsing tool, make sure it follows these guidelines:
- The function should have a clear, descriptive name.
- Include comprehensive docstrings explaining the function's purpose, parameters, and return value. The return value should be a list of strings.
- Handle potential errors gracefully and provide informative error messages.
- If your parser requires additional dependencies, make sure to mention them in the pull request.
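As a rough illustration of these guidelines, a contributed tool might look like the sketch below. The function tsv_to_string is a made-up example, and the PARSING_TOOLS dictionary shown here is a stand-in whose shape (name mapped to function) is assumed rather than copied from docetl/parsing_tools.py.

```python
import csv
from typing import Callable, Dict, List


def tsv_to_string(filename: str) -> List[str]:
    """Read a tab-separated file and return its content as a single-element list.

    Args:
        filename: Path to the tsv file.

    Returns:
        List[str]: One string, with cells joined by " | " and rows by newlines.

    Raises:
        FileNotFoundError: If the file does not exist.
    """
    with open(filename, newline="", encoding="utf-8") as f:
        rows = [" | ".join(row) for row in csv.reader(f, delimiter="\t")]
    return ["\n".join(rows)]


# Stand-in registry for illustration; the real dictionary lives in
# docetl/parsing_tools.py and its exact shape is assumed here.
PARSING_TOOLS: Dict[str, Callable[..., List[str]]] = {
    "tsv_to_string": tsv_to_string,
}
```

This example only needs the standard library, so there would be no extra dependencies to mention in the pull request.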
Creating Custom Parsing Tools
If the built-in tools don't meet your needs, you can create your own custom parsing tools. Here's how:
- Define your parsing function in the parsing_tools section of your configuration.
- Ensure your function takes an item (dict) as input and returns a list of items (dicts).
- Use your custom parser in the parsing section of your dataset configuration.
For example:
parsing_tools:
  - name: my_custom_parser
    function_code: |
      def my_custom_parser(item: Dict) -> List[Dict]:
          # Your custom parsing logic here; build processed_data,
          # a dict holding the fields you parsed from the item
          return [processed_data]

datasets:
  my_dataset:
    type: file
    source: local
    path: "data/paths.json"
    parsing:
      - function: my_custom_parser
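To make the function contract concrete (one dict in, a list of dicts out), here is a small runnable sketch that could serve as the body of function_code. The field names txt_path, paragraph_index and paragraph are hypothetical and chosen only for this example.

```python
from typing import Dict, List


def split_into_paragraphs(item: Dict) -> List[Dict]:
    """Read the text file at item["txt_path"] and emit one dict per non-empty paragraph."""
    with open(item["txt_path"], encoding="utf-8") as f:
        text = f.read()
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"paragraph_index": i, "paragraph": p} for i, p in enumerate(paragraphs)
    ]
```

Since the function only needs a dict with the expected key, it can be tested on a sample file outside of any pipeline before being pasted into the configuration.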