# Tutorial: Mining User Behavior Data with DocETL
This tutorial will guide you through the process of using DocETL to analyze user behavior data from UI logs. We'll create a simple pipeline that extracts key insights and supporting actions from user logs, then summarizes them by country.
## Installation
First, let's install DocETL. Follow the instructions in the installation guide to set up DocETL on your system.
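In most environments this is a standard PyPI install (shown here assuming the published package name is `docetl`; defer to the installation guide if your setup differs):

```bash
pip install docetl
```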
## Setting up API Keys
DocETL uses LiteLLM under the hood, which supports various LLM providers. For this tutorial we'll use OpenAI, since DocETL's tests and existing pipelines are run against OpenAI models.
### Setting up your API Key
Set your OpenAI API key as an environment variable:
```bash
export OPENAI_API_KEY=your_api_key_here
```
Alternatively, you can create a `.env` file in your project directory and add the following line:

```bash
OPENAI_API_KEY=your_api_key_here
```
### OpenAI Dependency
DocETL has been primarily tested with OpenAI's language models and relies heavily on their structured output capabilities. While we aim to support other providers in the future, using OpenAI is currently recommended for the best experience and most reliable results.
If you choose to use a different provider, be aware that you may encounter unexpected behavior or reduced functionality, especially with operations that depend on structured outputs.
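Because DocETL routes model calls through LiteLLM, switching providers is usually just a matter of changing the model string in the pipeline's `default_model` setting. The alternatives below follow LiteLLM's provider-prefix naming convention and are illustrative only:

```yaml
# Default used in this tutorial
default_model: gpt-4o-mini

# Hypothetical alternatives via LiteLLM's naming convention.
# Structured-output support varies by provider, so results may differ.
# default_model: gemini/gemini-1.5-flash
# default_model: ollama/llama3
```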
## Preparing the Data
Organize your user behavior data in a JSON file as a list of objects. Each object should have the following keys: `user_id`, `country`, and `log`. The `log` field contains the user interaction logs.
### Sample Data Structure
```json
[
  {
    "user_id": "user123",
    "country": "USA",
    "log": "[2023-06-15 09:15:23] User opened app\n[2023-06-15 09:16:05] User clicked on 'Products' tab\n[2023-06-15 09:16:30] User viewed product 'Laptop X'\n[2023-06-15 09:18:45] User added 'Laptop X' to cart\n[2023-06-15 09:19:10] User proceeded to checkout\n[2023-06-15 09:25:37] User completed purchase\n42333 more tokens..."
  },
  {
    "user_id": "user456",
    "country": "Canada",
    "log": "[2023-06-15 14:30:12] User launched app\n[2023-06-15 14:31:03] User searched for 'wireless headphones'\n[2023-06-15 14:32:18] User applied price filter\n[2023-06-15 14:33:00] User viewed product 'Headphone Y'\n[2023-06-15 14:38:22] User exited app without purchase\n13238 more tokens..."
  }
]
```
Save this file as `user_logs.json` in your project directory.
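Before running the pipeline, it can be worth sanity-checking the file's shape. A minimal Python check (illustrative, not part of DocETL):

```python
import json

# Every record must carry the three keys the pipeline relies on.
REQUIRED_KEYS = {"user_id", "country", "log"}

with open("user_logs.json") as f:
    records = json.load(f)

for i, record in enumerate(records):
    missing = REQUIRED_KEYS - record.keys()
    assert not missing, f"record {i} is missing keys: {missing}"

print(f"OK: {len(records)} records validated")
```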
## Creating the Pipeline
Now, let's create a DocETL pipeline to analyze this data. We'll use a map-reduce-like approach:
- Map each user log to key insights and supporting actions
- Unnest the insights (sketched after this list)
- Reduce by country to summarize insights and identify common patterns
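To make the unnest step concrete, here is a rough sketch of what it does to one mapped record (the field values are invented for illustration). The map operation attaches a list of insights:

```json
{
  "user_id": "user123",
  "country": "USA",
  "insights": [
    { "insight": "checkout friction", "supporting_actions": ["simplify checkout flow"] },
    { "insight": "search usage", "supporting_actions": ["improve search ranking"] }
  ]
}
```

After unnesting, each insight becomes its own record, with `insight` and `supporting_actions` promoted to the top level alongside the original fields — which is the shape the reduce prompt below expects:

```json
[
  { "user_id": "user123", "country": "USA", "insight": "checkout friction", "supporting_actions": ["simplify checkout flow"] },
  { "user_id": "user123", "country": "USA", "insight": "search usage", "supporting_actions": ["improve search ranking"] }
]
```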
Create a file named `pipeline.yaml` with the following structure:
### Pipeline Structure
- Define the dataset:

```yaml
datasets:
  user_logs:
    type: file
    path: "user_logs.json"
```
- Extract insights (map operation):

```yaml
- name: extract_insights
  type: map
  prompt: |
    Analyze the following user interaction log:
    {{ input.log }}

    Extract 2-3 main insights from this log, each being 1-2 words, to help inform future product development. Consider any difficulties or pain points the user may have had. Also provide 1-2 supporting actions for each insight.
    Return the results as a list of dictionaries, each containing 'insight' and 'supporting_actions' keys.
  output:
    schema:
      insights: "list[{insight: string, supporting_actions: list[string]}]"
```
- Unnest insights (unnest operation):

```yaml
- name: unnest_insights
  type: unnest
  unnest_key: insights
  recursive: true
```
- Summarize by country (reduce operation):

```yaml
- name: summarize_by_country
  type: reduce
  reduce_key: country
  prompt: |
    Summarize the user behavior insights for the country: {{ inputs[0].country }}

    Insights and supporting actions:
    {% for item in inputs %}
    - Insight: {{ item.insight }}
      Supporting actions:
      {% for action in item.supporting_actions %}
      - {{ action }}
      {% endfor %}
    {% endfor %}

    Provide a summary of common insights and notable behaviors of users from this country.
  output:
    schema:
      detailed_summary: string
```
- Define the pipeline steps:

```yaml
pipeline:
  steps:
    - name: analyze_user_logs
      input: user_logs
      operations:
        - extract_insights
        - unnest_insights
        - summarize_by_country
```
- Specify the output (nested under `pipeline`):

```yaml
  output:
    type: file
    path: "country_summaries.json"
```
### Full Pipeline Configuration
```yaml
default_model: gpt-4o-mini

datasets:
  user_logs:
    type: file
    path: "user_logs.json"

operations:
  - name: extract_insights
    type: map
    prompt: |
      Analyze the following user interaction log:
      {{ input.log }}

      Extract 2-3 main insights from this log, each being 1-2 words, to help inform future product development. Consider any difficulties or pain points the user may have had. Also provide 1-2 supporting actions for each insight.
      Return the results as a list of dictionaries, each containing 'insight' and 'supporting_actions' keys.
    output:
      schema:
        insights: "list[{insight: string, supporting_actions: list[string]}]"

  - name: unnest_insights
    type: unnest
    unnest_key: insights
    recursive: true

  - name: summarize_by_country
    type: reduce
    reduce_key: country
    prompt: |
      Summarize the user behavior insights for the country: {{ inputs[0].country }}

      Insights and supporting actions:
      {% for item in inputs %}
      - Insight: {{ item.insight }}
        Supporting actions:
        {% for action in item.supporting_actions %}
        - {{ action }}
        {% endfor %}
      {% endfor %}

      Provide a summary of common insights and notable behaviors of users from this country.
    output:
      schema:
        detailed_summary: string

pipeline:
  steps:
    - name: analyze_user_logs
      input: user_logs
      operations:
        - extract_insights
        - unnest_insights
        - summarize_by_country

  output:
    type: file
    path: "country_summaries.json"
```
## Running the Pipeline
To execute the pipeline, run the following command in your terminal:
```bash
docetl run pipeline.yaml
```
This will process the user logs, extract key insights and supporting actions, and generate summaries for each country, saving the results in `country_summaries.json`.
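The output file contains one record per reduce key, so its shape will look roughly like this (the summaries are invented placeholders; actual content depends on your data and the model):

```json
[
  {
    "country": "USA",
    "detailed_summary": "Users in the USA frequently ..."
  },
  {
    "country": "Canada",
    "detailed_summary": "Users in Canada often ..."
  }
]
```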
## Further Questions
### What if I want to reduce by insights or an LLM-generated field?
You can modify the reduce operation to use any field as the reduce key, including LLM-generated fields from earlier operations. Simply change the `reduce_key` in the `summarize_by_country` operation to the desired field. Note that LLM-generated fields may need entity resolution (e.g., merging near-duplicate insight strings so they group together), which DocETL can perform for you during the optimization process (discussed later).
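For example, a reduce operation that groups on the LLM-generated `insight` field produced by the map and unnest steps might look like this sketch (the operation name, prompt wording, and `insight_summary` field are illustrative):

```yaml
- name: summarize_by_insight
  type: reduce
  reduce_key: insight
  prompt: |
    Summarize how users exhibit the insight: {{ inputs[0].insight }}

    Supporting actions observed:
    {% for item in inputs %}
    {% for action in item.supporting_actions %}
    - {{ action }}
    {% endfor %}
    {% endfor %}
  output:
    schema:
      insight_summary: string
```

As noted above, near-duplicate insight strings may first need entity resolution so that equivalent insights land in the same group.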
### How do I know what pipeline configuration to write? Can't I do this all in one map operation?
While it's possible to perform complex operations in a single map step, breaking the process into multiple steps often leads to more maintainable and flexible pipelines. To learn more about optimizing your pipeline configuration, read on to discover DocETL's optimizer, which is invoked using `docetl build` instead of `docetl run`.
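Invoked the same way as `docetl run`:

```bash
docetl build pipeline.yaml
```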