Skip to content

Code Operations

Code operations define transformations using Python code rather than LLM prompts. No LLM calls are made. Use them for processing that should be deterministic, math-based, or built on existing Python libraries.

In YAML, code is a string of Python source defining a transform function. In the Python API, pass any callable as code, e.g., a lambda.

Types of Code Operations

Code Map Operation

The Code Map operation applies a Python function to each item in your input data independently.

Example Code Map Operation
- name: extract_keywords
  type: code_map
  code: |
    def transform(doc) -> dict:
        # Your transformation code here
        keywords = doc['text'].lower().split()
        return {
            'keywords': keywords,
            'keyword_count': len(keywords)
        }
import docetl

def extract_keywords(doc) -> dict:
    keywords = doc["text"].lower().split()
    return {"keywords": keywords, "keyword_count": len(keywords)}

frame = docetl.read_json("documents.json")
frame = frame.code_map(code=extract_keywords)
rows = frame.collect()

The code must define a transform function that takes a single document as input and returns a dictionary of transformed values.

Code Reduce Operation

The Code Reduce operation aggregates multiple items into a single result using a Python function.

Example Code Reduce Operation
- name: aggregate_stats
  type: code_reduce
  reduce_key: category
  code: |
    def transform(items) -> dict:
        total = sum(item['value'] for item in items)
        avg = total / len(items)
        return {
            'total': total,
            'average': avg,
            'count': len(items)
        }
import docetl

def aggregate_stats(items) -> dict:
    total = sum(item["value"] for item in items)
    return {"total": total, "average": total / len(items), "count": len(items)}

frame = docetl.read_json("data.json")
frame = frame.code_reduce(reduce_key="category", code=aggregate_stats)
rows = frame.collect()

The transform function for reduce operations takes a list of items as input and returns a single aggregated result.

Code Filter Operation

The Code Filter operation allows you to filter items based on custom Python logic.

Example Code Filter Operation
- name: filter_valid_entries
  type: code_filter  
  code: |
    def transform(doc) -> bool:
        # Return True to keep the document, False to filter it out
        return doc['score'] >= 0.5 and len(doc['text']) > 100
import docetl

frame = docetl.read_json("entries.json")
frame = frame.code_filter(
    code=lambda doc: doc["score"] >= 0.5 and len(doc["text"]) > 100
)
rows = frame.collect()

The transform function should return True for items to keep and False for items to filter out.

Configuration

Required Parameters

  • type: Must be "code_map", "code_reduce", or "code_filter"
  • code: The transform. In YAML, a string of Python source defining a transform function; in the Python API, any callable (or the same string form). For map, the function takes a single document and returns a dictionary. For reduce, it takes a list of documents and returns a single aggregated dictionary. For filter, it takes a single document and returns a boolean indicating whether to keep it.

Optional Parameters

Parameter Description Default
drop_keys List of keys to remove from output (code_map only) None
reduce_key Key(s) to group by (code_reduce only) "_all"
pass_through Pass through unmodified keys from first item in group (code_reduce only) false
concurrent_thread_count The number of threads to start the number of logical CPU cores (os.cpu_count())
limit Maximum number of outputs to produce before stopping Processes all data

The limit parameter behaves differently for each operation type:

  • code_map: Limits the number of input documents to process
  • code_filter: Limits the number of documents that pass the filter (outputs)
  • code_reduce: Limits the number of groups to reduce, selecting the smallest groups first (by document count)