OTEL plugin marks child-context spans as faults when they suspend (timed wait misclassified as FAILED) #491

Description

@yaythomas

Summary

In the experimental OpenTelemetry plugin (the durable SDK plugins parameter is still provisional and emits a FutureWarning), the instrumentation records a false fault whenever a durable execution suspends inside a child context. For example, a child context that calls context.wait(...) produces a child-context span that X-Ray marks as a fault, with a recorded exception named TimedSuspendExecution. The execution itself behaves correctly. A suspend is normal durable control flow, not a failure, so it should not produce a fault.

Impact

Every child context that contains a wait (or any other suspend) emits a false fault into the trace. This inflates error and fault counts in CloudWatch and X-Ray and pollutes the service map. Concurrent branches (map and parallel) run as child contexts, so they are affected the same way. Top-level waits and plain steps are not affected.

Reproduction

  1. Deploy a durable function with the OTEL plugin, the ADOT layer, and X-Ray active tracing.
  2. Use a handler with a child context that waits:
@durable_with_child_context
def child_workflow(ctx):
    ctx.step(do_work(), name="child-step-1")
    ctx.wait(Duration.from_seconds(5), name="child-wait")
    return ctx.step(do_more(), name="child-step-2")

@durable_execution(plugins=[OtelPlugin()])
def handler(event, context):
    return context.run_in_child_context(child_workflow(), name="child-context")
  3. Invoke once (a qualified version is required for durable functions) and open the trace in CloudWatch Traces.

Observed: the child-context span is marked fault: true with a recorded exception TimedSuspendExecution.
Expected: the child-context span closes cleanly with no fault, the same way the top-level invocation and wait spans do.

Root cause

The defect is in the core SDK instrumentation plumbing, not in the OTEL plugin. The plugin correctly renders whatever outcome it is handed.

  1. In state.py, wrap_user_function catches the suspend and reports it to plugins as an error whose type is the name of the concrete subclass:
except SuspendExecution as e:
    self._plugin_executor.on_user_function_end(
        start_info,
        ErrorObject(type=type(e).__name__, ...),  # "TimedSuspendExecution"
    )
    raise
  2. In plugin.py, UserFunctionOutcome.from_error classifies the outcome by matching the base class name only:
elif error.type == SuspendExecution.__name__:  # "SuspendExecution"
    return cls(cls.PENDING)
else:
    return cls(cls.FAILED)

"TimedSuspendExecution" != "SuspendExecution", so the outcome falls through to FAILED. Since TimedSuspendExecution is the type that every timed wait raises, the PENDING branch is never reached for the common case. The OTEL plugin then maps FAILED to set_status(ERROR) plus record_exception, which X-Ray shows as a fault.
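The mismatch can be reproduced in isolation. The following is a minimal sketch using stand-in classes, not the SDK's real definitions: ErrorObject is reduced to a single field, Outcome stands in for UserFunctionOutcome, and the leading `error is None` branch is an assumption inferred from the `elif` in the excerpt above.

```python
from dataclasses import dataclass
from enum import Enum


class SuspendExecution(Exception):
    """Stand-in for the SDK's base suspend exception (hypothetical definition)."""


class TimedSuspendExecution(SuspendExecution):
    """Stand-in for the subclass that timed waits raise."""


@dataclass
class ErrorObject:
    # Simplified stand-in: only the type name string is modeled here.
    type: str


class Outcome(Enum):
    SUCCEEDED = "SUCCEEDED"
    PENDING = "PENDING"
    FAILED = "FAILED"


def from_error(error):
    """Name-based classification, mirroring the excerpt above."""
    if error is None:  # assumed first branch, not shown in the excerpt
        return Outcome.SUCCEEDED
    elif error.type == SuspendExecution.__name__:
        return Outcome.PENDING
    else:
        return Outcome.FAILED


def classify(exc):
    # Same conversion as wrap_user_function: keep only the concrete class name.
    return from_error(ErrorObject(type=type(exc).__name__))


base_outcome = classify(SuspendExecution())
timed_outcome = classify(TimedSuspendExecution())
print(base_outcome.name, timed_outcome.name)
```

In this sketch only the base class itself reaches PENDING; the subclass falls through to FAILED, which matches the behavior observed in the trace.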

The top-level path avoids this because a top-level suspend is handled at the invocation level and resolves to InvocationStatus.PENDING. Only the wrap_user_function path (child contexts, branches) hits the misclassification.

Why it is name-based

ErrorObject is the serialized form of an error and carries only a type name string, so from_error cannot do an isinstance check against the class hierarchy. Matching the base class name alone does not recognize subclasses such as TimedSuspendExecution.
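As a small illustration of what is lost in serialization, the sketch below compares an isinstance check on a live exception with an equality check on its type name. The two exception classes are stand-ins declared only for this example.

```python
class SuspendExecution(Exception):
    """Stand-in base class (hypothetical definition)."""


class TimedSuspendExecution(SuspendExecution):
    """Stand-in subclass."""


live = TimedSuspendExecution("wait 5s")

# With the live exception, the class hierarchy is still available.
is_suspend_live = isinstance(live, SuspendExecution)

# After reducing it to a name string, only exact-name equality is possible.
type_name = type(live).__name__
is_suspend_by_name = type_name == SuspendExecution.__name__

print(is_suspend_live, is_suspend_by_name)
```

The first check succeeds and the second does not, which is why the decision is better made where the exception object still exists.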

Suggested fix

Decide the outcome where the live exception still exists (the except SuspendExecution clause), rather than reconstructing class identity from a name string downstream. A suspend should be reported with a PENDING outcome and no error. This is an internal-only change. The public plugin interface already exposes UserFunctionEndInfo.outcome, and PENDING already exists, so no public API change is needed. One concrete shape:

  • Add an internal PluginExecutor.on_user_function_suspend(start_info) that dispatches a UserFunctionEndInfo with outcome=PENDING and error=None through the existing public on_user_function_end.
  • Route the except SuspendExecution branch in wrap_user_function to it.
  • Simplify from_error so that a None error maps to SUCCEEDED and any other error maps to FAILED, removing the fragile name match.
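One possible shape of the three changes above, as a self-contained sketch. Everything here is a simplified stand-in rather than the SDK's actual code: the field layout of UserFunctionEndInfo and ErrorObject, the `_dispatch` helper, the `RecordingPlugin` test double, and the free-function form of wrap_user_function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SuspendExecution(Exception):
    """Stand-in base suspend exception (hypothetical definition)."""


class TimedSuspendExecution(SuspendExecution):
    """Stand-in for the subclass raised by timed waits."""


class Outcome(Enum):
    SUCCEEDED = "SUCCEEDED"
    PENDING = "PENDING"
    FAILED = "FAILED"


@dataclass
class ErrorObject:
    type: str
    message: str = ""


@dataclass
class UserFunctionEndInfo:
    outcome: Outcome
    error: Optional[ErrorObject] = None


def from_error(error):
    # Simplified classification: no name matching is needed any more.
    return Outcome.SUCCEEDED if error is None else Outcome.FAILED


class PluginExecutor:
    def __init__(self, plugins):
        self._plugins = plugins

    def on_user_function_end(self, start_info, error):
        self._dispatch(start_info, UserFunctionEndInfo(from_error(error), error))

    def on_user_function_suspend(self, start_info):
        # A suspend is reported as PENDING with no error attached.
        self._dispatch(start_info, UserFunctionEndInfo(Outcome.PENDING, None))

    def _dispatch(self, start_info, end_info):
        # Plugins still receive the event through their existing end hook.
        for plugin in self._plugins:
            plugin.on_user_function_end(start_info, end_info)


class RecordingPlugin:
    """Hypothetical plugin that records the outcomes it is handed."""

    def __init__(self):
        self.outcomes = []

    def on_user_function_end(self, start_info, end_info):
        self.outcomes.append(end_info.outcome)


def wrap_user_function(executor, func, start_info="child-context"):
    def wrapped(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except SuspendExecution:
            # Decide the outcome while the live exception is still in hand.
            executor.on_user_function_suspend(start_info)
            raise
        except Exception as e:
            executor.on_user_function_end(
                start_info, ErrorObject(type(e).__name__, str(e))
            )
            raise
        executor.on_user_function_end(start_info, None)
        return result

    return wrapped


def child():
    raise TimedSuspendExecution("wait 5s")


plugin = RecordingPlugin()
try:
    wrap_user_function(PluginExecutor([plugin]), child)()
except SuspendExecution:
    pass  # the suspend still propagates to the caller, as before

print(plugin.outcomes)
```

In this sketch the suspend is still re-raised, so control flow is unchanged; only the outcome handed to plugins differs. A plugin that maps FAILED to an error status would no longer see FAILED for a timed wait.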

Affected versions

The plumbing dates to the introduction of the plugin interface (PR #371). The behavior is present on current main.

Metadata

Labels

otel-plugin: related to the otel-plugin package

Type

No fields configured for Bug.

Projects

Status
In review
