Skip to content

Detect CRASHED Actors during Checkpoint and Restore#353

Open
Zoe Zhao (zoez7) wants to merge 4 commits into
agent-substrate:mainfrom
zoez7:crashed
Open

Detect CRASHED Actors during Checkpoint and Restore#353
Zoe Zhao (zoez7) wants to merge 4 commits into
agent-substrate:mainfrom
zoez7:crashed

Conversation

@zoez7

@zoez7 Zoe Zhao (zoez7) commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Related to #292 and #119

Changes:

  1. ResumeActor:
    1. During ResumeActor we'll put the actor in CRASHED state if the Resume failed due to any errors related to the snapshot itself, which includes:
      1. Missing or corrupted snapshot when downloading from GCS
      2. Runsc command fails.
    2. All other types of errors will not put the Actor in CRASHED state.
  2. SuspendActor or PauseActor
    1. During SuspendActor or PauseActor we'll put the actor in CRASHED state if we cannot produce a valid snapshot, which includes:
      1. cannot connect to Ateom after retries,
      2. runsc command fails.
      3. cannot upload the snapshot to GCS (during suspend) ; or cannot save snapshot locally (during pause).
    2. All other types of errors will not put the Actor in CRASHED state.

@zoez7 Zoe Zhao (zoez7) force-pushed the crashed branch 8 times, most recently from 2dcac0b to 11ae3bf Compare June 30, 2026 00:01
@zoez7 Zoe Zhao (zoez7) changed the title Add CRASHED state to Actor Detect CRASHED Actors during Checkpoint and Restore Jun 30, 2026
@zoez7 Zoe Zhao (zoez7) marked this pull request as ready for review June 30, 2026 00:04
Comment thread internal/ateerrors/ateerrors.go Outdated
},
)
if derr != nil {
return status.Error(grpcCode, err.Error())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it silently downgrades to a plain status without ErrorInfo. WithDetails on *epb.ErrorInfo essentially never fails in practice, but if it ever will, the reason would silently disappear and a real crash would be misclassified as a transient error. An slog.Error on the fallback would catch any regression that breaks the round-trip.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1183ed3 . Added slog.Err and appended the error reason to the message in case WithDetails fail.

Comment thread cmd/ateapi/internal/controlapi/crash.go Outdated
}
actor.Status = ateapipb.Actor_STATUS_CRASHED
// TODO(zoezhao): Mark the worker as unhealthy if needed.
actor.AteomPodNamespace = ""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add TODO, that we need to keep the machine where the actor was running if CRASH came from the resume of PAUSED state. The #119 has a future concept of ate actor dump that might can recover some data of actor when the image was stored locally.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1183ed3 . Added TODOs to handle bad worker state as well as preserving node info.

return err
}

if actor.GetStatus() == ateapipb.Actor_STATUS_CRASHED {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Julian Gutierrez Oschmann (@juli4n) is this is a pattern you want to keep to check the state before transition? We discussed similar behavior for other APIs, like pause already suspended actor.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just a adhoc change for the CRASHED state, but I think a better pattern would be to add a CheckPrerequisite() step here: https://github.com/agent-substrate/substrate/blame/b755ae7388814f05418169f986cda0921cd48b03/cmd/ateapi/internal/controlapi/workflow.go#L42 to add validations like this, wdyt?

return state.Actor.GetStatus() == ateapipb.Actor_STATUS_PAUSED, nil
}
func (s *CallAteletPauseStep) Execute(ctx context.Context, input *PauseInput, state *PauseState) error {
if state.Actor.GetStatus() != ateapipb.Actor_STATUS_PAUSING {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

if state.Actor.GetAteomPodNamespace() == "" {
return fmt.Errorf("actor is in PAUSING state but has no active worker")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Julian Gutierrez Oschmann (@juli4n) - I think this is something that I discussed with you last time when I developed this code. Is it error or we need to move to CRASH state here too.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this should also result in a CRASHED actor, because if we are trying to Pause an actor, and found out we don't know which Ateom it is running on, either

  1. The Actor is actually running on some Ateom, but we are not able to snapshot, in which case we have lost Actor's dirty state, and the only thing user can do is to revert to a previous snapshot.
  2. The Actor was not running on any Ateom, but somehow is in Running or Suspending state. Running actor commit will restore it to SUSPENDED.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1183ed3 . Crashed the Actor if we are trying to suspend or pause it, but couldn't find the worker assignment.

Comment thread cmd/atelet/main.go Outdated
})
if err != nil {
return nil, fmt.Errorf("while calling ateom.CheckpointWorkload: %w", err)
return nil, ateerrors.NewGRPCError(codes.DataLoss, ateerrors.ErrReasonCrashActor, err)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it always supposed to be crashActor?

Comment thread cmd/atelet/main.go Outdated
case ateletpb.CheckpointType_CHECKPOINT_TYPE_EXTERNAL:
if err := s.uploadExternalCheckpoint(ctx, req, checkpointDir, sandboxRec); err != nil {
return nil, err
// TODO: If we can cache the snapshot locally when it fails to upload, we won't have to crash the Actor right away.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case. Could you please create an issue describing this case, otherwise we will never back to it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #362

Comment thread internal/ateompath/ateompath.go Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. minor comments.
  2. could you please update microvm ateom to support same behavior as for gvisor.
  3. E2E is failing, i scheduled retry and it still fails. Could you please run e2e on your local environment.

… terminal errors in atelet with local sentinels
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants