Detect `CRASHED` Actors during Checkpoint and Restore by zoez7 · Pull Request #353 · agent-substrate/substrate

Zoe Zhao (zoez7) · 2026-06-29T21:55:35Z

Related to #292 and #119

Changes:

ResumeActor:
1. During ResumeActor, we put the Actor in the CRASHED state if the resume fails because of an error related to the snapshot itself, which includes:
  1. The snapshot is missing or corrupted when downloading from GCS.
  2. The runsc command fails.
2. All other types of errors do not put the Actor in the CRASHED state.

SuspendActor or PauseActor:
1. During SuspendActor or PauseActor, we put the Actor in the CRASHED state if we cannot produce a valid snapshot, which includes:
  1. We cannot connect to Ateom after retries.
  2. The runsc command fails.
  3. We cannot upload the snapshot to GCS (during suspend) or cannot save the snapshot locally (during pause).
2. All other types of errors do not put the Actor in the CRASHED state.
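
The classification rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the sentinel error names and the `ShouldCrash` helper are hypothetical and are not taken from the PR, which may tag terminal errors differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors, one per crash cause listed in the description.
// The identifiers used in the actual change may differ.
var (
	ErrSnapshotUnusable = errors.New("snapshot missing or corrupted")
	ErrRunscFailed      = errors.New("runsc command failed")
	ErrAteomUnreachable = errors.New("cannot connect to ateom after retries")
	ErrSnapshotNotSaved = errors.New("cannot upload or save snapshot")
)

// crashCauses lists the errors after which no valid snapshot is available.
var crashCauses = []error{
	ErrSnapshotUnusable,
	ErrRunscFailed,
	ErrAteomUnreachable,
	ErrSnapshotNotSaved,
}

// ShouldCrash reports whether err, or any error it wraps, is a crash cause.
// Every other error leaves the Actor's state untouched.
func ShouldCrash(err error) bool {
	for _, cause := range crashCauses {
		if errors.Is(err, cause) {
			return true
		}
	}
	return false
}

func main() {
	wrapped := fmt.Errorf("while resuming actor: %w", ErrRunscFailed)
	fmt.Println(ShouldCrash(wrapped))                         // true
	fmt.Println(ShouldCrash(errors.New("deadline exceeded"))) // false
}
```

Matching with `errors.Is` keeps the rule working when intermediate layers wrap the error with extra context.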

Dmitry Berkovich (dberkov) · 2026-06-30T05:07:36Z

+		},
+	)
+	if derr != nil {
+		return status.Error(grpcCode, err.Error())


If WithDetails fails, this silently downgrades to a plain status without ErrorInfo. WithDetails on *epb.ErrorInfo essentially never fails in practice, but if it ever does, the reason would silently disappear and a real crash would be misclassified as a transient error. An slog.Error on the fallback would catch any regression that breaks the round-trip.

Done in 1183ed3. Added slog.Err and appended the error reason to the message in case WithDetails fails.
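
A minimal, standard-library-only sketch of the fallback discussed in this thread follows. It is not the code from the PR: `reasonedError`, `attachFunc`, and `newReasonedError` are hypothetical stand-ins for the gRPC status, the WithDetails call, and the ateerrors constructor, introduced so the fallback path can be exercised without gRPC.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// reasonedError is a hypothetical stand-in for a status that carries a
// machine-readable reason next to its message.
type reasonedError struct {
	msg    string
	reason string
}

func (e *reasonedError) Error() string { return e.msg }

// attachFunc is a hypothetical stand-in for the call that attaches
// structured details; it returns an error when attaching fails.
type attachFunc func(msg, reason string) (*reasonedError, error)

// newReasonedError returns the detailed error when attaching succeeds.
// On failure it logs the problem and keeps the reason in the message,
// so the reason is never dropped silently.
func newReasonedError(msg, reason string, attach attachFunc) error {
	detailed, attachErr := attach(msg, reason)
	if attachErr != nil {
		slog.Error("could not attach error details; keeping reason in message",
			"reason", reason, "err", attachErr)
		return fmt.Errorf("%s (reason: %s)", msg, reason)
	}
	return detailed
}

func main() {
	failing := func(msg, reason string) (*reasonedError, error) {
		return nil, errors.New("marshal failed")
	}
	err := newReasonedError("restore failed", "SNAPSHOT_CORRUPTED", failing)
	fmt.Println(err) // restore failed (reason: SNAPSHOT_CORRUPTED)
}
```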

Dmitry Berkovich (dberkov) · 2026-06-30T05:11:38Z

+	}
+	actor.Status = ateapipb.Actor_STATUS_CRASHED
+	// TODO(zoezhao): Mark the worker as unhealthy if needed.
+	actor.AteomPodNamespace = ""


Could you please add a TODO noting that we need to keep the machine where the actor was running if the CRASH came from resuming a PAUSED actor? #119 has a future concept of ate actor dump that might be able to recover some of the actor's data when the image was stored locally.

Done in 1183ed3. Added TODOs for handling bad worker state and for preserving node info.

Dmitry Berkovich (dberkov) · 2026-06-30T05:15:21Z

 		return err
 	}
+
+	if actor.GetStatus() == ateapipb.Actor_STATUS_CRASHED {


Julian Gutierrez Oschmann (@juli4n) is this a pattern you want to keep, checking the state before the transition? We discussed similar behavior for other APIs, like pausing an already suspended actor.

This was just an ad hoc change for the CRASHED state, but I think a better pattern would be to add a CheckPrerequisite() step here: https://github.com/agent-substrate/substrate/blame/b755ae7388814f05418169f986cda0921cd48b03/cmd/ateapi/internal/controlapi/workflow.go#L42 to add validations like this. WDYT?
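
One possible shape for the proposed CheckPrerequisite() step is sketched below. It assumes a simplified workflow runner: the `Step` interface, `Run`, `pauseStep`, and the `Actor` and `Status` types are all hypothetical and do not mirror the actual workflow.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Status and Actor are simplified, hypothetical stand-ins for the real types.
type Status int

const (
	StatusRunning Status = iota
	StatusPausing
	StatusCrashed
)

type Actor struct{ Status Status }

// ErrPrerequisite marks failures raised by validation rather than execution.
var ErrPrerequisite = errors.New("prerequisite not met")

// Step is a hypothetical workflow step that validates state before acting.
type Step interface {
	CheckPrerequisite(a *Actor) error
	Execute(a *Actor) error
}

// Run checks each step's prerequisite immediately before executing it and
// stops at the first failure.
func Run(a *Actor, steps ...Step) error {
	for _, s := range steps {
		if err := s.CheckPrerequisite(a); err != nil {
			return fmt.Errorf("%w: %v", ErrPrerequisite, err)
		}
		if err := s.Execute(a); err != nil {
			return err
		}
	}
	return nil
}

// pauseStep refuses to run against a CRASHED actor.
type pauseStep struct{}

func (pauseStep) CheckPrerequisite(a *Actor) error {
	if a.Status == StatusCrashed {
		return errors.New("actor is CRASHED")
	}
	return nil
}

func (pauseStep) Execute(a *Actor) error {
	a.Status = StatusPausing
	return nil
}

func main() {
	crashed := &Actor{Status: StatusCrashed}
	fmt.Println(Run(crashed, pauseStep{})) // prerequisite not met: actor is CRASHED
}
```

Keeping validation in its own method lets each API declare its allowed source states in one place instead of scattering status checks through Execute.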

Dmitry Berkovich (dberkov) · 2026-06-30T05:17:39Z

 	return state.Actor.GetStatus() == ateapipb.Actor_STATUS_PAUSED, nil
 }
 func (s *CallAteletPauseStep) Execute(ctx context.Context, input *PauseInput, state *PauseState) error {
+	if state.Actor.GetStatus() != ateapipb.Actor_STATUS_PAUSING {


Julian Gutierrez Oschmann (@juli4n) same question here.

Dmitry Berkovich (dberkov) · 2026-06-30T05:18:39Z

+	}
+
 	if state.Actor.GetAteomPodNamespace() == "" {
 		return fmt.Errorf("actor is in PAUSING state but has no active worker")


Julian Gutierrez Oschmann (@juli4n) - I think this is something I discussed with you last time, when I developed this code. Is this an error, or do we need to move to the CRASHED state here too?

IMO this should also result in a CRASHED actor, because if we are trying to pause an actor and find that we don't know which Ateom it is running on, then either:

1. The Actor is actually running on some Ateom but we are not able to snapshot it, in which case we have lost the Actor's dirty state and the only thing the user can do is revert to a previous snapshot.
2. The Actor was not running on any Ateom but is somehow in the Running or Suspending state. Running actor commit will restore it to SUSPENDED.

Done in 1183ed3. Crashed the Actor if we are trying to suspend or pause it but cannot find the worker assignment.
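
The behavior agreed on in this thread could look roughly like the guard below. It is a sketch only: the `Actor` struct is reduced to two fields and `workerOrCrash` is a hypothetical helper, not a function from the PR.

```go
package main

import "fmt"

// Status and Actor are simplified, hypothetical stand-ins for the real types.
type Status string

const (
	StatusPausing Status = "PAUSING"
	StatusCrashed Status = "CRASHED"
)

type Actor struct {
	Status            Status
	AteomPodNamespace string // empty means no worker assignment is recorded
}

// workerOrCrash returns the recorded worker namespace. When none is recorded,
// no snapshot can be taken, so it marks the actor CRASHED and returns an error.
func workerOrCrash(a *Actor) (string, error) {
	if a.AteomPodNamespace == "" {
		a.Status = StatusCrashed
		return "", fmt.Errorf("actor has no worker assignment; marked %s", StatusCrashed)
	}
	return a.AteomPodNamespace, nil
}

func main() {
	a := &Actor{Status: StatusPausing}
	_, err := workerOrCrash(a)
	fmt.Println(a.Status, err != nil) // CRASHED true
}
```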

Dmitry Berkovich (dberkov) · 2026-06-30T05:20:51Z

 	})
 	if err != nil {
-		return nil, fmt.Errorf("while calling ateom.CheckpointWorkload: %w", err)
+		return nil, ateerrors.NewGRPCError(codes.DataLoss, ateerrors.ErrReasonCrashActor, err)


Is it always supposed to be crashActor?

Dmitry Berkovich (dberkov) · 2026-06-30T05:22:57Z

 	case ateletpb.CheckpointType_CHECKPOINT_TYPE_EXTERNAL:
 		if err := s.uploadExternalCheckpoint(ctx, req, checkpointDir, sandboxRec); err != nil {
-			return nil, err
+			// TODO: If we can cache the snapshot locally when it fails to upload, we won't have to crash the Actor right away.


This is an interesting case. Could you please create an issue describing it? Otherwise we will never get back to it.

Created #362

Dmitry Berkovich (dberkov)

Minor comments.
Could you please update the microvm ateom to support the same behavior as gvisor?
E2E is failing; I scheduled a retry and it still fails. Could you please run E2E in your local environment?

Zoe Zhao (zoez7) force-pushed the crashed branch 8 times, most recently from 2dcac0b to 11ae3bf Compare June 30, 2026 00:01

Zoe Zhao (zoez7) changed the title ~~Add CRASHED state to Actor~~ Detect CRASHED Actors during Checkpoint and Restore Jun 30, 2026

Zoe Zhao (zoez7) requested a review from Julian Gutierrez Oschmann (juli4n) June 30, 2026 00:04

Zoe Zhao (zoez7) marked this pull request as ready for review June 30, 2026 00:04

Zoe Zhao (zoez7) requested a review from Dmitry Berkovich (dberkov) June 30, 2026 00:04

Dmitry Berkovich (dberkov) reviewed Jun 30, 2026

Comment thread internal/ateompath/ateompath.go Outdated

Dmitry Berkovich (dberkov) requested changes Jun 30, 2026

Zoe Zhao (zoez7) added 2 commits June 30, 2026 15:14

initial draft for crash actor

c8faa27

Log error when WithDetails fail. Crash actor when Checkpointing if we…

1183ed3

… cannot find Worker assignment.

Zoe Zhao (zoez7) force-pushed the crashed branch from fdfb203 to 1183ed3 Compare June 30, 2026 22:17

Fix ST1005 lint: lowercase error string in ateerrors fallback

2fb648b

Zoe Zhao (zoez7) force-pushed the crashed branch from 85d36d8 to 4f4d89c Compare July 1, 2026 23:52

Split CRASH_ACTOR into per-cause Reasons + actorCrashed metadata; tag…

c1005dc

… terminal errors in atelet with local sentinels

Zoe Zhao (zoez7) force-pushed the crashed branch from 4f4d89c to c1005dc Compare July 2, 2026 00:02
