Detect CRASHED Actors during Checkpoint and Restore#353
Detect CRASHED Actors during Checkpoint and Restore#353Zoe Zhao (zoez7) wants to merge 4 commits into
CRASHED Actors during Checkpoint and Restore#353Conversation
2dcac0b to
11ae3bf
Compare
CRASHED state to ActorCRASHED Actors during Checkpoint and Restore
| }, | ||
| ) | ||
| if derr != nil { | ||
| return status.Error(grpcCode, err.Error()) |
There was a problem hiding this comment.
it silently downgrades to a plain status without ErrorInfo. WithDetails on *epb.ErrorInfo essentially never fails in practice, but if it ever will, the reason would silently disappear and a real crash would be misclassified as a transient error. An slog.Error on the fallback would catch any regression that breaks the round-trip.
There was a problem hiding this comment.
Done in 1183ed3 . Added slog.Err and appended the error reason to the message in case WithDetails fail.
| } | ||
| actor.Status = ateapipb.Actor_STATUS_CRASHED | ||
| // TODO(zoezhao): Mark the worker as unhealthy if needed. | ||
| actor.AteomPodNamespace = "" |
There was a problem hiding this comment.
could you please add TODO, that we need to keep the machine where the actor was running if CRASH came from the resume of PAUSED state. The #119 has a future concept of ate actor dump that might can recover some data of actor when the image was stored locally.
There was a problem hiding this comment.
Done in 1183ed3 . Added TODOs to handle bad worker state as well as preserving node info.
| return err | ||
| } | ||
|
|
||
| if actor.GetStatus() == ateapipb.Actor_STATUS_CRASHED { |
There was a problem hiding this comment.
Julian Gutierrez Oschmann (@juli4n) is this is a pattern you want to keep to check the state before transition? We discussed similar behavior for other APIs, like pause already suspended actor.
There was a problem hiding this comment.
This was just a adhoc change for the CRASHED state, but I think a better pattern would be to add a CheckPrerequisite() step here: https://github.com/agent-substrate/substrate/blame/b755ae7388814f05418169f986cda0921cd48b03/cmd/ateapi/internal/controlapi/workflow.go#L42 to add validations like this, wdyt?
| return state.Actor.GetStatus() == ateapipb.Actor_STATUS_PAUSED, nil | ||
| } | ||
| func (s *CallAteletPauseStep) Execute(ctx context.Context, input *PauseInput, state *PauseState) error { | ||
| if state.Actor.GetStatus() != ateapipb.Actor_STATUS_PAUSING { |
There was a problem hiding this comment.
Julian Gutierrez Oschmann (@juli4n) same question here.
| } | ||
|
|
||
| if state.Actor.GetAteomPodNamespace() == "" { | ||
| return fmt.Errorf("actor is in PAUSING state but has no active worker") |
There was a problem hiding this comment.
Julian Gutierrez Oschmann (@juli4n) - I think this is something that I discussed with you last time when I developed this code. Is it error or we need to move to CRASH state here too.
There was a problem hiding this comment.
IMO this should also result in a CRASHED actor, because if we are trying to Pause an actor, and found out we don't know which Ateom it is running on, either
- The Actor is actually running on some Ateom, but we are not able to snapshot, in which case we have lost Actor's dirty state, and the only thing user can do is to revert to a previous snapshot.
- The Actor was not running on any Ateom, but somehow is in Running or Suspending state. Running
actor commitwill restore it to SUSPENDED.
There was a problem hiding this comment.
Done in 1183ed3 . Crashed the Actor if we are trying to suspend or pause it, but couldn't find the worker assignment.
| }) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("while calling ateom.CheckpointWorkload: %w", err) | ||
| return nil, ateerrors.NewGRPCError(codes.DataLoss, ateerrors.ErrReasonCrashActor, err) |
There was a problem hiding this comment.
is it always supposed to be crashActor?
| case ateletpb.CheckpointType_CHECKPOINT_TYPE_EXTERNAL: | ||
| if err := s.uploadExternalCheckpoint(ctx, req, checkpointDir, sandboxRec); err != nil { | ||
| return nil, err | ||
| // TODO: If we can cache the snapshot locally when it fails to upload, we won't have to crash the Actor right away. |
There was a problem hiding this comment.
This is an interesting case. Could you please create an issue describing this case, otherwise we will never back to it.
Dmitry Berkovich (dberkov)
left a comment
There was a problem hiding this comment.
- minor comments.
- could you please update microvm ateom to support same behavior as for gvisor.
- E2E is failing, i scheduled retry and it still fails. Could you please run e2e on your local environment.
… cannot find Worker assignment.
fdfb203 to
1183ed3
Compare
… terminal errors in atelet with local sentinels
Related to #292 and #119
Changes:
ResumeActorwe'll put the actor in CRASHED state if the Resume failed due to any errors related to the snapshot itself, which includes:SuspendActororPauseActorwe'll put the actor in CRASHED state if we cannot produce a valid snapshot, which includes: