Add initial version of a style guide for substrate control plane. by juli4n · Pull Request #351 · agent-substrate/substrate

Julian Gutierrez Oschmann (juli4n) · 2026-06-29T20:04:47Z

Add an initial version of a style guide for the substrate public API. It uses Google's AIP as a starting point, and clarified when we intentionally deviate from it.

Eitan Yarmush (EItanya)

What do you think about linking to this from the agents.md or a similar file so that agents are always aware to keep these rules in mind?

Tim Hockin (thockin) · 2026-06-29T22:25:00Z

+
+### 2.3 Character constraints
+
+Both `atespace` and `name` must conform to [DNS-1123 label](https://tools.ietf.org/html/rfc1123) syntax:


The referencing of DNS has become somewhat confusing in k8s, since DNS technically is case-insensitive. I think we should instead define this as a "resource name" and simply say that all resources in agent substrate are named as per the following rules, which makes them compatible with RFC-1123 DNS labels.

In k8s we were TOO flexible. I would rather add flexibility only when needed.

Agree. Done.

Tim Hockin (thockin) · 2026-06-29T22:36:22Z

+```proto
+message Actor {
+  // worker_name references the global-scoped Worker assigned to this actor.
+  string worker_name = 10;


Why no, for symmetry, define a message anyway?

See my comment above. I would be in favor of that, not so much because of symmetry, but because it allows us to pass dedicated ref types through the call stack.

Tim Hockin (thockin) · 2026-06-29T23:09:08Z

+*Follows [AIP-135](https://google.aip.dev/135).*
+
+```proto
+rpc DeleteActor(DeleteActorRequest) returns (google.protobuf.Empty) {}


Is there a reason not to return the last value when successful?

Tim Hockin (thockin) · 2026-06-29T23:16:43Z

+  // Only present on resources that support soft delete.
+  google.protobuf.Timestamp delete_time = 6;
+
+  // version is incremented on every mutation. See §7.


Is this guaranteed to increment by 1 or to increase by an arbitrary amount? The latter would have made some things in k8s easier.

We increment by one today. My intuition would be to not make it part of the contract. Today the only usage is to establish a total order on "snapshots" of a resource (e.g. in worker cache). That only requires being monotonically increasing, not sequential.

Joe Betz (jpbetz) · 2026-06-29T23:49:49Z

+*Follows [AIP-135](https://google.aip.dev/135).*
+
+```proto
+rpc DeleteActor(DeleteActorRequest) returns (google.protobuf.Empty) {}


In k8s, we had to graft on a delete response type because it turns out to be super useful for our lifecycles. In k8s a resource deletion is not immediate and so tracking the post-delete lifecycle becomes interesting/complicated

I don't know what the plans are for deletion/deletion-hooks in this system, but something you could consider future-proofing against?

I clarified a bit the delete story. Changed to return the last version of the resource. I'm sure we will iterate on this, for now delete is synchronous but I suspect we will switch to an async (ala k8s) model soon.

Joe Betz (jpbetz) · 2026-06-29T23:53:37Z

+  google.protobuf.Timestamp update_time = 5;
+
+  // delete_time is set when the resource is soft-deleted.
+  // Only present on resources that support soft delete.


More detail on soft delete in the section about delete would have been enlightening for me, I didn't realize it was possible/supported until I read this part..

Added some more details.

Joe Betz (jpbetz) · 2026-06-29T23:58:17Z

+
+*Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.*
+
+When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes.


In k8s, we initially set up resources so clients could opt-in to optimistic concurrency (for k8s it was an RV, here it would be "version"). We ended up regretting that and in newer APIs we require optimistic-concurrency. It's not clear to me what choice you've made here, but I wanted to raise that as something we believe we got wrong initially.

The idea is that all resources will have a version field. If a client does a naive read-modify-write, the optimistic concurrency check will trigger, because the update endpoint will honor it.

That works for RMW (but does the field mask allow the user to ignore the version field?)

For things like "patch" - should we require version and/or UID preconditions?

For things like Delete() - should we require version and/or UID preconditions?

The style guide talks about version preconditions for Update/Delete. It doesn't mention UID preconditions, but it mentions how we can extend it to support it.

a) Should we add UID preconditions now, later, or never?

b) Should we require any preconditions for any ops or is it totally up to users?

Joe Betz (jpbetz) · 2026-06-30T00:06:36Z

+}
+```
+
+- `atespace` is the Substrate namespace for the resource.


Sounds like you're taking roughly the same approach k8s did to namespaces. Is this done because you need to map to k8s concepts? Would a list of things we regret with the global + flat namespaces approach be useful to consider?

IT's mostly because the problem-spaces are similar. But yes,more lessons. I know a lot about the issues around namespaces, but I bet you know more :)

I'll give it a shot! Not all of these have consensus. So feel free to disagree on any/all of them. I consider them worth discussion, at the very least.

Global/Namespace scope split issues:

ClusterRoleBinding and RoleBinding are basically the same resource, but one is global scoped and the other is namespace scoped. This exemplifies that SOME resources are useful in both scopes and having a hard split between the scopes limits composition. Other examples: SubjectAccessReview vs LocalSubjectAccessReview. Note how the naming is inconsistent :( (This has proliferated into the ecosystem. E.g. ClusterIssuer vs Issuer in cert-manager).

Resources such as ValidatingAdmissionPolicy are global scope but have a field that limits the scope of use to a namespace (matchResources.namespaceSelector in the case of ValidatingAdmissionPolicy). This exemplifies resources that are resources lack a namespace scoped version (LocalValidatingAdmissionPolicy !?), that would allow the resource to be defined within the scope it influences.

We lack global ResourceQuota and LimitRange for no good reason other than the inconvenience of having to define them separately

IngressClass (global) relates to an Ingress (namespaced) and has an explicit scope field (set to namespace or cluster) plus a namespace field.. So even though it's global scope, it can be assigned a namespace via fields...

Reference issues:

PersistentVolume (global) is bound 1:1 with a PVC) (namespaced) via a reference from global scope to namespace scope. This exmplifies the need to have references from global scope to namespace scope (PV.spec.claimRef.. (APIService -> Service does this too)

Secret and ConfigMap are namespace scoped but often are needed across namespaces, but can't be referenced cleanly so we end up with a lot of copying and odd plumbing. Both are arguably needed at the cluster level.

Organizational issues:

Multiple "Namespace as a tenant" project have more or less failed due to the inability to configure a namespace from within the namespace. Attempts to graft on hierarchical namespaces have hit similar issues (as well as some issues specific to hierarchical namespaces).

My personal takeaways:

Deciding "should this be global or namespaced" a priori is hard to get right. Some (all ?!) resources should be both.

If we had hierarchical resources, then "global" would presumably just be the root namespace.

While a hierarchical namespace system would have been a lot more complicated to implement (and may very well have been intractable early on during k8s development), the presence of the hierarchy would have served as a forcing function to address a lot of the problems that have emerged as a result of the global/namespace split...

Thanks for the notes! Nothing here is exactly "news" to me, though some were reminders of how messy it can get :)

Deciding "should this be global or namespaced" a priori is hard to get right. Some (all ?!) resources should be both.

What prevents us from using the same Kind on different URL paths in k8s? I know the answer is partly related to how Kind and Resource are mapped, but that seems like a problem of our own making.

If we had hierarchical resources, then "global" would presumably just be the root namespace.

Hierarchy is a singular concept. It allows you to define a single "relationship" but makes other relationships harder. E.g. "who can access a resource" is not necessarily the same as "who pays for the resource" nor "what quota is applied to this resource". I say this as someone who STILL thinks that kube should have at least one extra level of hierarchy, so ...

We're looking at a different sort of API here - RPC oriented (even if we have resources) and narrower (for now!). It's hard to make assertions about what we will or won't need in the future and keep a straight face - our ability to see the future has a poor track record. I wonder how we can best leave room for expansion here? For example, we could define atespaces as hierarchical but only allow a single layer for now. E.g. "" is a the root space ("global"), while "foo" is just the first layer in hierarchy and MAYBE EVENTUALLY "foo/bar/baz/qux" is nested. That has a lot of ripple on things like auth and quota and networking. If they do not implement as hierarchical, retrofitting it will be hard (or we might pick semantics that are impossible).

Is this something worth investing in now?

What prevents us from using the same Kind on different URL paths in k8s? I know the answer is partly related to how Kind and Resource are mapped, but that seems like a problem of our own making.

It's 100% of our own making.

Is this something worth investing in now?

I don't know enough to take a strong stance on the hierarchy-or-not decision, but if you don't provide hierarchy, I'd encourage you to still consider a mental model of "root namespace and one level of namespaces under that" instead of "global and namespaced". If you squint it's the same thing, but if you consider the compositional requirements that trickle down from this, you could maybe avoid some of the mistakes k8s made.

Joe Betz (jpbetz) · 2026-06-30T00:14:22Z

@@ -0,0 +1,457 @@
+# Substrate gRPC API Style Guide


Do you have (or want to have) the idea of stability levels and/or feature gating/enablement, either at the resource level or field level?

Not yet, but it's something I want to define soon!

Joe Betz (jpbetz) · 2026-06-30T00:15:27Z

+rpc SuspendActor(SuspendActorRequest) returns (Actor) {}
+
+message SuspendActorRequest {
+  ActorRef actor = 1;


Given that you have refs, how do you manage deletion such that resources don't become orphaned?

Joe Betz (jpbetz) · 2026-06-30T15:26:27Z

+```proto
+message Worker {
+  // The name of this worker, globally unique (e.g. the node name).
+  string name = 1;


After a resource with a given name is deleted, can it be created again? How are you planning on dealing with the per-lifecycle identity?

Yes, you can. Each resource will have a UID field which is server side generated, which is the per-lifecycle identity.

Joe Betz (jpbetz) · 2026-06-30T15:33:50Z

+
+  // Pagination token for the next page.
+  // Empty if this is the last page.
+  string next_page_token = 2;


The absence of a field with the "total count" or "remaining item count" is notable. Is this deliberate? This can be expensive to compute. If you're deliberately choosing not to provide this, I recommend making the decision explicit here and including the rationale so that you fully lock in the decision.

It was deliberate, yes, as you said, depending on how the data model is designed it could be hard to compute. Added.

Tim Hockin (thockin) · 2026-06-30T16:00:23Z

+
+message DeleteActorRequest {
+  ActorRef actor   = 1;
+  int64     version = 2; // optional freshness guard; 0 = skip check


probably should add optional UID precodition here

message Preconditions { int64 version = 1; 0 = do not compare string uid = 2; "" = do not compare }

Not sure how strong you feel about this. In the current version I'm proposing top level fields for this.

I think we want the same preconditions available for all resources, right? So we will either dedup them in a struct or embed them in every DeleteFooRequest -- struct seems better (same for dry run an other CreateOptions, etc)?

Tim Hockin (thockin) · 2026-06-30T16:51:58Z

Not exactly related to style/convention, but something we might consider:

a problem in k8s is "history". I sometimes want to delete something but later find it is useful to cross-reference. A long time back we talked about having a "hardlink" model where you could read by UID or by name. When you deleted an object it could still be found by its uid for X time.

Is that something we want to think about?

* Reword what a valid resource name is. * Remove authz related sections. * Remove confusing text about renaming resources.

Tim Hockin (thockin) · 2026-07-02T00:37:13Z

+```proto
+message Actor {
+  // The atespace this actor belongs to.
+  ResourceMetadata meta = 1;


super nitpick - "metadata" to be like k8s?

Tim Hockin (thockin) · 2026-07-02T00:38:48Z

+
+
+message ResourceMetadata {
+  // The atespace this resource belongs to.


Comment that this is "" for global resources?

Tim Hockin (thockin) · 2026-07-02T00:39:36Z

+### 2.3 Character constraints
+
+Both `atespace` and `name` must be valid resource names.
+A valid resource name must comply the following rules:


"comply with"

Tim Hockin (thockin) · 2026-07-02T00:40:48Z

+
+message DeleteActorRequest {
+  ObjectRef actor   = 1;
+  int64     version = 2;


I would leave the version out here and just add // ...other fields, otherwise it is confusing

Tim Hockin (thockin) · 2026-07-02T00:43:48Z

+
+- Do not embed the full resource message as a reference field.
+- Do not use a single combined string like `"atespace/name"`. Callers would have to parse it.
+- Do not use a plain `string {resource}_name` field for global-scoped references — use `ObjectRef` for consistency and type safety.


Question - often refs will be within the same atespace. Are we expected to fill in the atespace field or leave it empty and validate that it is not specified?

Tim Hockin (thockin) · 2026-07-02T00:53:24Z

+- RPC name **must** begin with `Delete` followed by the singular resource name.
+- Response **must** be the deleted resource.
+- Request **must** identify the resource with an `ObjectRef` field (for both atespace-scoped and global-scoped resources).
+- If the resource does not exist: return `NOT_FOUND`.


This makes Delete non-idempotent. Do we think this matters?

Tim Hockin (thockin) · 2026-07-02T01:06:33Z

+- RPC name **must** begin with `Update` followed by the singular resource name.
+- Response **must** be the resource itself — not an `UpdateActorResponse` wrapper.
+- `update_mask` **must** be of type `google.protobuf.FieldMask` and **must** be named `update_mask`.
+- `update_mask` is **required**. An absent or empty mask **must** return `INVALID_ARGUMENT`.


What happens if output-only fields are included in the mask? Ignored or error?

Tim Hockin (thockin) · 2026-07-02T01:07:12Z

+  // Immutable throughout the lifecycle of the resource.
+  string uid = 3;
+
+  // version is incremented on every mutation.


Can we say "increased" rather than incremented? That leaves more room

Tim Hockin (thockin) · 2026-07-02T01:07:48Z

+  google.protobuf.Timestamp create_time = 5;
+
+  // update_time is the time the resource was last updated by a user action.
+  google.protobuf.Timestamp update_time = 6;


I assume delete_time comes when we have soft delete?

Tim Hockin (thockin) · 2026-07-02T01:10:11Z

+
+*Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.*
+
+When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes.


a) Should we add UID preconditions now, later, or never?

b) Should we require any preconditions for any ops or is it totally up to users?

Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate, WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes CRDs), following the draft API style guide (agent-substrate#351), plus design rationale and CRD migration notes. For review only: codegen and server implementation intentionally lag the proto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate, WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes CRDs), following the draft API style guide (agent-substrate#351), plus design rationale and CRD migration notes. For review only: codegen and server implementation intentionally lag the proto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by-:

Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate, WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes CRDs), following the draft API style guide (agent-substrate#351), plus design rationale and CRD migration notes. For review only: codegen and server implementation intentionally lag the proto.

Haven Xia (HavenXia) previously approved these changes Jun 29, 2026

View reviewed changes

Comment thread docs/api-style-guide.md Outdated

Add initial version of a style guide for substrate control plane.

625d0f0

Julian Gutierrez Oschmann (juli4n) dismissed Haven Xia (HavenXia)’s stale review via 625d0f0 June 29, 2026 21:49

Julian Gutierrez Oschmann (juli4n) force-pushed the rework_api branch from 272abae to 625d0f0 Compare June 29, 2026 21:49

Eitan Yarmush (EItanya) reviewed Jun 29, 2026

View reviewed changes

Tim Hockin (thockin) reviewed Jun 29, 2026

View reviewed changes

Joe Betz (jpbetz) reviewed Jun 29, 2026

View reviewed changes

Comment thread docs/api-style-guide.md

Joe Betz (jpbetz) reviewed Jun 30, 2026

View reviewed changes

Tim Hockin (thockin) reviewed Jun 30, 2026

View reviewed changes

Comment thread docs/api-style-guide.md

Tim Hockin (thockin) reviewed Jun 30, 2026

View reviewed changes

A few changes to the style guide.

6d75446

* Reword what a valid resource name is. * Remove authz related sections. * Remove confusing text about renaming resources.

Julian Gutierrez Oschmann (juli4n) force-pushed the rework_api branch from 6c77820 to f50884f Compare July 1, 2026 21:57

Reflect some of the decisions discussed today 1:1.

b992e07

Julian Gutierrez Oschmann (juli4n) force-pushed the rework_api branch from f50884f to b992e07 Compare July 1, 2026 22:03

Tim Hockin (thockin) reviewed Jul 2, 2026

View reviewed changes

Eitan Yarmush (EItanya) mentioned this pull request Jul 2, 2026

design: expected final ateapi.proto for decoupled substrate resources (#368) #378

Draft


		### 2.3 Character constraints

		Both `atespace` and `name` must conform to [DNS-1123 label](https://tools.ietf.org/html/rfc1123) syntax:


		Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.

		When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes.



		message ResourceMetadata {
		// The atespace this resource belongs to.

Uh oh!

Conversation

Julian Gutierrez Oschmann (juli4n) commented Jun 29, 2026

Uh oh!

Uh oh!

Eitan Yarmush (EItanya) left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Joe Betz (jpbetz) Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Joe Betz (jpbetz) Jun 30, 2026 •

edited

Loading