Skip to content

Add initial version of a style guide for substrate control plane.#351

Open
Julian Gutierrez Oschmann (juli4n) wants to merge 3 commits into
agent-substrate:mainfrom
juli4n:rework_api
Open

Add initial version of a style guide for substrate control plane.#351
Julian Gutierrez Oschmann (juli4n) wants to merge 3 commits into
agent-substrate:mainfrom
juli4n:rework_api

Conversation

@juli4n

Copy link
Copy Markdown
Collaborator

Add an initial version of a style guide for the substrate public API. It uses Google's AIP as a starting point, and clarified when we intentionally deviate from it.

Comment thread docs/api-style-guide.md Outdated

@EItanya Eitan Yarmush (EItanya) left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about linking to this from the agents.md or a similar file so that agents are always aware to keep these rules in mind?

Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md Outdated

### 2.3 Character constraints

Both `atespace` and `name` must conform to [DNS-1123 label](https://tools.ietf.org/html/rfc1123) syntax:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The referencing of DNS has become somewhat confusing in k8s, since DNS technically is case-insensitive. I think we should instead define this as a "resource name" and simply say that all resources in agent substrate are named as per the following rules, which makes them compatible with RFC-1123 DNS labels.

In k8s we were TOO flexible. I would rather add flexibility only when needed.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Done.

Comment thread docs/api-style-guide.md Outdated
Comment thread docs/api-style-guide.md Outdated
```proto
message Actor {
// worker_name references the global-scoped Worker assigned to this actor.
string worker_name = 10;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no, for symmetry, define a message anyway?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. I would be in favor of that, not so much because of symmetry, but because it allows us to pass dedicated ref types through the call stack.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread docs/api-style-guide.md Outdated
Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md Outdated
*Follows [AIP-135](https://google.aip.dev/135).*

```proto
rpc DeleteActor(DeleteActorRequest) returns (google.protobuf.Empty) {}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to return the last value when successful?

Comment thread docs/api-style-guide.md Outdated
Comment thread docs/api-style-guide.md Outdated
// Only present on resources that support soft delete.
google.protobuf.Timestamp delete_time = 6;

// version is incremented on every mutation. See §7.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to increment by 1 or to increase by an arbitrary amount? The latter would have made some things in k8s easier.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We increment by one today. My intuition would be to not make it part of the contract. Today the only usage is to establish a total order on "snapshots" of a resource (e.g. in worker cache). That only requires being monotonically increasing, not sequential.

Comment thread docs/api-style-guide.md Outdated
Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md Outdated
*Follows [AIP-135](https://google.aip.dev/135).*

```proto
rpc DeleteActor(DeleteActorRequest) returns (google.protobuf.Empty) {}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In k8s, we had to graft on a delete response type because it turns out to be super useful for our lifecycles. In k8s a resource deletion is not immediate and so tracking the post-delete lifecycle becomes interesting/complicated

I don't know what the plans are for deletion/deletion-hooks in this system, but something you could consider future-proofing against?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clarified a bit the delete story. Changed to return the last version of the resource. I'm sure we will iterate on this, for now delete is synchronous but I suspect we will switch to an async (ala k8s) model soon.

Comment thread docs/api-style-guide.md Outdated
google.protobuf.Timestamp update_time = 5;

// delete_time is set when the resource is soft-deleted.
// Only present on resources that support soft delete.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More detail on soft delete in the section about delete would have been enlightening for me, I didn't realize it was possible/supported until I read this part..

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more details.

Comment thread docs/api-style-guide.md

*Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.*

When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In k8s, we initially set up resources so clients could opt-in to optimistic concurrency (for k8s it was an RV, here it would be "version"). We ended up regretting that and in newer APIs we require optimistic-concurrency. It's not clear to me what choice you've made here, but I wanted to raise that as something we believe we got wrong initially.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that all resources will have a version field. If a client does a naive read-modify-write, the optimistic concurrency check will trigger, because the update endpoint will honor it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for RMW (but does the field mask allow the user to ignore the version field?)

For things like "patch" - should we require version and/or UID preconditions?

For things like Delete() - should we require version and/or UID preconditions?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style guide talks about version preconditions for Update/Delete. It doesn't mention UID preconditions, but it mentions how we can extend it to support it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) Should we add UID preconditions now, later, or never?

b) Should we require any preconditions for any ops or is it totally up to users?

Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md Outdated
}
```

- `atespace` is the Substrate namespace for the resource.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like you're taking roughly the same approach k8s did to namespaces. Is this done because you need to map to k8s concepts? Would a list of things we regret with the global + flat namespaces approach be useful to consider?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IT's mostly because the problem-spaces are similar. But yes,more lessons. I know a lot about the issues around namespaces, but I bet you know more :)

@jpbetz Joe Betz (jpbetz) Jun 30, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it a shot! Not all of these have consensus. So feel free to disagree on any/all of them. I consider them worth discussion, at the very least.

Global/Namespace scope split issues:

  • ClusterRoleBinding and RoleBinding are basically the same resource, but one is global scoped and the other is namespace scoped. This exemplifies that SOME resources are useful in both scopes and having a hard split between the scopes limits composition. Other examples: SubjectAccessReview vs LocalSubjectAccessReview. Note how the naming is inconsistent :( (This has proliferated into the ecosystem. E.g. ClusterIssuer vs Issuer in cert-manager).
  • Resources such as ValidatingAdmissionPolicy are global scope but have a field that limits the scope of use to a namespace (matchResources.namespaceSelector in the case of ValidatingAdmissionPolicy). This exemplifies resources that are resources lack a namespace scoped version (LocalValidatingAdmissionPolicy !?), that would allow the resource to be defined within the scope it influences.
  • We lack global ResourceQuota and LimitRange for no good reason other than the inconvenience of having to define them separately
  • IngressClass (global) relates to an Ingress (namespaced) and has an explicit scope field (set to namespace or cluster) plus a namespace field.. So even though it's global scope, it can be assigned a namespace via fields...

Reference issues:

  • PersistentVolume (global) is bound 1:1 with a PVC) (namespaced) via a reference from global scope to namespace scope. This exmplifies the need to have references from global scope to namespace scope (PV.spec.claimRef.. (APIService -> Service does this too)

  • Secret and ConfigMap are namespace scoped but often are needed across namespaces, but can't be referenced cleanly so we end up with a lot of copying and odd plumbing. Both are arguably needed at the cluster level.

Organizational issues:

  • Multiple "Namespace as a tenant" project have more or less failed due to the inability to configure a namespace from within the namespace. Attempts to graft on hierarchical namespaces have hit similar issues (as well as some issues specific to hierarchical namespaces).

My personal takeaways:

  • Deciding "should this be global or namespaced" a priori is hard to get right. Some (all ?!) resources should be both.
  • If we had hierarchical resources, then "global" would presumably just be the root namespace.
  • While a hierarchical namespace system would have been a lot more complicated to implement (and may very well have been intractable early on during k8s development), the presence of the hierarchy would have served as a forcing function to address a lot of the problems that have emerged as a result of the global/namespace split...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the notes! Nothing here is exactly "news" to me, though some were reminders of how messy it can get :)

Deciding "should this be global or namespaced" a priori is hard to get right. Some (all ?!) resources should be both.

What prevents us from using the same Kind on different URL paths in k8s? I know the answer is partly related to how Kind and Resource are mapped, but that seems like a problem of our own making.

If we had hierarchical resources, then "global" would presumably just be the root namespace.

Hierarchy is a singular concept. It allows you to define a single "relationship" but makes other relationships harder. E.g. "who can access a resource" is not necessarily the same as "who pays for the resource" nor "what quota is applied to this resource". I say this as someone who STILL thinks that kube should have at least one extra level of hierarchy, so ...

We're looking at a different sort of API here - RPC oriented (even if we have resources) and narrower (for now!). It's hard to make assertions about what we will or won't need in the future and keep a straight face - our ability to see the future has a poor track record. I wonder how we can best leave room for expansion here? For example, we could define atespaces as hierarchical but only allow a single layer for now. E.g. "" is a the root space ("global"), while "foo" is just the first layer in hierarchy and MAYBE EVENTUALLY "foo/bar/baz/qux" is nested. That has a lot of ripple on things like auth and quota and networking. If they do not implement as hierarchical, retrofitting it will be hard (or we might pick semantics that are impossible).

Is this something worth investing in now?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents us from using the same Kind on different URL paths in k8s? I know the answer is partly related to how Kind and Resource are mapped, but that seems like a problem of our own making.

It's 100% of our own making.

Is this something worth investing in now?

I don't know enough to take a strong stance on the hierarchy-or-not decision, but if you don't provide hierarchy, I'd encourage you to still consider a mental model of "root namespace and one level of namespaces under that" instead of "global and namespaced". If you squint it's the same thing, but if you consider the compositional requirements that trickle down from this, you could maybe avoid some of the mistakes k8s made.

Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md
@@ -0,0 +1,457 @@
# Substrate gRPC API Style Guide

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have (or want to have) the idea of stability levels and/or feature gating/enablement, either at the resource level or field level?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, but it's something I want to define soon!

Comment thread docs/api-style-guide.md Outdated
rpc SuspendActor(SuspendActorRequest) returns (Actor) {}

message SuspendActorRequest {
ActorRef actor = 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you have refs, how do you manage deletion such that resources don't become orphaned?

Comment thread docs/api-style-guide.md Outdated
```proto
message Worker {
// The name of this worker, globally unique (e.g. the node name).
string name = 1;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a resource with a given name is deleted, can it be created again? How are you planning on dealing with the per-lifecycle identity?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can. Each resource will have a UID field which is server side generated, which is the per-lifecycle identity.

Comment thread docs/api-style-guide.md

// Pagination token for the next page.
// Empty if this is the last page.
string next_page_token = 2;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The absence of a field with the "total count" or "remaining item count" is notable. Is this deliberate? This can be expensive to compute. If you're deliberately choosing not to provide this, I recommend making the decision explicit here and including the rationale so that you fully lock in the decision.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was deliberate, yes, as you said, depending on how the data model is designed it could be hard to compute. Added.

Comment thread docs/api-style-guide.md
Comment thread docs/api-style-guide.md

message DeleteActorRequest {
ActorRef actor = 1;
int64 version = 2; // optional freshness guard; 0 = skip check

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should add optional UID precodition here

message Preconditions {
    int64 version = 1; 0 = do not compare
    string uid = 2; "" = do not compare
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how strong you feel about this. In the current version I'm proposing top level fields for this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want the same preconditions available for all resources, right? So we will either dedup them in a struct or embed them in every DeleteFooRequest -- struct seems better (same for dry run an other CreateOptions, etc)?

@thockin

Copy link
Copy Markdown
Collaborator

Not exactly related to style/convention, but something we might consider:

a problem in k8s is "history". I sometimes want to delete something but later find it is useful to cross-reference. A long time back we talked about having a "hardlink" model where you could read by UID or by name. When you deleted an object it could still be found by its uid for X time.

Is that something we want to think about?

* Reword what a valid resource name is.
* Remove authz related sections.
* Remove confusing text about renaming resources.
Comment thread docs/api-style-guide.md
```proto
message Actor {
// The atespace this actor belongs to.
ResourceMetadata meta = 1;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nitpick - "metadata" to be like k8s?

Comment thread docs/api-style-guide.md


message ResourceMetadata {
// The atespace this resource belongs to.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment that this is "" for global resources?

Comment thread docs/api-style-guide.md
### 2.3 Character constraints

Both `atespace` and `name` must be valid resource names.
A valid resource name must comply the following rules:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"comply with"

Comment thread docs/api-style-guide.md

message DeleteActorRequest {
ObjectRef actor = 1;
int64 version = 2;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave the version out here and just add // ...other fields, otherwise it is confusing

Comment thread docs/api-style-guide.md

- Do not embed the full resource message as a reference field.
- Do not use a single combined string like `"atespace/name"`. Callers would have to parse it.
- Do not use a plain `string {resource}_name` field for global-scoped references — use `ObjectRef` for consistency and type safety.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - often refs will be within the same atespace. Are we expected to fill in the atespace field or leave it empty and validate that it is not specified?

Comment thread docs/api-style-guide.md
- RPC name **must** begin with `Delete` followed by the singular resource name.
- Response **must** be the deleted resource.
- Request **must** identify the resource with an `ObjectRef` field (for both atespace-scoped and global-scoped resources).
- If the resource does not exist: return `NOT_FOUND`.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes Delete non-idempotent. Do we think this matters?

Comment thread docs/api-style-guide.md
- RPC name **must** begin with `Update` followed by the singular resource name.
- Response **must** be the resource itself — not an `UpdateActorResponse` wrapper.
- `update_mask` **must** be of type `google.protobuf.FieldMask` and **must** be named `update_mask`.
- `update_mask` is **required**. An absent or empty mask **must** return `INVALID_ARGUMENT`.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if output-only fields are included in the mask? Ignored or error?

Comment thread docs/api-style-guide.md
// Immutable throughout the lifecycle of the resource.
string uid = 3;

// version is incremented on every mutation.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we say "increased" rather than incremented? That leaves more room

Comment thread docs/api-style-guide.md
google.protobuf.Timestamp create_time = 5;

// update_time is the time the resource was last updated by a user action.
google.protobuf.Timestamp update_time = 6;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume delete_time comes when we have soft delete?

Comment thread docs/api-style-guide.md

*Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.*

When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) Should we add UID preconditions now, later, or never?

b) Should we require any preconditions for any ops or is it totally up to users?

Eitan Yarmush (EItanya) added a commit to kagent-dev/substrate that referenced this pull request Jul 2, 2026
Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate,
WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes
CRDs), following the draft API style guide (agent-substrate#351), plus design
rationale and CRD migration notes. For review only: codegen and
server implementation intentionally lag the proto.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eitan Yarmush (EItanya) added a commit to kagent-dev/substrate that referenced this pull request Jul 2, 2026
Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate,
WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes
CRDs), following the draft API style guide (agent-substrate#351), plus design
rationale and CRD migration notes. For review only: codegen and
server implementation intentionally lag the proto.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by-:
Eitan Yarmush (EItanya) added a commit to kagent-dev/substrate that referenced this pull request Jul 2, 2026
Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate,
WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes
CRDs), following the draft API style guide (agent-substrate#351), plus design
rationale and CRD migration notes. For review only: codegen and
server implementation intentionally lag the proto.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants