Add initial version of a style guide for substrate control plane.#351
Add initial version of a style guide for substrate control plane.#351Julian Gutierrez Oschmann (juli4n) wants to merge 3 commits into
Conversation
272abae to
625d0f0
Compare
Eitan Yarmush (EItanya)
left a comment
There was a problem hiding this comment.
What do you think about linking to this from the agents.md or a similar file so that agents are always aware to keep these rules in mind?
|
|
||
| ### 2.3 Character constraints | ||
|
|
||
| Both `atespace` and `name` must conform to [DNS-1123 label](https://tools.ietf.org/html/rfc1123) syntax: |
There was a problem hiding this comment.
The referencing of DNS has become somewhat confusing in k8s, since DNS technically is case-insensitive. I think we should instead define this as a "resource name" and simply say that all resources in agent substrate are named as per the following rules, which makes them compatible with RFC-1123 DNS labels.
In k8s we were TOO flexible. I would rather add flexibility only when needed.
There was a problem hiding this comment.
Agree. Done.
| ```proto | ||
| message Actor { | ||
| // worker_name references the global-scoped Worker assigned to this actor. | ||
| string worker_name = 10; |
There was a problem hiding this comment.
Why no, for symmetry, define a message anyway?
There was a problem hiding this comment.
See my comment above. I would be in favor of that, not so much because of symmetry, but because it allows us to pass dedicated ref types through the call stack.
There was a problem hiding this comment.
Done
| *Follows [AIP-135](https://google.aip.dev/135).* | ||
|
|
||
| ```proto | ||
| rpc DeleteActor(DeleteActorRequest) returns (google.protobuf.Empty) {} |
There was a problem hiding this comment.
Is there a reason not to return the last value when successful?
| // Only present on resources that support soft delete. | ||
| google.protobuf.Timestamp delete_time = 6; | ||
|
|
||
| // version is incremented on every mutation. See §7. |
There was a problem hiding this comment.
Is this guaranteed to increment by 1 or to increase by an arbitrary amount? The latter would have made some things in k8s easier.
There was a problem hiding this comment.
We increment by one today. My intuition would be to not make it part of the contract. Today the only usage is to establish a total order on "snapshots" of a resource (e.g. in worker cache). That only requires being monotonically increasing, not sequential.
| *Follows [AIP-135](https://google.aip.dev/135).* | ||
|
|
||
| ```proto | ||
| rpc DeleteActor(DeleteActorRequest) returns (google.protobuf.Empty) {} |
There was a problem hiding this comment.
In k8s, we had to graft on a delete response type because it turns out to be super useful for our lifecycles. In k8s a resource deletion is not immediate and so tracking the post-delete lifecycle becomes interesting/complicated
I don't know what the plans are for deletion/deletion-hooks in this system, but something you could consider future-proofing against?
There was a problem hiding this comment.
I clarified a bit the delete story. Changed to return the last version of the resource. I'm sure we will iterate on this, for now delete is synchronous but I suspect we will switch to an async (ala k8s) model soon.
| google.protobuf.Timestamp update_time = 5; | ||
|
|
||
| // delete_time is set when the resource is soft-deleted. | ||
| // Only present on resources that support soft delete. |
There was a problem hiding this comment.
More detail on soft delete in the section about delete would have been enlightening for me, I didn't realize it was possible/supported until I read this part..
There was a problem hiding this comment.
Added some more details.
|
|
||
| *Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.* | ||
|
|
||
| When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes. |
There was a problem hiding this comment.
In k8s, we initially set up resources so clients could opt-in to optimistic concurrency (for k8s it was an RV, here it would be "version"). We ended up regretting that and in newer APIs we require optimistic-concurrency. It's not clear to me what choice you've made here, but I wanted to raise that as something we believe we got wrong initially.
There was a problem hiding this comment.
The idea is that all resources will have a version field. If a client does a naive read-modify-write, the optimistic concurrency check will trigger, because the update endpoint will honor it.
There was a problem hiding this comment.
That works for RMW (but does the field mask allow the user to ignore the version field?)
For things like "patch" - should we require version and/or UID preconditions?
For things like Delete() - should we require version and/or UID preconditions?
There was a problem hiding this comment.
The style guide talks about version preconditions for Update/Delete. It doesn't mention UID preconditions, but it mentions how we can extend it to support it.
There was a problem hiding this comment.
a) Should we add UID preconditions now, later, or never?
b) Should we require any preconditions for any ops or is it totally up to users?
| } | ||
| ``` | ||
|
|
||
| - `atespace` is the Substrate namespace for the resource. |
There was a problem hiding this comment.
Sounds like you're taking roughly the same approach k8s did to namespaces. Is this done because you need to map to k8s concepts? Would a list of things we regret with the global + flat namespaces approach be useful to consider?
There was a problem hiding this comment.
IT's mostly because the problem-spaces are similar. But yes,more lessons. I know a lot about the issues around namespaces, but I bet you know more :)
There was a problem hiding this comment.
I'll give it a shot! Not all of these have consensus. So feel free to disagree on any/all of them. I consider them worth discussion, at the very least.
Global/Namespace scope split issues:
ClusterRoleBindingandRoleBindingare basically the same resource, but one is global scoped and the other is namespace scoped. This exemplifies that SOME resources are useful in both scopes and having a hard split between the scopes limits composition. Other examples:SubjectAccessReviewvsLocalSubjectAccessReview. Note how the naming is inconsistent :( (This has proliferated into the ecosystem. E.g.ClusterIssuervsIssuerin cert-manager).- Resources such as
ValidatingAdmissionPolicyare global scope but have a field that limits the scope of use to a namespace (matchResources.namespaceSelectorin the case ofValidatingAdmissionPolicy). This exemplifies resources that are resources lack a namespace scoped version (LocalValidatingAdmissionPolicy!?), that would allow the resource to be defined within the scope it influences. - We lack global
ResourceQuotaandLimitRangefor no good reason other than the inconvenience of having to define them separately IngressClass(global) relates to anIngress(namespaced) and has an explicitscopefield (set to namespace or cluster) plus a namespace field.. So even though it's global scope, it can be assigned a namespace via fields...
Reference issues:
-
PersistentVolume(global) is bound 1:1 with aPVC) (namespaced) via a reference from global scope to namespace scope. This exmplifies the need to have references from global scope to namespace scope (PV.spec.claimRef.. (APIService -> Service does this too) -
SecretandConfigMapare namespace scoped but often are needed across namespaces, but can't be referenced cleanly so we end up with a lot of copying and odd plumbing. Both are arguably needed at the cluster level.
Organizational issues:
- Multiple "Namespace as a tenant" project have more or less failed due to the inability to configure a namespace from within the namespace. Attempts to graft on hierarchical namespaces have hit similar issues (as well as some issues specific to hierarchical namespaces).
My personal takeaways:
- Deciding "should this be global or namespaced" a priori is hard to get right. Some (all ?!) resources should be both.
- If we had hierarchical resources, then "global" would presumably just be the root namespace.
- While a hierarchical namespace system would have been a lot more complicated to implement (and may very well have been intractable early on during k8s development), the presence of the hierarchy would have served as a forcing function to address a lot of the problems that have emerged as a result of the global/namespace split...
There was a problem hiding this comment.
Thanks for the notes! Nothing here is exactly "news" to me, though some were reminders of how messy it can get :)
Deciding "should this be global or namespaced" a priori is hard to get right. Some (all ?!) resources should be both.
What prevents us from using the same Kind on different URL paths in k8s? I know the answer is partly related to how Kind and Resource are mapped, but that seems like a problem of our own making.
If we had hierarchical resources, then "global" would presumably just be the root namespace.
Hierarchy is a singular concept. It allows you to define a single "relationship" but makes other relationships harder. E.g. "who can access a resource" is not necessarily the same as "who pays for the resource" nor "what quota is applied to this resource". I say this as someone who STILL thinks that kube should have at least one extra level of hierarchy, so ...
We're looking at a different sort of API here - RPC oriented (even if we have resources) and narrower (for now!). It's hard to make assertions about what we will or won't need in the future and keep a straight face - our ability to see the future has a poor track record. I wonder how we can best leave room for expansion here? For example, we could define atespaces as hierarchical but only allow a single layer for now. E.g. "" is a the root space ("global"), while "foo" is just the first layer in hierarchy and MAYBE EVENTUALLY "foo/bar/baz/qux" is nested. That has a lot of ripple on things like auth and quota and networking. If they do not implement as hierarchical, retrofitting it will be hard (or we might pick semantics that are impossible).
Is this something worth investing in now?
There was a problem hiding this comment.
What prevents us from using the same Kind on different URL paths in k8s? I know the answer is partly related to how Kind and Resource are mapped, but that seems like a problem of our own making.
It's 100% of our own making.
Is this something worth investing in now?
I don't know enough to take a strong stance on the hierarchy-or-not decision, but if you don't provide hierarchy, I'd encourage you to still consider a mental model of "root namespace and one level of namespaces under that" instead of "global and namespaced". If you squint it's the same thing, but if you consider the compositional requirements that trickle down from this, you could maybe avoid some of the mistakes k8s made.
| @@ -0,0 +1,457 @@ | |||
| # Substrate gRPC API Style Guide | |||
There was a problem hiding this comment.
Do you have (or want to have) the idea of stability levels and/or feature gating/enablement, either at the resource level or field level?
There was a problem hiding this comment.
Not yet, but it's something I want to define soon!
| rpc SuspendActor(SuspendActorRequest) returns (Actor) {} | ||
|
|
||
| message SuspendActorRequest { | ||
| ActorRef actor = 1; |
There was a problem hiding this comment.
Given that you have refs, how do you manage deletion such that resources don't become orphaned?
| ```proto | ||
| message Worker { | ||
| // The name of this worker, globally unique (e.g. the node name). | ||
| string name = 1; |
There was a problem hiding this comment.
After a resource with a given name is deleted, can it be created again? How are you planning on dealing with the per-lifecycle identity?
There was a problem hiding this comment.
Yes, you can. Each resource will have a UID field which is server side generated, which is the per-lifecycle identity.
|
|
||
| // Pagination token for the next page. | ||
| // Empty if this is the last page. | ||
| string next_page_token = 2; |
There was a problem hiding this comment.
The absence of a field with the "total count" or "remaining item count" is notable. Is this deliberate? This can be expensive to compute. If you're deliberately choosing not to provide this, I recommend making the decision explicit here and including the rationale so that you fully lock in the decision.
There was a problem hiding this comment.
It was deliberate, yes, as you said, depending on how the data model is designed it could be hard to compute. Added.
|
|
||
| message DeleteActorRequest { | ||
| ActorRef actor = 1; | ||
| int64 version = 2; // optional freshness guard; 0 = skip check |
There was a problem hiding this comment.
probably should add optional UID precodition here
message Preconditions {
int64 version = 1; 0 = do not compare
string uid = 2; "" = do not compare
}
There was a problem hiding this comment.
Not sure how strong you feel about this. In the current version I'm proposing top level fields for this.
There was a problem hiding this comment.
I think we want the same preconditions available for all resources, right? So we will either dedup them in a struct or embed them in every DeleteFooRequest -- struct seems better (same for dry run an other CreateOptions, etc)?
|
Not exactly related to style/convention, but something we might consider: a problem in k8s is "history". I sometimes want to delete something but later find it is useful to cross-reference. A long time back we talked about having a "hardlink" model where you could read by UID or by name. When you deleted an object it could still be found by its uid for X time. Is that something we want to think about? |
* Reword what a valid resource name is. * Remove authz related sections. * Remove confusing text about renaming resources.
6c77820 to
f50884f
Compare
f50884f to
b992e07
Compare
| ```proto | ||
| message Actor { | ||
| // The atespace this actor belongs to. | ||
| ResourceMetadata meta = 1; |
There was a problem hiding this comment.
super nitpick - "metadata" to be like k8s?
|
|
||
|
|
||
| message ResourceMetadata { | ||
| // The atespace this resource belongs to. |
There was a problem hiding this comment.
Comment that this is "" for global resources?
| ### 2.3 Character constraints | ||
|
|
||
| Both `atespace` and `name` must be valid resource names. | ||
| A valid resource name must comply the following rules: |
There was a problem hiding this comment.
"comply with"
|
|
||
| message DeleteActorRequest { | ||
| ObjectRef actor = 1; | ||
| int64 version = 2; |
There was a problem hiding this comment.
I would leave the version out here and just add // ...other fields, otherwise it is confusing
|
|
||
| - Do not embed the full resource message as a reference field. | ||
| - Do not use a single combined string like `"atespace/name"`. Callers would have to parse it. | ||
| - Do not use a plain `string {resource}_name` field for global-scoped references — use `ObjectRef` for consistency and type safety. |
There was a problem hiding this comment.
Question - often refs will be within the same atespace. Are we expected to fill in the atespace field or leave it empty and validate that it is not specified?
| - RPC name **must** begin with `Delete` followed by the singular resource name. | ||
| - Response **must** be the deleted resource. | ||
| - Request **must** identify the resource with an `ObjectRef` field (for both atespace-scoped and global-scoped resources). | ||
| - If the resource does not exist: return `NOT_FOUND`. |
There was a problem hiding this comment.
This makes Delete non-idempotent. Do we think this matters?
| - RPC name **must** begin with `Update` followed by the singular resource name. | ||
| - Response **must** be the resource itself — not an `UpdateActorResponse` wrapper. | ||
| - `update_mask` **must** be of type `google.protobuf.FieldMask` and **must** be named `update_mask`. | ||
| - `update_mask` is **required**. An absent or empty mask **must** return `INVALID_ARGUMENT`. |
There was a problem hiding this comment.
What happens if output-only fields are included in the mask? Ignored or error?
| // Immutable throughout the lifecycle of the resource. | ||
| string uid = 3; | ||
|
|
||
| // version is incremented on every mutation. |
There was a problem hiding this comment.
Can we say "increased" rather than incremented? That leaves more room
| google.protobuf.Timestamp create_time = 5; | ||
|
|
||
| // update_time is the time the resource was last updated by a user action. | ||
| google.protobuf.Timestamp update_time = 6; |
There was a problem hiding this comment.
I assume delete_time comes when we have soft delete?
|
|
||
| *Inspired by [AIP-154](https://google.aip.dev/154). Diverges in field name and type.* | ||
|
|
||
| When two clients update the same resource concurrently, the second write may silently overwrite the first. Freshness validation lets a client prove it is operating on the state it thinks it is, so the server can reject stale writes. |
There was a problem hiding this comment.
a) Should we add UID preconditions now, later, or never?
b) Should we require any preconditions for any ops or is it totally up to users?
Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate, WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes CRDs), following the draft API style guide (agent-substrate#351), plus design rationale and CRD migration notes. For review only: codegen and server implementation intentionally lag the proto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate, WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes CRDs), following the draft API style guide (agent-substrate#351), plus design rationale and CRD migration notes. For review only: codegen and server implementation intentionally lag the proto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by-:
Draft the target ateapi surface for issue agent-substrate#368 (ActorTemplate, WorkerPool, SandboxConfig, WorkerPoolGrant moved out of Kubernetes CRDs), following the draft API style guide (agent-substrate#351), plus design rationale and CRD migration notes. For review only: codegen and server implementation intentionally lag the proto.
Add an initial version of a style guide for the substrate public API. It uses Google's AIP as a starting point, and clarified when we intentionally deviate from it.