V1.1 early draft preview #151

Draft
pavel-kirienko wants to merge 15 commits into master from v1.1-draft

@pavel-kirienko

Member

I wrote this today with heavy assistance from Opus. The content was generally generated based on the formal models and the reference implementation, both of which can be found in https://github.com/OpenCyphal-Garage/cy. That is the source of truth, while the specification is a mere compilation.

Content-wise this is nearly complete, but there are many stylistic issues that I have intentionally left unresolved for now. Please focus on the idea first of all and ignore the style.

The essential changes are:

  • The DSDL and application layer chapters are completely removed. This necessitated a few collateral changes throughout. I have created a new repository for the upcoming DSDL spec extraction, but I am not planning to work on it anytime soon; help is welcome: https://github.com/OpenCyphal/dsdl_specification

  • The transport layer chapter was reworked to remove most of the transport-agnostic concepts, because they are not transport-agnostic anymore but rather CAN-specific. The CAN chapter therefore may seem heavily edited but it is actually not; the only substantive changes are the addition of the new 16-bit subject-ID CAN ID layout and notes on backward compatibility.

  • The Cyphal/UDP spec has been updated to match the new proposed UDP transport design. The old UDP transport is removed for now, but as we discussed at the last call, we may elect to keep it under a different name. I prefer to remove the old one completely; if we keep it, someone must undertake to maintain it.

  • A completely new Session layer chapter is added.

  • Many auxiliary materials and design rationale are provided in the appendices, which are mostly derived from the formal models in the Cy repo.

I do not recommend looking at the diff. Download the built PDF from the CI and read that instead. Treat it as a new document and ignore the delta from v1.0 for now.


\***********************************************************************************************************************
\* Subject-ID mapping function. The ring size is the total number of distinct subject-IDs.
\* TODO: Switch to quadratic probing: https://github.com/OpenCyphal-Garage/cy/issues/12

Member

We need to resolve this

Member Author

It's a stale comment that should be removed. The model uses linear probing because it is simpler to reason about in the formal verification context. The actual implementation uses quadratic probing, and this difference is believed to be insignificant for the purposes of the consensus protocol, as noted in the adjacent appendix "Convergence proof for the allocation CRDT", under "Instantiating A3/A4 for common probing laws".
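
For readers comparing the two probing laws, here is a minimal sketch. It assumes the ring position is a topic hash plus an offset that grows with the eviction counter, taken modulo the ring size. The function names and the exact offset formulas are illustrative assumptions, not the formal model or the reference implementation.

```python
def probe_linear(topic_hash: int, evictions: int, ring_size: int) -> int:
    """Ring position under linear probing: the offset grows by one per eviction."""
    return (topic_hash + evictions) % ring_size


def probe_quadratic(topic_hash: int, evictions: int, ring_size: int) -> int:
    """Ring position under quadratic probing: the offset grows as evictions squared.

    The exact quadratic law used by the implementation is not quoted in this
    thread; evictions**2 is a placeholder for illustration.
    """
    return (topic_hash + evictions * evictions) % ring_size
```

Both laws agree when the eviction counter is zero and diverge only after collisions, which is why the choice mostly affects how colliding topics spread out rather than the consensus logic itself.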

\item ``Consistent Overhead Byte Stuffing'', Stuart Cheshire and Mary Baker.
\item Extracted the DSDL specification into a separate document.
\item Introduced named topics and the session layer, along with the distributed consensus algorithm.
\item Stabilized the experimental Cyphal/UDP transport.

Member

We've broken it. We need to be clear that this is a non-backwards-compatible change but one we consider finalised. As discussed in the devcall, it's important that we produce an appendix about the UDP v1.0 -> v1.1 upgrade path and interoperability. We also need to produce a rationale for the breaking change.

@@ -1,232 +0,0 @@
\section{Cyphal/serial (experimental)}\label{sec:transport_serial}

Member

Why not carry this forward while remaining experimental? If we did create a full version of this in v1.x, we wouldn't want a gap where the specification had a serial protocol, dropped it, then introduced another one. It would be better to have had one experimental protocol that we finally got around to promoting.

Member Author

I removed it because it carries the same unwanted features we had in the old UDP transport, and at this point I chose not to define a new serial transport, chiefly because I don't have an immediate use case for it and I don't know anyone who does. I don't think it has seen any significant usage in the field; otherwise we would probably have heard about it on the forum or elsewhere.

For example, the pattern \texttt{sensors/*/data} matches \texttt{sensors/imu/data}
and \texttt{sensors/gps/data} but not \texttt{sensors/imu/raw/data}.

\item[\texttt{>} (ASCII $62$)] Matches zero or more trailing segments.

Member

Claude flagged this and I agree we should clarify it so anyone using LLMs to generate parsers doesn't get this wrong:

Pattern-token boundary semantics for > are under-specified.
§2.3.2 says > matches "zero or more trailing segments" and gives the example that sensors/> matches sensors. After normalization (which "trims leading and trailing separators"), sensors/> does not get its / trimmed because > is not a separator, but matching a name with zero trailing segments against a pattern containing /> is only correct if the slash is treated as elidable when zero segments follow. The spec should explicitly state how the slash before > interacts with zero matches (currently inferred only from the example).
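
To make the ambiguity concrete, here is a minimal matcher sketch that takes the "elidable slash" reading. It assumes that names and patterns are split on `/` with empty segments discarded (standing in for normalization), that `*` matches exactly one segment, and that `>` is only valid as the final token. The function name and the handling of a non-final `>` are assumptions for illustration, not spec text.

```python
def pattern_matches(pattern: str, name: str) -> bool:
    """Return True if a normalized topic name matches the pattern."""
    # Splitting into segments makes the separator before '>' disappear,
    # which is exactly the "elidable slash" interpretation.
    p = [seg for seg in pattern.split("/") if seg]
    n = [seg for seg in name.split("/") if seg]
    for i, token in enumerate(p):
        if token == ">":
            # Zero or more trailing segments; assumed valid only as the last token.
            return i == len(p) - 1
        if i >= len(n):
            return False  # the name ran out of segments before the pattern did
        if token != "*" and token != n[i]:
            return False  # literal segment mismatch
    return len(n) == len(p)  # no '>' present: segment counts must agree
```

Under this reading `pattern_matches("sensors/>", "sensors")` is true because the pattern is compared segment by segment, so no literal slash has to be present in the name.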


Every session-layer message begins with a fixed-size header of 24 bytes,
followed by a type-specific payload. Multi-byte integer fields are encoded in
little-endian byte order and are positioned to favour natural alignment on

Member

This is an LLM optimization as Claude panicked when it saw a uint48 that was not positioned on a natural alignment:

The spec could spell out this rationale once: "8-byte fields are placed on 8-byte boundaries; smaller integer fields are packed without further alignment." That would close off readers like me who try to extrapolate "favour" into something stronger than it is.

@pavel-kirienko pavel-kirienko May 22, 2026

Member Author

Wording issue. What I actually meant when I wrote this in the source document (that Claude later transformed into this spec) is that the layout is easy to represent in a C structure without it being packed. I think it's best to just say it directly.

The idea is that sequences like uint48 followed by uint16 are represented as a single uint64_t. Now that I write this, I realize that this is, indeed, packing, but it is manual and predictable, unlike __attribute__((packed)) et al.
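
A minimal sketch of that manual packing, assuming little-endian encoding as stated in the quoted spec text, so that the uint48 that comes first on the wire occupies the low 48 bits of the 64-bit word. The function and parameter names are hypothetical.

```python
import struct

U48_MASK = (1 << 48) - 1


def pack_u48_u16(first_u48: int, second_u16: int) -> bytes:
    """Serialize a uint48 followed by a uint16 as one little-endian uint64."""
    if not 0 <= first_u48 <= U48_MASK or not 0 <= second_u16 <= 0xFFFF:
        raise ValueError("field out of range")
    # The first field on the wire lands in the least significant bits.
    return struct.pack("<Q", (second_u16 << 48) | first_u48)


def unpack_u48_u16(data: bytes) -> tuple[int, int]:
    """Inverse of pack_u48_u16: split one uint64 back into (uint48, uint16)."""
    (word,) = struct.unpack("<Q", data)
    return word & U48_MASK, word >> 48
```

The same shift-and-mask approach carries over to a C structure holding a plain `uint64_t` member, which is why no compiler-specific packing attribute is needed.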

\input{introduction/introduction.tex}
\input{basic/basic.tex}
\input{dsdl/dsdl.tex}
\input{session/session.tex}

Member

Shouldn't transport come first? (bottom up?)

Member Author

I'm fine either way but I think the session layer is the main character of this document so there might be some value in keeping it up-front?

\begin{minted}{python}
uint8 type # 2 for msg_ack, 3 for msg_nack.
void24
uint32 incompatibility

Member

Why is incompatibility uint32 here and uint8 in the session_header_msg?

Member Author

Eh it just depends on the number of unused bits. More bits --> wider incompatibility field. No other reasoning.


\subsection{Pinned topics}

Pinned topics are ordinary CRDT records with their eviction counter pre-set to

Member

This seems to make V1.0 interop impossible, right?

Member Author

I should expand a bit on this topic. They are ordinary CRDT records, but the consensus protocol is not required for a consensus to be reached, because it can be enforced by configuring pinning on all nodes manually. I will fix this when I'm back to the spec.

@pavel-kirienko

Member Author

Check v1.0 interop description

@thirtytwobits

thirtytwobits commented Jun 19, 2026

Member

TSN Prototype

I propose we retain this as a draft but add a caveat in the draft that the UDP protocol is under active development and breaking changes are expected before 1.1 is approved.

I furthermore propose that we undertake a UDP/TSN prototype that proves out deployment of Cyphal UDP configured for the following TSN protocols:

  • IEEE 802.1CB — Frame Replication and Elimination for Reliability (FRER): Provides redundant transmission of identified streams over multiple paths and eliminates duplicate frames at the receiver to improve reliability.
  • IEEE 802.1AS — Timing and Synchronization for Time-Sensitive Applications: Defines the generalized PTP/gPTP timing profile used to synchronize clocks across TSN-capable bridges and end stations.
  • IEEE 802.1Qav — Forwarding and Queuing Enhancements for Time-Sensitive Streams: Defines credit-based shaping for AVB/TSN traffic classes to provide bounded-latency forwarding without strict time slots.
  • IEEE 802.1Qbv/Qbu — Enhancements for Scheduled Traffic / Frame Preemption: 802.1Qbv defines time-aware scheduled gates for deterministic transmission windows, while 802.1Qbu allows high-priority express frames to preempt lower-priority preemptable frames to reduce worst-case latency.
  • IEEE 802.1Qci — Per-Stream Filtering and Policing: Adds ingress per-stream filtering, gating, and policing so that misbehaving or out-of-contract streams can be constrained before they disturb the TSN schedule.
  • IEEE 802.1Qch — Cyclic Queuing and Forwarding: Defines cyclic buffering/forwarding behavior where traffic advances through the network in synchronized cycles, giving deterministic latency under suitable timing assumptions.
  • IEEE 802.1Qcr — Asynchronous Traffic Shaping: Defines ATS, a per-flow shaping mechanism intended to provide bounded latency without requiring globally synchronized scheduled transmission windows.
  • IEEE 802.1Qat — Stream Reservation Protocol (SRP): Defines the original distributed reservation protocol for registering streams and reserving bridge resources along their paths; note that 802.1Qcc later specifies SRP enhancements and performance improvements.

Additionally, we should build a 10BASE-T1S prototype and publish a guide for configuring Cyphal 1.1 for this topology. Given the expectation that 10BASE-T1S multidrop is positioned as a successor to CAN, this should be a priority for our project.

Background

Here's Claude's take on some of the issues we need to prototype, verify, and adapt to:

Summary of TSN Issues With the Proposed Protocol

After FRER, gPTP, CBS, PSFP, and SRP, the v1.1 protocol gaps cluster cleanly:

  • Configuration enablers (needed by all five): PCP mapping, deterministic unicast addressing, bounded receiver state.
  • Active misbehavior under TSN conditions: reliable retransmission against a flow meter (PSFP), uncontrolled scout responses against any policer or attribute table (PSFP and SRP), uncontrolled pattern-match scope (SRP).
  • Design-level mismatch (new with SRP): the CRDT's dynamic subject-ID allocation fights SRP's stable-reservation model. Pinning is the intended escape hatch, but §2.5.6's "request, not guarantee" semantics defeats it.

The first set is incremental documentation and small protocol additions. The second set is bug-class — the protocol does the wrong thing under expected TSN conditions and should be fixed. The third set is the first design tension where v1.1 may need to choose: either commit to a "pinned subjects are stable contracts" model that fully supports SRP, or commit to "dynamic allocation always wins" and tell integrators to use centralized CNC for any TSN deployment. Both are defensible answers, but the spec needs to pick one.


Protocol-level changes that would make v1.1 Cyphal/UDP fundamentally friendlier to 802.1CB

Five candidate changes, in roughly decreasing order of justification.

1. Restore a deterministic unicast endpoint (or make UID-derived)

This is the single biggest regression. v1.0 used 239.1.0.<dst-node-ID> for service unicast; an integrator could pre-provision a FRER stream per known destination. v1.1 §3.4.2.2 replaces this with "address the recipient's UDP/IP endpoint as observed on an earlier datagram." Practical consequence for 802.1CB:

  • A bridge cannot pre-install a Sequence Identification + Sequence Recovery rule for a peer until that peer has emitted observable traffic. First-contact unicast cannot be FRER-protected.
  • FRER stream-identifier tables would have to be reconfigured at runtime as new endpoints are observed, which is the opposite of how TSN configuration is normally done (static, validated at integration time).
  • After a peer's MAC/IP changes (the very migration capability the new model is designed to support), every bridge's sequence-recovery state for that peer's streams becomes stale.

A protocol-level fix would be a deterministic mapping from sender_uid (or a 16-bit hash of it) to a unicast multicast group within the reserved range, e.g. 239.<reserved-octet>.<UID-low-bits>, used for first contact and as a stable fallback. The existing "learn endpoint and switch to native unicast" optimization can stay, but predictable FRER pre-provisioning needs a deterministic name.
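
A minimal sketch of such a deterministic mapping, assuming the low 16 bits of sender_uid fill the last two octets of the group address. The value of the reserved octet is a placeholder and the function name is hypothetical; neither comes from the draft.

```python
import ipaddress


def uid_to_fallback_group(sender_uid: int, reserved_octet: int = 1) -> ipaddress.IPv4Address:
    """Map a UID to 239.<reserved_octet>.<UID low 16 bits> (illustrative only)."""
    if not 0 <= reserved_octet <= 0xFF:
        raise ValueError("reserved_octet must fit in one byte")
    low16 = sender_uid & 0xFFFF  # distinct UIDs may collide on these bits
    return ipaddress.IPv4Address((239 << 24) | (reserved_octet << 16) | low16)
```

Because only 16 bits survive, two nodes can share a group; an integrator relying on this for FRER pre-provisioning would still need to check the deployed UIDs for collisions.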

2. Specify 802.1Q PCP mapping alongside the DSCP mapping

§3.4.3 mandates a DSCP mapping but says nothing about 802.1Q Priority Code Point (PCP). TSN traffic shaping (Credit-Based Shaper §8.6.8.2, Time-Aware Shaper §8.6.8.4, and the Asynchronous Traffic Shaper) operates on PCP, not DSCP. Without a normative PCP recommendation, two interoperating vendors will land on different traffic-class assignments and FRER+CBS/TAS will deliver inconsistent latency budgets.

Add a Table 3.7 with the recommended Cyphal-priority→PCP mapping (probably PCP = priority, since both are 3-bit and ordered the same way), and note that integrators using TSN scheduling shall configure both consistently.

3. Carve the auto-allocated subject-ID range into redundancy/QoS classes

Today the auto-allocator (§2.4.2, §2.5.4) places every non-pinned topic uniformly in [8192, 8191+M]. A FRER bridge that wants to protect "all critical control topics" but not best-effort telemetry has to enumerate individual multicast MAC addresses — up to tens of thousands of them — because there's no structural separation in the subject-ID space.

Three options, in increasing protocol intrusiveness:

  • Sub-ranges via topic-name prefix convention. Reserve a high-order portion of the auto-allocated range for topics whose name carries a recognized prefix (e.g., crit/...). The allocator constrains those topics into the upper sub-range; bridges can write one FRER rule per sub-range.
  • Allocation-class as a CRDT field. Add a 2-bit class to the topic record (§2.5.1) that constrains which sub-range the topic is allowed to occupy. Increases CRDT state and gossip size; affects the convergence proof in App. C.
  • Multiple moduli partitioning the multicast space. Define M_critical, M_normal, etc. The protocol pin pattern (§2.3.3) already does something analogous for explicit subject-IDs; generalizing for auto-allocation lets bridges write coarse FRER rules.

The first option is the least invasive and probably sufficient.

4. Bound the replay-cache state per sender_uid

§3.4.6 swaps the v1.0 transfer-ID timeout for a randomized 48-bit monotonic transfer_id. The receiver-side rule that follows from this is "deduplicate by (sender_uid, transfer_id) and trust the sender never to reuse a value" — but the spec doesn't say how large that deduplication state may grow, how it ages out, or what the receiver should do on overflow.

802.1CB's Sequence Recovery has an explicit, bounded RecovWindow (typically 64 entries). If Cyphal's replay cache is unbounded, FRER's bounded model is the weaker link in the chain and you can't reason about end-to-end memory or worst-case latency. Add a normative or recommended bound: e.g., "receivers shall track at least the highest accepted transfer_id per sender_uid and may discard frames whose id falls more than 2^32 below it; the per-sender state shall age out after a recommended T_session of N seconds."
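
A minimal sketch of what a bounded, aging replay cache could look like. The window of recent transfer-IDs per sender and the session timeout are hypothetical parameters chosen for illustration; the draft specifies neither, and this is not the reference implementation's deduplication logic.

```python
from collections import OrderedDict


class ReplayCache:
    """Deduplicate by (sender_uid, transfer_id) with bounded per-sender state."""

    def __init__(self, window: int = 64, session_timeout: float = 30.0):
        self.window = window                    # max remembered IDs per sender
        self.session_timeout = session_timeout  # seconds of silence before forgetting
        self._senders = {}                      # sender_uid -> (last_seen, recent IDs)

    def accept(self, sender_uid: int, transfer_id: int, now: float) -> bool:
        """Return True for a new transfer, False for a remembered duplicate."""
        # Age out senders that have been silent for longer than the timeout.
        stale = [uid for uid, (seen, _) in self._senders.items()
                 if now - seen > self.session_timeout]
        for uid in stale:
            del self._senders[uid]

        _, recent = self._senders.get(sender_uid, (now, OrderedDict()))
        duplicate = transfer_id in recent
        if not duplicate:
            recent[transfer_id] = None
            while len(recent) > self.window:
                recent.popitem(last=False)      # forget the oldest ID on overflow
        self._senders[sender_uid] = (now, recent)
        return not duplicate

    def sender_count(self) -> int:
        return len(self._senders)
```

The trade-off is visible in the overflow path: once an ID falls out of the window, a very late duplicate would be accepted again, which is the kind of behaviour a normative bound would have to pin down.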

This is independently useful for embedded receivers, but the absence is more glaring once you put FRER underneath because the protocol then has two different recovery-window models with no relationship to each other.

5. Replace "occasional reordering tolerance" with an explicit reorder bound

§3.4.4 says receivers "shall cope with [out-of-order and interleaving] both conditions" but doesn't bound how far out of order frames may arrive. FRER's sequence-recovery output can include modest reordering (configurable per RecovWindow size, sometimes ~30+ positions). If a Cyphal receiver implementation chooses a small reassembly window, FRER-delivered frames may be dropped that the bridge considered valid.

Specify a minimum reassembly window depth in frames (or time) that receivers must support per (sender_uid, transfer_id). Pick a number that comfortably exceeds typical FRER RecovWindow defaults. This makes "Cyphal/UDP over FRER" a checkable interoperability claim rather than a hope.

Lower-priority changes I'd flag but not push for:

  • A priority field at fixed L3/L4 location that an IP-SID bridge can match without parsing the Cyphal header — already achievable via DSCP, so probably redundant.
  • Surfacing sender_uid in an L3 header field for L2 stream filtering — would require a wire-format break for a feature only some integrators want; not worth it.
  • A protocol-level periodic-heartbeat requirement on critical subjects to keep FRER stream state warm — better left as a deployment recommendation in §2.14 than as a protocol obligation, because it would impose traffic on resource-constrained nodes that don't need FRER.

If only one change makes it into v1.1-beta, make it #1 (deterministic unicast endpoint). Everything else is tractable at deployment time; the loss of a predictable unicast address is the only change that genuinely closes off a TSN deployment pattern that worked in v1.0.


Cyphal/UDP v1.1 and 802.1AS

Cyphal/UDP v1.1 and 802.1AS are essentially orthogonal at the wire level. 802.1AS rides L2 with EtherType 0x88F7 and reserved PTP multicast MACs (01-80-C2-00-00-0E); Cyphal/UDP rides L4 over IPv4 multicast in 239.0.0.0/9. They share no header bytes, no MAC addresses, no IP groups, no UDP ports. Nothing in v1.1 prevents 802.1AS from running on the same fabric.

That said, three things are worth surfacing.

  1. v1.1 removed the only built-in time-sync mechanism — so 802.1AS is now the default answer, but the spec doesn't say so
    v1.0 had uavcan.time (Synchronization, SynchronizedTimestamp, TAIInfo, TimeSystem, GetSynchronizationMasterInfo) — an in-band, Cyphal-native time distribution protocol. v1.1 deleted the entire Application Layer chapter, including all of uavcan.time. Applications that need a shared clock now have no in-document option.

This isn't a conflict with 802.1AS — it's an implicit upgrade: gPTP is now the natural answer for synchronized time on a Cyphal/UDP v1.1 network. But the spec is silent on it. A one-line normative note in §3.4 — "Cyphal/UDP does not provide time synchronization; deployments requiring a shared clock should use IEEE 802.1AS or equivalent" — would close the obvious question that a v1.0 reader will ask.

  2. CRDT age must come from a monotonic local clock, not gPTP-disciplined wall time
    §2.5.2 says "age is a local real or logical clock counting roughly seconds since the topic was first seen on the network." A naive implementation might read wall-clock seconds. If wall time is disciplined by gPTP (or NTP, or any external sync), a Grandmaster changeover can step the clock backward by milliseconds to seconds. That would drive lage = floor(log2(age)) to bin-flip during a quiescent stability window, violating assumption A9 (stability window) in §2.5.7 and breaking the convergence guarantee for the duration of the perturbation.

The fix is purely a clarification: state that age_seconds shall be derived from a strictly-monotonic, never-stepped local time source (e.g., CLOCK_MONOTONIC on POSIX, free-running tick counter on bare metal). The protocol bits don't change.
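
A minimal sketch of deriving the age from a monotonic source and computing lage = floor(log2(age)) from it. The class and function names are hypothetical, the value returned for an age of zero is an assumption, and the sketch ignores any merging of age values received from peers.

```python
import time


def lage(age_seconds: int) -> int:
    """floor(log2(age_seconds)); returns -1 for an age of zero (assumption)."""
    if age_seconds < 0:
        raise ValueError("age cannot be negative")
    return int(age_seconds).bit_length() - 1


class TopicAge:
    """Seconds since a topic was first seen, measured on a never-stepped clock."""

    def __init__(self, clock=time.monotonic):
        # time.monotonic() cannot go backwards, unlike time.time(), which
        # follows wall time and may be stepped by gPTP or NTP discipline.
        self._clock = clock
        self._first_seen = clock()

    def seconds(self) -> int:
        return int(self._clock() - self._first_seen)
```

Injecting the clock as a parameter keeps the sketch testable and makes the choice of time source explicit at the call site.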

  3. §3.4.7 "minimal stack" advice should not be read as "drop non-IP traffic"
    §3.4.7 tells real-time/resource-constrained systems they may omit IP fragmentation and ICMP, and recommends suppressing ICMP at the equipment level. A literal-minded integrator implementing a "Cyphal-only" filter could end up dropping PTP frames too — they're non-IP, never come up on a UDP socket, and look superficially like "other traffic" to anyone whose mental model is "UDP/9382 + ARP/IGMP, drop everything else."

A clarifying sentence in §3.4.7 ("This guidance pertains to the IP stack only; L2 protocols such as IEEE 802.1AS, LLDP, and SRP shall be left intact where their services are required.") would prevent that mistake without changing any protocol behavior.

Small things, in case they come up

Transfer-ID seed entropy (§3.4.6). The recommended seed sources include "the current value of an RTC combined with a persistent counter." If the RTC is gPTP-disciplined, two co-located nodes powered up simultaneously may seed transfer-ID counters from near-identical RTC values. Other entropy sources listed (uninitialized SRAM, ADC noise, persistent counter) already protect against this in practice, but worth being explicit that the time component alone is insufficient on a synchronized network.

PCP allocation. Picking up the PCP-mapping recommendation I flagged earlier: whichever PCP table Cyphal recommends should leave the TSN-reserved high classes (typically PCP 6 and 7 for gPTP/network control per AVnu and IEEE 802.1Q TSN profiles) untouched. The pragmatic mapping is Cyphal priority ∈ {0..5} → PCP {3..0,?} or similar, never PCP 6/7.

Hardware timestamping contention. Some NICs share a single hardware timestamping unit between PTP and arbitrary frame timestamping. If integrators want to timestamp Cyphal frame ingress/egress with the gPTP-disciplined clock, NIC selection matters. Not a protocol concern; deployment guidance at most.

Conclusion

Nothing in v1.1 makes 802.1AS support harder than it was in v1.0 — and in one respect easier, because v1.1 has fewer competing notions of time on the wire. The only real protocol-level adjustment worth making is the implementation guidance about age's clock source (#2 above); the rest is documentation.


Cyphal/UDP v1.1 and IEEE 802.1Qav (Credit-Based Shaper)

Works in principle; the same PCP gap I flagged for 802.1CB is the central blocker, and a couple of v1.1 mechanisms produce burst patterns that CBS sizing needs to account for.

What works cleanly

CBS shapes per-PCP egress queues using idleSlope/sendSlope credits, with the latency math driven by MaxFrameSize and the burst profile per class. The v1.1 traffic model fits this:

  • The "all frames except the last shall carry the same amount of frame payload" rule (§3.4.5) gives bridges a predictable MaxFrameSize per stream.
  • Deduplicated, out-of-order-tolerant receivers (§3.4.4) absorb the moderate jitter CBS introduces.
  • Fixed-size session header (24 B) plus fixed-size lower-layer headers (32 B Cyphal + 8 B UDP + 20 B IP) means frame overhead is constant across a class.
  • DSCP is already mapped (§3.4.3), so CBS-classified frames travel with a coherent L3 priority too.

The blocker (same as the FRER answer)

§3.4.3 specifies DSCP only, not PCP. CBS lives in the bridge's L2 egress queue and selects queue by 802.1Q PCP. Without a normative PCP mapping in the spec, two independent vendors will end up putting Cyphal traffic into different CBS classes, and integrators have no canonical way to say "all Cyphal nominal-priority traffic shall go to SR-B."

Same fix as before: add a recommended Cyphal-priority → PCP table, leaving PCP 6 and 7 free for 802.1AS / network control. Once that exists, CBS becomes a straightforward integration exercise.

Burst patterns the spec should document for CBS sizing

CBS provisioning is driven by worst-case burst, not average rate. Three v1.1 mechanisms produce bursts that are easy to under-estimate:

  1. Urgent gossip (§2.6.4). When m_u nodes observe the same CRDT conflict, the expected number of urgent emissions is 1 + (m_u − 1)·δ/W_u. For default W_u ≈ 10 ms and δ ≈ 1 ms, this is a small handful of broadcast-subject frames within a millisecond window across the network. CBS sizing for the broadcast-subject class needs to cover this worst case, not the steady-state ~5 s gossip period.

  2. Reliable retransmission fallback (§2.10.3). After an initial multicast msg_rel, the publisher may fall back to N unicast retransmissions to holdouts. The exponential backoff bounds time spread, but the first round of N unicasts can land back-to-back. If reliable traffic shares a CBS class with other unicast traffic from the same talker, the class's MaxIntervalFrames has to accommodate N (where N is the worst-case association set size).

  3. Multi-frame RPC streaming responses (§2.12.2). A responder can emit an unbounded stream of rsp_be/rsp_rel frames per request, each with incrementing seqno. Large structured responses get fragmented into MTU-sized frames; with the equal-size rule (§3.4.5) this is ceil(size/MTU) frames in rapid succession. CBS sizing has to assume bursts at the transfer level, not the per-frame level.
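
The two closed-form quantities in the list above can be evaluated directly. This is a minimal sketch using the formulas as quoted; the function names and the example inputs in the usage note are illustrative assumptions, not measured values.

```python
def expected_urgent_emissions(m_u: int, delta_s: float, w_u_s: float) -> float:
    """Expected urgent gossip emissions: 1 + (m_u - 1) * delta / W_u."""
    return 1.0 + (m_u - 1) * delta_s / w_u_s


def frames_per_transfer(size_bytes: int, mtu_payload_bytes: int) -> int:
    """Frames in one transfer: ceil(size / MTU), with at least one frame."""
    if mtu_payload_bytes <= 0:
        raise ValueError("MTU payload must be positive")
    return max(1, -(-size_bytes // mtu_payload_bytes))  # integer ceiling division
```

For example, with 21 nodes observing the same conflict, δ = 1 ms and W_u = 10 ms, the first function gives about 3 emissions, and a 10000-byte response with 1400 bytes of payload per frame yields 8 frames in one burst.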

None of these are broken under CBS — they're just under-documented. A non-normative §3.4 subsection enumerating "session-layer behaviours that produce frame bursts" with rough worst-case formulas would let TSN integrators provision shaper classes without reverse-engineering the spec.

One subtle interaction worth fixing in the protocol itself

§3.4.6 says randomized 48-bit transfer_id "obviates the need for a transfer-ID timeout at the receiver." But the spec is silent on reassembly-state aging for in-flight multi-frame transfers. Under CBS-induced per-hop jitter (especially on a heavily-loaded SR-B class), a multi-frame transfer's frames can arrive over tens to hundreds of milliseconds. With no specified upper bound on how long a receiver waits for missing frames, two failure modes appear:

  • A receiver that picks too short a reassembly hold-time drops transfers that CBS legitimately delayed.
  • A receiver that picks too long (or no) bound accumulates partial-transfer state indefinitely from any peer that crashes mid-transfer.

This is the same gap I flagged for FRER under "bound the replay cache," but for the reassembly side rather than the duplicate-suppression side. Add a normative or recommended per-(sender_uid, transfer_id) reassembly hold-time bound, scoped to comfortably exceed worst-case CBS jitter plus retransmission. Without it, CBS deployments will produce sporadic, hard-to-diagnose transfer drops.
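
A minimal sketch of a reassembler with a bounded hold time. The frame fields used here (a frame index and a total frame count) are hypothetical stand-ins that do not reflect the actual Cyphal/UDP frame header, and the hold time is an arbitrary parameter rather than a value from the draft.

```python
class Reassembler:
    """Collect frames per (sender_uid, transfer_id); drop stale partial state."""

    def __init__(self, hold_time: float):
        self.hold_time = hold_time  # seconds a partial transfer may wait
        self._partial = {}          # key -> (first_seen, {index: payload})

    def on_frame(self, sender_uid, transfer_id, index, total, payload, now):
        """Return the full payload once all frames arrived, else None."""
        self._expire(now)
        key = (sender_uid, transfer_id)
        _, frames = self._partial.setdefault(key, (now, {}))
        frames[index] = payload     # a duplicate frame simply overwrites itself
        if len(frames) == total:
            del self._partial[key]
            return b"".join(frames[i] for i in range(total))
        return None

    def _expire(self, now):
        # Forget transfers whose first frame is older than the hold time.
        stale = [k for k, (t, _) in self._partial.items() if now - t > self.hold_time]
        for k in stale:
            del self._partial[k]

    def pending(self) -> int:
        return len(self._partial)
```

Choosing `hold_time` is exactly the open question raised above: too short and shaper-delayed transfers are lost, too long and state from crashed peers lingers.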

What would make v1.1 first-class CBS-friendly

In priority order:

  1. Recommended PCP mapping (table parallel to §3.4.3's DSCP table).
  2. Bounded reassembly hold-time in §3.4.4 or §3.4.5.
  3. Per-subject class hint so an application can declare "this subject is critical" and have the resulting PCP set consistently. Today the only mechanism is the integrator manually configuring socket priority per published subject — workable but error-prone. The simplest version is documentation: "Pinned topics (§2.3.3) are the recommended way to attach a subject to a TSN-shaped class because they give deterministic subject-IDs that can be enumerated in CNC stream configurations."
  4. Burst characterization paragraph (non-normative) covering urgent gossip, reliable retransmission, and streaming-response worst cases.

Direct conflicts? None

Nothing in v1.1 actively breaks CBS. The MTU equal-size rule, the deduplication model, randomized monotonic transfer-IDs, and the deferred-ACK reliability scheme all play well with credit-based shaping. The protocol is CBS-compatible; the spec is CBS-uninformed. Fixing the PCP gap and the reassembly-bound gap closes most of the distance.


Cyphal/UDP v1.1 and IEEE 802.1Qci (Per-Stream Filtering and Policing)

PSFP is the first TSN component that has two genuine protocol-level conflicts with v1.1, on top of the recurring PCP/identification gap. Both deserve fixes in the spec, not just deployment notes.

What works

PSFP's Stream Filter and Stream Identification reuse the same machinery as 802.1CB. Cyphal's per-subject multicast groups (§3.4.2.1) give natural per-stream identification, and the fixed UDP destination port 9382 makes 802.1CBdb's IP-SID variant straightforward. The bridge has everything it needs to attach a filter to a specific subject-stream.

The Stream Gate (time-windowed admission) is a non-issue for v1.1 in the negative sense: it's irrelevant. Cyphal's control plane (gossip, scout, RPC) is event-driven and aperiodic by design. Application subjects that need gated admission run TAS underneath them; PSFP gates inherit from that, and Cyphal doesn't have to participate.

Two genuine protocol conflicts

Conflict 1: Reliable retransmission can sustain a feedback loop against a flow meter

§2.10.3's exponential-backoff retransmission has no signal for distinguishing "frame lost on the wire" from "frame dropped by an ingress policer." Under a misconfigured PSFP Flow Meter, the loop runs as follows:

  1. Publisher sends msg_rel. Policer drops it.
  2. ACK baseline timeout (~16 ms) expires; publisher retransmits.
  3. Policer drops it again.
  4. Timeout doubles; publisher retransmits.
  5. ...until either the publication deadline expires or slack > 2 ages the association out (§2.10.1).

The doubling backoff limits long-term amplification, but during the first ~5-10 retransmissions the publisher is hammering the meter with the same frame and contributing to the very congestion the meter is protecting against. There is no path by which the publisher can learn it is being policed.

This is worse than ordinary packet loss because PSFP drops are deterministic — every retransmission of the same stream meets the same fate, so backoff doesn't probe an improving condition. The publication will reach DONE, DELIVERY FAILURE (§2.10.6) but spend its entire deadline on doomed retries.
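
To get a feel for how many doomed attempts fit into one deadline, here is a minimal sketch of the doubling schedule. It assumes the first transmission happens at time zero, the timeout starts at the baseline and doubles after every attempt, and attempts stop at the deadline. The function name and the 1-second deadline in the usage note are illustrative assumptions.

```python
def transmission_times(baseline_ms: int, deadline_ms: int) -> list[int]:
    """Times (ms) at which the publisher transmits under pure doubling backoff."""
    times = []
    t, timeout = 0, baseline_ms
    while t < deadline_ms:
        times.append(t)
        t += timeout   # wait for the ACK timeout to expire
        timeout *= 2   # then double it for the next attempt
    return times
```

With the ~16 ms baseline quoted above and a 1000 ms deadline, this gives transmissions at 0, 16, 48, 112, 240 and 496 ms: six copies of the same frame offered to a policer that drops every one of them.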

A protocol-level fix: after K consecutive retransmissions all reaching the deadline without any ACK from any associated peer, treat the stream as policed and escalate to a longer hold-down rather than doubling further. Add a new observable state — perhaps DONE, RATE LIMITED — distinguishing this from packet loss. The receiver-side dual would be a hint when ingress rate drops below a peer's recent average, but that's optional; the publisher-side change is the load-bearing one.

Conflict 2: Scout has no built-in rate limit on responses

§2.7: "A scout message is broadcast on the broadcast subject ... every node that receives the message checks its local topics against the pattern; for each match, the receiver replies with a gossip header describing the matched topic."

Concretely, a scout with pattern > on a network with K total topics distributed across N nodes elicits K gossip responses, with no normative throttle. The protocol's only mitigation is "responses are typically unicast back to the originating node," which spares the broadcast group but does nothing for the requester's ingress or for an intervening policer.

PSFP will catch this at the requester's ingress bridge — which is fine for the network but bad for the requester, because the very responses they asked for get dropped and they have no way to retry partial results. Worse, a malicious or buggy scout (subscribing to > to "see everything") creates a deterministic broadcast-subject overload that's hard to attribute.

A protocol-level fix: §2.7 should specify a per-responder cap on the response rate to a single scout — for example, "a responder shall jitter scout replies across a window of W·match_count milliseconds and shall not emit more than R replies per second per requesting node, dropping or deferring the remainder." This makes scout safe to deploy on a policed network without PSFP picking up the slack.
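
The proposed rule can be illustrated with a small scheduling sketch. It is only one possible reading of the suggestion: the function name and parameters are hypothetical, it handles a single scout from a single requester, and it defers excess replies rather than dropping them.

```python
import random


def schedule_scout_replies(match_count, window_per_match_ms, max_rate_per_s, rng=None):
    """Return send offsets (ms after the scout arrives) for each reply.

    Replies are jittered uniformly over W * match_count milliseconds and
    then deferred as needed so that consecutive replies are at least
    1000 / R ms apart, i.e. no more than R replies per second.
    """
    rng = rng or random.Random()
    window_ms = window_per_match_ms * match_count
    min_gap_ms = 1000.0 / max_rate_per_s
    wanted = sorted(rng.uniform(0.0, window_ms) for _ in range(match_count))
    scheduled = []
    for t in wanted:
        if scheduled:
            # Defer this reply if it would violate the per-requester rate cap.
            t = max(t, scheduled[-1] + min_gap_ms)
        scheduled.append(t)
    return scheduled


# Hypothetical values: 50 matching topics, W = 2 ms, R = 100 replies/s.
offsets = schedule_scout_replies(50, 2.0, 100, rng=random.Random(0))
print(f"{len(offsets)} replies spread over {offsets[-1]:.0f} ms")
```

A real responder would also need per-requester state to apply the cap across overlapping scouts, which is exactly the kind of cost that matters on small embedded nodes.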

The recurring PCP/identification blocker

Same as for 802.1Qav and 802.1CB: PSFP filters are typically keyed on Stream Identifier = DA + VID, with PCP as an optional component (per-stream-per-PCP gates). Without a normative PCP mapping in §3.4.3, integrators can't write rules like "police the SR-B class of node X to N Mbps" because they don't know which PCP Cyphal will emit on.

Pinned topics (§2.3.3) remain the only way to get deterministic, pre-provisionable stream identifiers. Auto-allocated topics can't appear in a CNC-loaded PSFP rule set until the CRDT converges and the integrator (or some discovery agent) updates the bridge config. State this explicitly in §2.4.3 or §3.4.2: "Subjects whose traffic is subject to TSN per-stream policies should be pinned."

Burst envelopes PSFP needs to know about

Flow meters need committed/excess rates and burst sizes. The four bursty patterns I noted under CBS apply here too, but PSFP makes them more concrete because the consequence of under-sizing is not just queuing delay, it's frame loss. The worst-case envelopes that a PSFP-aware deployment needs:

  • Urgent gossip (§2.6.4): up to ~m_u broadcast-subject frames in a window of W_u, where m_u is the number of nodes that observe a CRDT conflict simultaneously. Bound by network size, not by traffic class.
  • Startup gossip (App. F.3): one node's first round emits a gossip per local topic, jittered over W_s. Sustained at K_x frames per W_s during startup.
  • Reliable retransmission to holdouts (§2.10.3): up to N unicast frames of the same payload per backoff window, where N is association set size.
  • Streaming RPC (§2.12): unbounded per-request; the burst envelope is application-defined.
  • Scout reply storms (§2.7, until fixed per Conflict 2 above): up to K replies per scout, where K = total matching topics across all responders.

A non-normative subsection in §3.4 listing these with formulas would let TSN integrators provision Flow Meters that don't surprise them.

One subtle interaction

PSFP can disable a Stream Filter entirely on sustained violation. If that happens to a stream serving a pinned subject, the Cyphal publisher experiences total loss and the receiver experiences none — but the publisher has no way to know its stream was disabled vs. that nobody is listening. The CRDT may respond by evicting the topic to a new subject-ID (eviction counter increments on local heuristics if the implementation interprets sustained loss as conflict), changing the multicast group, leaving the disabled filter pointing at a stream nobody uses, and the new stream un-provisioned. The result: a publisher that was stable becomes unreachable and starts hunting through subject-IDs.

The protocol doesn't actively cause this — the CRDT increments eviction counters on observed gossip conflicts, not on transmission failure — but an implementation that conflates them would. A defensive note in §2.5.4 ("eviction counters shall be incremented only in response to observed CRDT events, never in response to transmit-side failures") would close this off.

What would make v1.1 first-class PSFP-friendly

In priority order:

  1. Detect and break the reliable-retransmission feedback loop. Add an observable DONE, RATE LIMITED state and a back-off-harder rule after K consecutive deadline-expirations without any ACK.
  2. Rate-limit scout replies in the spec. A few sentences in §2.7 specifying jitter and a per-requester cap.
  3. PCP mapping (same recurring fix).
  4. State that pinned topics are the recommended path for TSN-policed traffic.
  5. Burst-envelope subsection (non-normative; useful for FRER, CBS, and PSFP planning all at once).
  6. Defensive note that eviction counters don't react to transmit failure.

Pattern across the TSN questions, updated

After FRER, gPTP, CBS, and now PSFP, the recurring protocol gaps in v1.1 are:

  • No PCP mapping — needed by FRER (stream classification), CBS (queue selection), TAS (slot allocation), and PSFP (per-stream-per-PCP rules).
  • Bounded receiver state — needed by FRER (sequence window), CBS (reassembly hold-time under jitter), and PSFP (consequence sizing).
  • Deterministic stream identification for unicast — needed by FRER and PSFP for pre-provisioning.
  • Bounded protocol-induced bursts — particularly urgent gossip, reliable retransmission, scout reply, and the new ask: distinguishing policed loss from packet loss in the reliable retransmission state machine.

The first three are configuration/deployment enablers — the protocol works without them but integrators can't provision the network deterministically. The fourth is the first set of TSN questions where the protocol itself misbehaves under expected TSN conditions: reliable retransmission against a flow meter, and uncontrolled scout responses against any policer. Those are worth fixing in v1.1-beta rather than deferring to deployment guidance.


Cyphal/UDP v1.1 and IEEE 802.1Qat (Stream Reservation Protocol)

This is the first TSN component where the core v1.1 design philosophy — decentralized, CRDT-driven subject-ID allocation — directly contradicts what the standard expects. The conflicts are real but the practical impact depends heavily on whether the deployment uses dynamic SRP (MSRP/MMRP/MVRP) or centralized CNC configuration.

The fundamental tension

SRP/MSRP expects a stable triple for each reservation: StreamID (≈ talker MAC + unique counter), destination MAC (the multicast group), and traffic profile (MaxFrameSize, MaxIntervalFrames, class, accumulated_latency). Once the reservation is established along the path, bridges allocate CBS credits, listeners register via MMRP, and the path is considered "Ready" — for as long as none of those parameters change.

Cyphal v1.1's CRDT allocation makes the destination multicast MAC a runtime-variable function of network state. Whenever the eviction counter for a topic increments — a normal, expected CRDT event — the subject-ID changes, the multicast group changes, and the SRP reservation is silently stranded. The talker has to withdraw the old MSRP advertisement and emit a new one; every listener has to MMRP-deregister and re-register; every bridge has to tear down and rebuild CBS reservations along the path. None of this is observable to the Cyphal session layer.

This isn't fatal — SRP supports dynamic add/withdraw of streams. But it converts every CRDT eviction into a network-wide reservation churn event, which is the opposite of what SRP is designed for.

Two protocol-level conflicts

Conflict 1: Pinning is "a request, not a guarantee" (§2.5.6) — but SRP needs guarantees

§2.5.6 explicitly says pinned topics participate in arbitration on equal terms with non-pinned topics, and that "an older non-pinned topic can displace a younger pinned newcomer." For SRP, this is structurally broken: a deployment that pre-configures CBS reservations for pinned subjects can find at runtime that the CRDT has moved one of its pinned topics off its pin because an older auto-allocated topic landed on the same subject-ID first. The reservation now points at a multicast group that nobody is publishing on, and the actually-pinned topic is on a different subject-ID with no reservation.

For TSN integrators, this means pinning cannot be relied on as a stable contract with the network. That defeats the purpose of pinning for TSN.

Three possible fixes:

  • Strongest: pinned topics never lose arbitration. The CRDT's arbitration rule (§2.5.3) is amended so that any collision between pinned and non-pinned resolves in favor of pinned, and any collision between two pinned topics resolves to whichever one was integrated first (deterministic by topic hash or by deployment-time tiebreaker). Requires updating the convergence proof in App. C; the proof structure still works because pinned topics form a fixed set with a strict priority.
  • Middle ground: add a separate "reserved" subject-ID range that the auto-allocator cannot touch and that arbitration cannot evict from. Pinned-into-reserved topics get the TSN-compatible guarantee; pinned-into-unreserved retain the existing "request" semantics for backward compatibility.
  • Documentation-only: state explicitly that subjects intended for TSN SRP reservation must be placed in the pinned range AND that integrators must ensure no auto-allocated topic ever hashes into those reserved IDs. Brittle, but zero protocol change.

I'd push for the first or second; the third pushes a load-bearing constraint into integrator discipline.

Conflict 2: Cyphal has no per-subject traffic profile to feed into MSRP

MSRP advertisements need MaxFrameSize, MaxIntervalFrames, accumulated_latency, and traffic class per stream. v1.1's session layer carries none of this. A Cyphal publisher creates a topic and starts emitting; the protocol has no notion of "this topic will emit at most N frames of M bytes per class measurement interval."

The integrator has to source MSRP advertisements from outside the protocol — typically a CNC/CUC configuration that the application loads at startup. That works, but it means:

  • The same information lives in two places (the application's publish rate and the CNC's reservation profile) with no enforcement that they agree.
  • A misconfigured publisher emitting faster than its reservation gets policed by the bridge (PSFP) or simply preempted by CBS credit exhaustion. Cyphal has no signal for this (see the PSFP analysis, Conflict 1).
  • Dynamic SRP — where a publisher would advertise its own profile — is not workable because the protocol can't generate the advertisement.

A protocol-level fix: extend the CRDT topic record (§2.5.1) or add a separate per-subject metadata object that carries declared MaxFrameSize, expected frame rate, and recommended SR class. Disseminated via gossip alongside the existing CRDT state. This is a substantial addition — bigger gossip messages, new arbitration rules for conflicting profile claims, possible convergence-proof revisions — but it would let Cyphal nodes participate in dynamic SRP without out-of-band configuration.

The minimum-viable version: a single 16-bit "stream class hint" field (PCP-aligned) added to the gossip header, with no arbitration impact (latest-write-wins per node, advisory only). Enough for an integrator's tooling to auto-generate CNC configurations from the live network state.

A third conflict that's already in scope from earlier answers: pattern subscriptions inflate MMRP

§2.7 and §2.13 allow pattern subscriptions like sensors/* or >. The listener side of SRP is MMRP — Multiple MAC Registration Protocol — which registers interest in a specific multicast MAC. A pattern subscription that matches K topics requires K MMRP registrations from the listener, propagating up the spanning tree to every bridge.

For > over an auto-allocated subject space, this is potentially every multicast group in 239.0.0.0 | (s & 0x7FFFFF). Most bridges' MMRP attribute tables are sized in the thousands, not the millions; a pathological > subscriber can exhaust them.
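
For concreteness, the group expression quoted above and the resulting Ethernet multicast MAC (the address an MMRP registration would name) can be computed as follows. This is a sketch: the subject-to-group formula is taken from the text as written, the MAC derivation is the standard IPv4 multicast mapping (01:00:5e plus the low 23 bits of the group address), and the function names are illustrative.

```python
import ipaddress

SUBJECT_MASK = 0x7FFFFF  # 23 bits, from the expression 239.0.0.0 | (s & 0x7FFFFF)


def subject_multicast_group(subject_id: int) -> ipaddress.IPv4Address:
    """IPv4 multicast group for a subject-ID, per the expression above."""
    base = int(ipaddress.IPv4Address("239.0.0.0"))
    return ipaddress.IPv4Address(base | (subject_id & SUBJECT_MASK))


def multicast_mac(group: ipaddress.IPv4Address) -> str:
    """Standard IPv4-multicast-to-Ethernet mapping: 01:00:5e + low 23 bits."""
    mac = (0x01005E << 24) | (int(group) & 0x7FFFFF)
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


group = subject_multicast_group(1234)
print(group, multicast_mac(group))   # 239.0.4.210 01:00:5e:00:04:d2
print(SUBJECT_MASK + 1)              # 8388608 distinct groups
```

Because the mask is exactly 23 bits wide, every group maps to a distinct MAC, so a listener matching every subject could in principle need over eight million registrations.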

This is the same root cause as the PSFP scout-storm conflict (§2.7 has no inherent rate limit on responses) but the consequence here is L2 state exhaustion in bridges rather than rate-limited drops. Fix is similar: bound the pattern-matched topic set per listener, either by spec ("a listener shall not maintain MMRP registrations exceeding implementation-defined limit L") or by pattern-syntax restriction (no unbounded > patterns; require an explicit depth bound).

What works without protocol changes

  • Pinned topics in a centralized-CNC deployment. If TSN reservations are pre-provisioned by a CNC and pushed via NETCONF/YANG rather than negotiated via dynamic MSRP, Cyphal v1.1 fits cleanly. The CNC enumerates pinned subjects, computes reservations, loads them into bridges; SRP isn't actively used at runtime. Most modern industrial TSN deployments work this way (IEC/IEEE 60802) and Cyphal is compatible. The conflicts above only fire in distributed MSRP/MMRP deployments.
  • Application-driven periodic publication on pinned subjects. MaxFrameSize = MTU, MaxIntervalFrames = 1 per measurement interval at SR-A or SR-B. Once the conflict-1 stability issue is resolved, this is the canonical SRP-compatible pattern.
  • Reverse-path reservations for msg_ack. Adding msg_ack streams (subscriber → publisher unicast) to the SRP configuration is straightforward; they share the SR class of the data stream they acknowledge per §2.10.4. Doubles the reservation count for reliable subjects but no protocol change needed.

What doesn't work, and probably shouldn't be forced to

  • Gossip and scout traffic in SR classes. Gossip's ~5 s period is more than four orders of magnitude slower than SR-A's 125 µs measurement interval (a ratio of 40,000); reserving SR bandwidth for it is pure waste. CRDT and discovery traffic belongs in best-effort. The spec should say so explicitly.
  • Reliable retransmission storms under tight SR budget. §2.10.3's exponential backoff is fine for best-effort but spikes against an SR budget at the moment of greatest stress (loss event). Either size the SR budget for the worst-case retransmission burst (wasteful) or accept that reliable msg_rel won't get the full SR latency guarantee during retransmission (compromises the value of putting it in SR). Either choice is reasonable; the spec should pick one and document it.
  • Streaming RPC under SRP. Response streams from a server are unbounded per request. SRP advertisements need a fixed traffic profile. Don't put streaming RPC into an SR class; keep it in best-effort or in a separately-shaped class (e.g., a TAS slot rather than CBS).

What I'd add to v1.1-beta for SRP

In priority order:

  1. Resolve the pinning-vs-arbitration conflict so pinned subjects are stable enough to anchor TSN reservations. The "reserved range immune to auto-allocation" variant is the least invasive and the easiest to prove convergent.
  2. Document the integration model. State explicitly that v1.1 is designed for centralized-CNC TSN deployments; dynamic MSRP/MMRP is supported but suboptimal because the protocol cannot self-generate advertisements. This is a one-paragraph addition to §3.4 and immediately answers integrator questions.
  3. Bound pattern-subscription scope to prevent MMRP attribute exhaustion. Either a hard cap or a depth-bounded pattern syntax.
  4. Optional stream-class hint in the gossip header. Smallest useful step toward in-protocol traffic-profile advertisement; doesn't require CRDT changes if treated as advisory.

Cyphal/UDP v1.1 over 10BASE-T1S multidrop (IEEE 802.3cg)

Compatible at the protocol level — Cyphal/UDP runs over any Ethernet that carries IP, and 10BASE-T1S multidrop is no exception. But the deployment characteristics are very different from switched Ethernet, and a few v1.1 design choices interact with the shared-medium PLCA discipline in ways worth flagging. No spec changes are strictly required; one of the documentation gaps I've been collecting becomes more visible here.

What fits naturally

A few v1.1 design choices actually align better with multidrop than with switched Ethernet:

  • The "every node subscribes to the broadcast subject" rule (§2.4.3) is essentially free on a shared bus. On a switched fabric, the broadcast subject is real multicast that occupies bandwidth on every segment; on a multidrop bus, every frame already reaches every node, so the broadcast subject costs nothing additional beyond its own bandwidth share.
  • The "learn unicast endpoint from observed traffic" model (§3.4.2.2) that I criticized as a regression for TSN pre-provisioning works fine here. There are no bridges to pre-configure, and every node passively observes every other node's traffic on the bus. First-contact unicast just works.
  • Per-subject multicast group separation that doesn't help on the wire (all frames reach all nodes regardless) still helps the receiving IP stack do its filtering quickly. Cyphal frames destined for a multicast group the node hasn't joined get rejected at the IP stack with no application impact.

What hurts

Three bandwidth concerns and one protocol-level concern.

Bandwidth: gossip overhead on a 10 Mbit/s shared bus

PLCA divides the 10 Mbit/s gross bandwidth across N nodes. After Ethernet overhead, PLCA cycle overhead, IP/UDP headers (28 B), and the Cyphal/UDP header (32 B), per-node sustained payload bandwidth on an 8-node bus is in the ~600-800 kbit/s range. Cyphal's per-topic gossip default (~5 s, §2.6.2) costs roughly K topics × ~96 bit/s per node, which is trivially small for small networks, but for K = 1000 topics it is ~96 kbit/s, or roughly 12-16% of a single node's allotted bandwidth, just for protocol overhead. The gossip-shard optimization (§2.6.1) that lets nodes join only the shards they care about gives no bus-bandwidth savings (every frame is on the wire regardless); its benefit is CPU/IP-stack filtering only.
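
The sizing arithmetic in this paragraph is easy to reproduce. The sketch below takes the ~96 bit/s per-topic figure and the 600-800 kbit/s per-node range from the text; the decomposition of 96 bit/s into about 60 bytes per gossip every 5 s is an assumption made here for illustration, as are the function names.

```python
def gossip_bps_per_topic(bytes_per_gossip: int = 60, period_s: float = 5.0) -> float:
    """Average bit rate of periodic gossip for one topic.
    60 B per gossip is an assumption that reproduces the ~96 bit/s figure."""
    return bytes_per_gossip * 8 / period_s


def gossip_share(topics: int, per_topic_bps: float, node_payload_bps: float) -> float:
    """Fraction of one node's payload bandwidth consumed by gossip."""
    return topics * per_topic_bps / node_payload_bps


for node_bps in (600_000, 800_000):
    for period_s in (5.0, 30.0):
        share = gossip_share(1000, gossip_bps_per_topic(period_s=period_s), node_bps)
        print(f"{node_bps // 1000} kbit/s node, {period_s:.0f} s period: {share:.1%}")
```

With 1000 topics the share falls from 12-16% at a 5 s period to roughly 2-3% at 30 s, which is the effect of lengthening the gossip period on a loaded bus.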

This isn't a protocol problem; it's a deployment-sizing constraint. The spec should say: for shared-medium transports (10BASE-T1S multidrop, similar half-duplex links), total topic count should be sized so that aggregate gossip rate does not exceed a small fraction of per-node bandwidth allotment; the default gossip period may need to be increased to 30 s or more on heavily loaded buses.

Bandwidth: reliable msg_rel ACK storms

§2.10's reliable model has every subscriber emit a msg_ack within the ACK baseline timeout (~16 ms). On a switched network this is N parallel unicasts on separate egress ports; on a multidrop bus, N nodes contend serially for PLCA opportunities. For a 7-subscriber reliable topic on an 8-node bus, that's 7 ACK frames serialized in a window of ~16 ms. PLCA cycle time depends on configuration but typically resolves this in a few hundred microseconds, so the timeout isn't violated — but bus utilization spikes briefly.
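
A back-of-envelope check of the "few hundred microseconds" figure is sketched below. It assumes conventional Ethernet framing overhead (header, FCS, preamble, inter-frame gap), an ACK that carries only the 28 B IP/UDP headers and the 32 B Cyphal/UDP header, and it ignores PLCA beacon and commit overhead; the actual msg_ack size is not stated in this thread, so `ack_body_b` is a placeholder.

```python
ETH_HEADER_B = 14
ETH_FCS_B = 4
ETH_MIN_FRAME_B = 64
PREAMBLE_SFD_B = 8
INTERFRAME_GAP_B = 12
IP_UDP_HEADERS_B = 28
CYPHAL_UDP_HEADER_B = 32


def wire_time_us(udp_payload_b: int, bitrate_bps: int = 10_000_000) -> float:
    """Time one frame occupies the medium, including preamble and gap."""
    frame_b = max(ETH_MIN_FRAME_B,
                  ETH_HEADER_B + IP_UDP_HEADERS_B + udp_payload_b + ETH_FCS_B)
    on_wire_bits = (frame_b + PREAMBLE_SFD_B + INTERFRAME_GAP_B) * 8
    return on_wire_bits * 1e6 / bitrate_bps


def ack_wave_us(subscribers: int, ack_body_b: int = 0) -> float:
    """Bus time for one ACK from each subscriber, sent back to back."""
    return subscribers * wire_time_us(CYPHAL_UDP_HEADER_B + ack_body_b)


print(f"{ack_wave_us(7):.1f} us")  # 7 subscribers on an 8-node bus
```

Under these assumptions seven ACKs occupy the bus for about 0.55 ms, well inside the ~16 ms window but not negligible next to the per-node bandwidth share.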

The unicast retransmission fallback (§2.10.3) makes this worse: a publisher that loses ACKs from M holdouts unicasts retransmissions to each, generating another wave. On a shared bus these M frames all contend for the same medium; on a switched network the parallel egress paths would absorb them more easily.

Documentation: the spec should note that reliable subjects with large subscriber counts impose proportionally larger bus-time costs on shared-medium transports, and recommend that high-fanout reliable subjects either be made best-effort on multidrop segments or that subscriber counts be kept small.

Bandwidth: PLCA priority blindness

10BASE-T1S multidrop has no per-frame priority mechanism on the bus itself. PLCA gives each node a fair, round-robin transmission opportunity; there's no equivalent of CAN's arbitration-by-priority. Cyphal's eight transfer priority levels (§3.2.3), the DSCP mapping (§3.4.3), and any PCP mapping that v1.1-beta might add are all unenforced on the multidrop segment. An "exceptional" priority message and an "optional" priority message contend identically for the next PLCA opportunity.

This matters because integrators reading §3.4.3 may believe their DSCP-mapped traffic shaping is end-to-end. On a multidrop bus it isn't — priority enforcement starts at the bridge (if any) where multidrop meets switched Ethernet. The §3.4.3 text would benefit from a note: DSCP and any PCP mapping affect bridge behavior only; on shared-medium segments such as 10BASE-T1S multidrop, traffic prioritization is provided by the PHY's medium-access scheme rather than by the Cyphal priority level.

This also means the "exceptional" priority self-destruct example in §3.2.3 doesn't actually preempt other traffic on a multidrop bus — the self-destruct message waits for its PLCA opportunity like everything else. Worth knowing if the integrator's threat model includes "highest-priority emergency message must reach all receivers in <X µs."

Protocol: bounded reassembly hold-time, revisited

PLCA serializes multi-frame transfers across opportunities. A 100-frame transfer on an 8-node bus with ~100 µs PLCA cycle time needs at least 100 transmit opportunities, so it takes ~10 ms minimum if the node sends one frame per cycle; that is easily survivable. But under contention (high topic count, multiple publishers competing for opportunities), the inter-frame gap of a single Cyphal transfer can grow to tens of milliseconds. Receivers need to hold reassembly state for the longest plausible inter-frame gap.

This is the same bounded reassembly hold-time gap I raised under CBS, but multidrop gives it a different worst-case profile (PLCA-bounded jitter is deterministic; CBS jitter is statistical). The fix is the same: §3.4.4 or §3.4.5 should specify a recommended reassembly hold-time bound. Without one, an implementer who picks "100 ms" because they tested on a switched network may find frames being dropped on a saturated multidrop bus.

Should v1.1 do anything specific for 10BASE-T1S?

In priority order:

  1. Document the bounded reassembly hold-time. Same fix that helps CBS and PSFP and pretty much every other deployment; the multidrop case just makes the under-specification more visible because the worst-case inter-frame gap is bounded by PLCA cycle × topology rather than by statistical bridge behavior.
  2. Document that priority shaping does not apply on shared-medium segments. A one-sentence note in §3.4.3 closes a real integrator misconception.
  3. Add a "shared-medium deployment profile" subsection with recommended tuning: longer gossip period, smaller reliable-subject subscriber count, MTU sizing relative to PLCA opportunity. Non-normative, but practical.
  4. Note that gossip-shard optimization is CPU-only, not bandwidth, on shared media. The §2.6.1 rationale is written assuming switched topology; the wording should acknowledge that the optimization changes character on shared media.

None of these are protocol changes — they're documentation. The v1.1 wire protocol works correctly on 10BASE-T1S multidrop as written; the gaps are all about helping integrators size and configure the deployment.

@pavel-kirienko

pavel-kirienko commented Jun 19, 2026

Member Author

A TSN PoC is long overdue and I would certainly be eager to co-work on that. This is good.

  1. Restore a deterministic unicast endpoint (or make UID-derived)
    If only one change makes it into v1.1-beta, make it Legal statement #1 (deterministic unicast endpoint)

I want to write down what we discussed at the call. Whatever solution is found for TSN, it should not compromise the "lab" side of the "lab-to-production" spectrum. Going from wide 64-bit UID (EUI-64-style) back to narrow allocatable/assignable identifiers would compromise zero-configuration use cases. Aside from the very practical advantages of not requiring node-ID allocation and allowing an arbitrary number of nodes to coexist on the same IP host without manual address deconflicting (each node gets its own UDP port provided by the OS IP stack, and other nodes discover it), the architectural advantage that was apparently overlooked by Claude is that we no longer attempt to do the job of DHCP or static network planning tools -- that is explicitly out of the scope.

Active misbehavior under TSN conditions: reliable retransmission against a flow meter (PSFP), uncontrolled scout responses against any policer or attribute table (PSFP and SRP), uncontrolled pattern-match scope (SRP).

Burst patterns the spec should document for CBS sizing

Conflict 1: Reliable retransmission can sustain a feedback loop against a flow meter

Conflict 2: Scout has no built-in rate limit on responses

I want to remark that retransmission, scouts, and pattern matching are optional features as far as the protocol is concerned. Perhaps their optionality should be made more explicit in the specification. I expect that the "lab"-profile nodes would always support them -- for, say, ROS-like applications these might be essential features. More "vehicle"-profile nodes can omit them since in a statically configured and well-scheduled network transfers usually don't go missing and dynamic discovery is unlikely to be wanted.

Regarding the scout: it is semi-stateless hence no timeout is applicable; I think this concern is misplaced. CRDT state processing is the same regardless of what triggered the message -- scout, repair, periodic.

Design-level mismatch (new with SRP): the CRDT's dynamic subject-ID allocation fights SRP's stable-reservation model. Pinning is the intended escape hatch, but §2.5.6's "request, not guarantee" semantics defeats it.

In a fully statically configured network, pinning is robust since the only mode where a request cannot be granted is when there are pre-existing allocations of the topic under a different pin (that would imply an unsound configuration).

  1. Carve the auto-allocated subject-ID range into redundancy/QoS classes

I thought TSN vehicle-profile networks would always pin critical topics; is that not a reasonable restriction? If yes, then this becomes a non-issue.

As a side note, I want to warn that while the consensus protocol can of course be altered, doing so is likely to increase the PoC bringup effort by an order of magnitude.

  1. Bound the replay-cache state per sender_uid (NB this is actually per session, not per sender ID)
  2. Replace "occasional reordering tolerance" with an explicit reorder bound
    §3.4.6 says randomized 48-bit transfer_id "obviates the need for a transfer-ID timeout at the receiver." But the spec is silent on reassembly-state aging for in-flight multi-frame transfers.

I left some parameters unspecified where I wasn't sure what sensible defaults are going to look like. It is certainly a good idea to provide at least some reasoning about how one could size them, if not direct recommendations.

FYI, in libudpard the transfer-ID history buffer is currently 32 entries long per session, and the maximum number of incomplete in-progress transfers equals the number of priority levels (to accommodate the unlikely worst case of 8-level priority preemption nesting). The stale reassembly timeout is in the range of tens of seconds (huge).
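
To illustrate what a fixed-size per-session history implies for state sizing, here is a minimal Python sketch of a bounded duplicate filter. It is not how libudpard implements it; the class name is hypothetical and only the 32-entry capacity is taken from the comment above.

```python
from collections import deque


class TransferIdHistory:
    """Per-session duplicate filter that remembers the last `capacity` transfer-IDs."""

    def __init__(self, capacity: int = 32):
        # A deque with maxlen discards the oldest entry automatically,
        # so memory per session is bounded by `capacity`.
        self._recent = deque(maxlen=capacity)

    def accept(self, transfer_id: int) -> bool:
        """Return True for a new transfer, False for a recently seen duplicate."""
        if transfer_id in self._recent:
            return False
        self._recent.append(transfer_id)
        return True


history = TransferIdHistory(capacity=3)   # small capacity to show aging
print([history.accept(t) for t in (1, 1, 2, 3, 4, 1)])
# [True, False, True, True, True, True]
```

The last `True` shows the trade-off: once an ID ages out of the window, a late duplicate is accepted again, so the capacity effectively defines the tolerated reordering and duplication depth.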

CRDT age must come from a monotonic local clock, not gPTP-disciplined wall time

Good point, this is my pet peeve because I often see wall clock misused where monotonic clock is needed. Needs spelling out.
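
As a small illustration of the distinction, the sketch below measures an age from a monotonic clock, which cannot jump when wall time is stepped or disciplined. The class is hypothetical and says nothing about how the spec actually encodes topic age; the injectable clock exists only so that the behaviour can be tested.

```python
import time


class LocalAge:
    """Elapsed local time since creation, taken from a monotonic clock."""

    def __init__(self, now=time.monotonic):
        # time.monotonic() never goes backwards and is unaffected by
        # wall-clock adjustments; time.time() offers no such guarantee.
        self._now = now
        self._created = now()

    def seconds(self) -> float:
        return self._now() - self._created


age = LocalAge()
print(age.seconds() >= 0.0)  # True
```

Using `time.time()` here instead would let a clock step at synchronization time shrink or inflate the age, which is the misuse being warned about.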

Transfer-ID seed entropy (§3.4.6). [...] but worth being explicit that the time component alone is insufficient on a synchronized network.

Misplaced concern -- multiple nodes reusing the same transfer-ID sequence is a non-issue as long as their IDs are unique (otherwise the network would be dysfunctional for a different reason).

A protocol-level fix: §2.7 should specify a per-responder cap on the response rate to a single scout [..] This makes scout safe to deploy on a policed network without PSFP picking up the slack.

Untenable for small embedded nodes.

One subtle interaction
PSFP can disable a Stream Filter entirely on sustained violation. If that happens to a stream serving a pinned subject, the Cyphal publisher experiences total loss and the receiver experiences none — but the publisher has no way to know its stream was disabled vs. that nobody is listening. The CRDT may respond by evicting the topic...

This part is gibberish so I skipped the rest. Overall it seems that the analysis has degraded from sensible to lunacy somewhere halfway. I suggest clipping the part at and after "Transfer-ID seed entropy" to avoid confusion.

@pavel-kirienko

Member Author

I think that moving forward it would be best to avoid pasting raw LLM output
