Restrict SONiC SNMP and gNMI access to the OOB management network#2338
Restrict SONiC SNMP and gNMI access to the OOB management network#2338berendt wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="osism/tasks/conductor/sonic/config_generator.py" line_range="2289-2295" />
<code_context>
+ TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
+ follows the same lookup against the (pass-through) TELEMETRY table.
+ """
+ gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {})
+ return str(gnmi_config.get("port", DEFAULT_GNMI_PORT))
+
+
</code_context>
<issue_to_address>
**suggestion:** Consider validating that the gNMI port is numeric before using it as L4_DST_PORT.
If `TELEMETRY|gnmi|port` is set to a non-numeric value, `L4_DST_PORT` will be invalid and caclmgrd may ignore or mis-handle the rule. Please validate that the port is numeric (int or numeric string) and either fall back to `DEFAULT_GNMI_PORT` or log a warning when it isn’t, so ACL behavior remains predictable under bad TELEMETRY config.
```suggestion
def _get_gnmi_port(config):
"""Return the gNMI server port for the GNMI_ONLY control-plane ACL rule.
The telemetry/gNMI container reads its listen port from
TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
follows the same lookup against the (pass-through) TELEMETRY table.
If TELEMETRY|gnmi|port is set to a non-numeric value, fall back to
DEFAULT_GNMI_PORT and log a warning so ACL behavior remains predictable.
"""
gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {})
port_value = gnmi_config.get("port")
# No port configured: use the default.
if port_value is None:
return DEFAULT_GNMI_PORT
# Accept both integer and numeric string configuration.
try:
port_int = int(port_value)
except (TypeError, ValueError):
logger.warning(
"Invalid TELEMETRY|gnmi|port=%r in config; falling back to default gNMI port %s",
port_value,
DEFAULT_GNMI_PORT,
)
return DEFAULT_GNMI_PORT
# Optionally guard against clearly invalid port ranges while keeping behavior predictable.
if not 0 < port_int < 65536:
logger.warning(
"Out-of-range TELEMETRY|gnmi|port=%d in config; falling back to default gNMI port %s",
port_int,
DEFAULT_GNMI_PORT,
)
return DEFAULT_GNMI_PORT
return str(port_int)
```
</issue_to_address>
### Comment 2
<location path="osism/tasks/conductor/sonic/config_generator.py" line_range="2322-2327" />
<code_context>
+ logs a warning and emits nothing rather than failing the whole config
+ generation.
+ """
+ network = ipaddress.ip_network(f"{oob_ip}/{prefix_len}", strict=False)
+ if network.version != 4:
+ logger.warning(
+ f"OOB IP {oob_ip}/{prefix_len} is not IPv4; skipping control-plane ACLs"
+ )
+ return
+
+ accept_from_oob = {
</code_context>
<issue_to_address>
**issue:** Guard against invalid OOB IPs raising ValueError during ip_network() construction.
If `oob_ip` or `prefix_len` are malformed, `ipaddress.ip_network` will raise `ValueError` and abort config generation. Consider wrapping this call in a `try/except ValueError`, logging a warning (similar to the non-IPv4 case), and returning early so bad input only skips ACL generation rather than breaking the whole config.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def _get_gnmi_port(config): | ||
| """Return the gNMI server port for the GNMI_ONLY control-plane ACL rule. | ||
|
|
||
| The telemetry/gNMI container reads its listen port from | ||
| TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule | ||
| follows the same lookup against the (pass-through) TELEMETRY table. | ||
| """ |
There was a problem hiding this comment.
suggestion: Consider validating that the gNMI port is numeric before using it as L4_DST_PORT.
If TELEMETRY|gnmi|port is set to a non-numeric value, L4_DST_PORT will be invalid and caclmgrd may ignore or mis-handle the rule. Please validate that the port is numeric (int or numeric string) and either fall back to DEFAULT_GNMI_PORT or log a warning when it isn’t, so ACL behavior remains predictable under bad TELEMETRY config.
| def _get_gnmi_port(config): | |
| """Return the gNMI server port for the GNMI_ONLY control-plane ACL rule. | |
| The telemetry/gNMI container reads its listen port from | |
| TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule | |
| follows the same lookup against the (pass-through) TELEMETRY table. | |
| """ | |
| def _get_gnmi_port(config): | |
| """Return the gNMI server port for the GNMI_ONLY control-plane ACL rule. | |
| The telemetry/gNMI container reads its listen port from | |
| TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule | |
| follows the same lookup against the (pass-through) TELEMETRY table. | |
| If TELEMETRY|gnmi|port is set to a non-numeric value, fall back to | |
| DEFAULT_GNMI_PORT and log a warning so ACL behavior remains predictable. | |
| """ | |
| gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {}) | |
| port_value = gnmi_config.get("port") | |
| # No port configured: use the default. | |
| if port_value is None: | |
| return DEFAULT_GNMI_PORT | |
| # Accept both integer and numeric string configuration. | |
| try: | |
| port_int = int(port_value) | |
| except (TypeError, ValueError): | |
| logger.warning( | |
| "Invalid TELEMETRY|gnmi|port=%r in config; falling back to default gNMI port %s", | |
| port_value, | |
| DEFAULT_GNMI_PORT, | |
| ) | |
| return DEFAULT_GNMI_PORT | |
| # Optionally guard against clearly invalid port ranges while keeping behavior predictable. | |
| if not 0 < port_int < 65536: | |
| logger.warning( | |
| "Out-of-range TELEMETRY|gnmi|port=%d in config; falling back to default gNMI port %s", | |
| port_int, | |
| DEFAULT_GNMI_PORT, | |
| ) | |
| return DEFAULT_GNMI_PORT | |
| return str(port_int) |
| network = ipaddress.ip_network(f"{oob_ip}/{prefix_len}", strict=False) | ||
| if network.version != 4: | ||
| logger.warning( | ||
| f"OOB IP {oob_ip}/{prefix_len} is not IPv4; skipping control-plane ACLs" | ||
| ) | ||
| return |
There was a problem hiding this comment.
issue: Guard against invalid OOB IPs raising ValueError during ip_network() construction.
If oob_ip or prefix_len are malformed, ipaddress.ip_network will raise ValueError and abort config generation. Consider wrapping this call in a try/except ValueError, logging a warning (similar to the non-IPv4 case), and returning early so bad input only skips ACL generation rather than breaking the whole config.
e979fb3 to
1b5e04e
Compare
The generated ConfigDB binds the SNMP agent to the OOB IP in the mgmt VRF but never restricts which source networks can reach it, and the gNMI/telemetry server from the static TELEMETRY|gnmi base entry listens on all interfaces with no restriction at all. Emit per-device control-plane ACLs handled by caclmgrd: ACL_TABLE entries of type CTRLPLANE bound to the SNMP and EXTERNAL_CLIENT (gNMI) services, each with an ACL_RULE accepting only the device's network-normalised OOB management subnet. Binding a service in a CTRLPLANE table makes caclmgrd install an implicit default-drop for it, so all other sources lose access. caclmgrd's EXTERNAL_CLIENT service carries no built-in destination port (verified against the ACL_SERVICES map in sonic-host-services 202211 through master): the rule must provide it via L4_DST_PORT, or caclmgrd skips the whole table with only a syslog warning. The GNMI_ONLY rule therefore sets L4_DST_PORT from TELEMETRY|gnmi|port, falling back to the telemetry container default 8080. ACL_TABLE and ACL_RULE become generator-owned on-demand tables: they are dropped from the base config up front and rebuilt only when the device has an IPv4 OOB IP, so stale entries cannot survive a regen and devices without an OOB IP get no ACL tables at all. The new tables are not yet covered by the generated YANG schemas, so the validator reports them as warnings (validation skipped), not errors. Closes #2330 AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
Patch _add_ctrlplane_acls in the orchestrator helper fixture like the other section helpers, so the orchestrator glue stays exercised in isolation, and pin the wiring for #2330: the helper runs exactly once with the OOB IP and prefix when one exists, and is not invoked when the device has no OOB IP — in which case the owned ACL_TABLE and ACL_RULE tables stay absent even if the base config carried stale entries. AI-assisted: Claude Code Signed-off-by: Christian Berendt <berendt@osism.tech>
1b5e04e to
5f30a86
Compare
Fix Report (iteration 1)Per check
Diff summary
StatusSTATUS: DONE Fix report generated by planwerk-review fix with Claude Code |
| rather than failing the whole config generation. | ||
| """ | ||
| network = ipaddress.ip_network(f"{oob_ip}/{prefix_len}", strict=False) | ||
| if network.version != 4: |
There was a problem hiding this comment.
The IPv6 path leaves SNMP and gNMI open — is that the intended behaviour?
For a device with an IPv6 OOB address this returns without emitting any SNMP/gNMI CTRLPLANE table, so both services stay reachable from every source while generate_sonic_config reports success. The docstring here (and the matching test at line 100) show this is deliberate — emit nothing rather than raise, to avoid failing config generation. So this isn't a question of whether you considered IPv6, but whether unrestricted is the outcome you want for those devices.
The reason to double-check: the only signal that hardening was skipped is a warning log, which an operator running a sync may not notice — so an IPv6-managed switch can silently end up unhardened. The alternatives are to fail loudly (loud enough that the operator actually sees it) or to implement an IPv6 filter. I'd flagged this fail-open pattern in #2340; that was a reviewer note, not a binding decision, so if you've weighed it and want this behaviour, just say so — it seemed worth surfacing rather than letting it land silently.
| follows the same lookup against the (image-consumed) TELEMETRY table. | ||
| """ | ||
| gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {}) | ||
| return str(gnmi_config.get("port", DEFAULT_GNMI_PORT)) |
There was a problem hiding this comment.
_get_gnmi_port returns "None" for a present-but-null port, and the resulting rule silently leaves gNMI open.
.get("port", DEFAULT_GNMI_PORT) only substitutes the default when the key is absent. With TELEMETRY.gnmi.port: null the key is present with value None, so this returns str(None) → "None"; "" stays "", and non-numeric or out-of-range values ("not-a-port", 70000) pass straight through into GNMI_ONLY|RULE_1.L4_DST_PORT (line 2405). caclmgrd can't bind that as a port, so it skips the whole EXTERNAL_CLIENT table and gNMI ends up reachable from everywhere — again with the run reporting success.
That part is just a bug: the value should be validated where it's computed (the rule is present but malformed, so a downstream "is the rule there" check wouldn't catch it). What's a judgement call is the remedy — substitute the default 8080, or fail loudly. Substituting could mismatch the actual service port; failing loudly is only right if gNMI is actually enabled, so a device not running gNMI isn't rejected over an unused port. Which way do you want it?
| # disjoint from OWNED_TABLE_KEYS -- an image-consumed table dropped up front | ||
| # would always read back empty. | ||
| IMAGE_CONSUMED_TABLE_KEYS = () | ||
| IMAGE_CONSUMED_TABLE_KEYS = ("TELEMETRY",) |
There was a problem hiding this comment.
This change makes the ownership docstring stale, and any replacement that names the members will rot the same way.
Adding TELEMETRY here means the image-consumed tier is no longer empty, but the generate_sonic_config ownership docstring (~line 203) still reads "Empty for now." — now contradicting the code. Rather than swap in "Currently TELEMETRY" (which goes stale the next time a table joins the tier), the docstring should describe the category and leave the membership to this constant, so there's one source of truth: drop "Empty for now." and, if anything, point at IMAGE_CONSUMED_TABLE_KEYS for the current members. (The constant's own comment above has the same problem — "and so far only" is a count that rots — but that's a separate cleanup.)
| assert "L4_DST_PORT" not in config["ACL_RULE"]["SNMP_ONLY|RULE_1"] | ||
|
|
||
|
|
||
| def test_add_ctrlplane_acls_gnmi_port_follows_telemetry_table(): |
There was a problem hiding this comment.
Add negative-port coverage — this only exercises the two cases that already work.
The test covers a valid integer override (50051) and the absent-key default. Neither touches the inputs that break _get_gnmi_port: port: null, "", "not-a-port", 70000. Once the port-handling decision is made, add those as cases asserting the chosen behaviour (a raise, or substitution of the default). As-is the suite masks the defaulting bug rather than catching it.
| } | ||
|
|
||
|
|
||
| def test_add_ctrlplane_acls_non_ipv4_oob_ip_emits_nothing(): |
There was a problem hiding this comment.
This test documents the IPv6 fail-open as intended — flagging it alongside the helper.
Asserting the tables are absent for an IPv6 OOB address pins the "emit nothing" choice as contract, which is what tells me it's deliberate rather than an oversight. No separate action here: if the helper's IPv6 behaviour changes (see the comment there), this test changes with it; if it stays, this is where the decision is recorded. Just tying the two together.
| _add_loopback_configuration=patch("_add_loopback_configuration"), | ||
| _add_log_server_configuration=patch("_add_log_server_configuration"), | ||
| _add_snmp_configuration=patch("_add_snmp_configuration"), | ||
| _add_ctrlplane_acls=patch("_add_ctrlplane_acls"), |
There was a problem hiding this comment.
Add one unmocked end-to-end test — the wiring tests patch the helper, so nothing proves ACLs reach the output.
These orchestrator tests mock _add_ctrlplane_acls and prove only that it's called with the right args. There's no single test where a base config sets TELEMETRY.gnmi.port, the real generate_sonic_config runs, and the final output is asserted to contain ACL_RULE["GNMI_ONLY|RULE_1"] with that port — i.e. the configured port surviving the central owned-table drop end-to-end. That integration is currently covered only compositionally (helper unit test + drop logic separately), so a regression between them would pass. (No need to re-assert TELEMETRY preservation — the disjointness guard in test_config_generator_ownership.py already covers that; the value here is the port-through-drop flow.)
Closes #2330
What this does
The generated SONiC ConfigDB bound the SNMP agent to the OOB IP in the mgmt VRF but never restricted which source networks can reach it, and the gNMI/telemetry server (static
TELEMETRY|gnmibase entry) listened on all interfaces with no restriction at all.This PR emits, per device with an OOB IP, control-plane ACLs handled by
caclmgrd:ACL_TABLEentries of typeCTRLPLANEbound to theSNMPandEXTERNAL_CLIENT(gNMI) services, each with anACL_RULEthat accepts only the device's network-normalised OOB management subnet. Binding a service in a CTRLPLANE table makes caclmgrd install an implicit default-drop for that service, so all other sources lose access.Commit walkthrough
_add_ctrlplane_aclshelper (the home for Ensure that the SONiC SSH service is accessible only on the management network #2329'sSSH_ONLYas well), registersACL_TABLE/ACL_RULEas generator-owned on-demand tables (dropped from the base config up front, rebuilt on every regen, absent without an OOB IP), wires the helper intogenerate_sonic_configinside the management-interface block, and covers the helper with unit tests (tables/rules emitted, network-normalisedSRC_IP,TELEMETRY|gnmi|porthandling, wholesale replacement of pre-existing entries, non-IPv4 skip, ownership registration)._add_ctrlplane_aclsin the orchestrator helper fixture like the other section helpers and pins the wiring: called exactly once with(config, oob_ip, prefix_len)when an OOB IP exists; not invoked (and the owned tables stay absent, including stale base-config entries) when there is none.Deviations from the issue
L4_DST_PORTadded to theGNMI_ONLYrule. The issue's example rule carries onlyPRIORITY/PACKET_ACTION/SRC_IP/IP_TYPE, and its note saysEXTERNAL_CLIENT"maps to the configuredTELEMETRY|gnmi|port". The issue asked to verify this against caclmgrd before relying on it — verification againstsonic-host-services(202211, 202305, 202311, 202405, master) shows noTELEMETRY/GNMIlookup exists in any of these versions:EXTERNAL_CLIENThas no built-in destination port, the rule must provide it viaL4_DST_PORT, and a CTRLPLANE table whose rules yield no destination port is skipped entirely (syslog warning only) — the issue's exact example would have been a silent no-op for gNMI. The generator therefore performs theTELEMETRY|gnmi|portmapping itself and writes it into the rule, falling back to the telemetry container default8080when the base config sets no port. The SNMP rule needs no port (fixed161in caclmgrd'sACL_SERVICES).IPv4Network(...)unconditionally;get_device_oob_ipcan legitimately return an IPv6 address (NetBoxoob_ipor a mgmt interface with only a v6 address), and an unguardedIPv4Networkwould abort the device's entire config generation. The skip path is logged and unit-tested.SSH_ONLYis not emitted here — that is Ensure that the SONiC SSH service is accessible only on the management network #2329's scope (still open). The helper is written and documented as the shared home for it; adding SSH is a one-entry change.sonic-aclYANG model added. The issue's "Consider adding..." relates to Cover Enterprise SONiC tables in the YANG model set #2258; the validator continues to reportACL_TABLE/ACL_RULEas warnings ("No YANG schema for table..."), not errors, exactly as the issue describes.Verification
pipenv run pytest— 1150 passed (full unit suite; includes the existing exhaustive owned-table stale-drop test, which automatically covers the two new owned tables)pipenv run flake8on the changed files — cleansonic-net/sonic-host-servicesbranches 202211 → master (ACL_SERVICES,EXTERNAL_CLIENTport handling,is_rule_ipv4, implicit drop)🤖 Generated with Claude Code