Skip to content

Restrict SONiC SNMP and gNMI access to the OOB management network#2338

Open
berendt wants to merge 2 commits into
mainfrom
implement/issue-2330-snmp-gnmi-ctrlplane-acls
Open

Restrict SONiC SNMP and gNMI access to the OOB management network#2338
berendt wants to merge 2 commits into
mainfrom
implement/issue-2330-snmp-gnmi-ctrlplane-acls

Conversation

@berendt

@berendt berendt commented Jun 10, 2026

Copy link
Copy Markdown
Member

Closes #2330

What this does

The generated SONiC ConfigDB bound the SNMP agent to the OOB IP in the mgmt VRF but never restricted which source networks can reach it, and the gNMI/telemetry server (static TELEMETRY|gnmi base entry) listened on all interfaces with no restriction at all.

This PR emits, per device with an OOB IP, control-plane ACLs handled by caclmgrd: ACL_TABLE entries of type CTRLPLANE bound to the SNMP and EXTERNAL_CLIENT (gNMI) services, each with an ACL_RULE that accepts only the device's network-normalised OOB management subnet. Binding a service in a CTRLPLANE table makes caclmgrd install an implicit default-drop for that service, so all other sources lose access.

Commit walkthrough

  1. Restrict SONiC SNMP and gNMI access to the OOB network — adds the shared _add_ctrlplane_acls helper (the home for Ensure that the SONiC SSH service is accessible only on the management network #2329's SSH_ONLY as well), registers ACL_TABLE/ACL_RULE as generator-owned on-demand tables (dropped from the base config up front, rebuilt on every regen, absent without an OOB IP), wires the helper into generate_sonic_config inside the management-interface block, and covers the helper with unit tests (tables/rules emitted, network-normalised SRC_IP, TELEMETRY|gnmi|port handling, wholesale replacement of pre-existing entries, non-IPv4 skip, ownership registration).
  2. Add orchestrator wiring tests for control-plane ACLs — patches _add_ctrlplane_acls in the orchestrator helper fixture like the other section helpers and pins the wiring: called exactly once with (config, oob_ip, prefix_len) when an OOB IP exists; not invoked (and the owned tables stay absent, including stale base-config entries) when there is none.

Deviations from the issue

  • L4_DST_PORT added to the GNMI_ONLY rule. The issue's example rule carries only PRIORITY/PACKET_ACTION/SRC_IP/IP_TYPE, and its note says EXTERNAL_CLIENT "maps to the configured TELEMETRY|gnmi|port". The issue asked to verify this against caclmgrd before relying on it — verification against sonic-host-services (202211, 202305, 202311, 202405, master) shows no TELEMETRY/GNMI lookup exists in any of these versions: EXTERNAL_CLIENT has no built-in destination port, the rule must provide it via L4_DST_PORT, and a CTRLPLANE table whose rules yield no destination port is skipped entirely (syslog warning only) — the issue's exact example would have been a silent no-op for gNMI. The generator therefore performs the TELEMETRY|gnmi|port mapping itself and writes it into the rule, falling back to the telemetry container default 8080 when the base config sets no port. The SNMP rule needs no port (fixed 161 in caclmgrd's ACL_SERVICES).
  • Non-IPv4 OOB IPs skip ACL generation with a warning instead of crashing. The issue assumed IPv4Network(...) unconditionally; get_device_oob_ip can legitimately return an IPv6 address (NetBox oob_ip or a mgmt interface with only a v6 address), and an unguarded IPv4Network would abort the device's entire config generation. The skip path is logged and unit-tested.
  • SSH_ONLY is not emitted here — that is Ensure that the SONiC SSH service is accessible only on the management network #2329's scope (still open). The helper is written and documented as the shared home for it; adding SSH is a one-entry change.
  • No sonic-acl YANG model added. The issue's "Consider adding..." relates to Cover Enterprise SONiC tables in the YANG model set #2258; the validator continues to report ACL_TABLE/ACL_RULE as warnings ("No YANG schema for table..."), not errors, exactly as the issue describes.

Verification

  • pipenv run pytest — 1150 passed (full unit suite; includes the existing exhaustive owned-table stale-drop test, which automatically covers the two new owned tables)
  • pipenv run flake8 on the changed files — clean
  • caclmgrd semantics verified against sonic-net/sonic-host-services branches 202211 → master (ACL_SERVICES, EXTERNAL_CLIENT port handling, is_rule_ipv4, implicit drop)

🤖 Generated with Claude Code

@berendt berendt requested review from ideaship and osfrickler June 10, 2026 07:39
@berendt berendt marked this pull request as ready for review June 10, 2026 07:39

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="osism/tasks/conductor/sonic/config_generator.py" line_range="2289-2295" />
<code_context>
+    TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
+    follows the same lookup against the (pass-through) TELEMETRY table.
+    """
+    gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {})
+    return str(gnmi_config.get("port", DEFAULT_GNMI_PORT))
+
+
</code_context>
<issue_to_address>
**suggestion:** Consider validating that the gNMI port is numeric before using it as L4_DST_PORT.

If `TELEMETRY|gnmi|port` is set to a non-numeric value, `L4_DST_PORT` will be invalid and caclmgrd may ignore or mis-handle the rule. Please validate that the port is numeric (int or numeric string) and either fall back to `DEFAULT_GNMI_PORT` or log a warning when it isn’t, so ACL behavior remains predictable under bad TELEMETRY config.

```suggestion
def _get_gnmi_port(config):
    """Return the gNMI server port for the GNMI_ONLY control-plane ACL rule.

    The telemetry/gNMI container reads its listen port from
    TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
    follows the same lookup against the (pass-through) TELEMETRY table.

    If TELEMETRY|gnmi|port is set to a non-numeric value, fall back to
    DEFAULT_GNMI_PORT and log a warning so ACL behavior remains predictable.
    """
    gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {})
    port_value = gnmi_config.get("port")

    # No port configured: use the default.
    if port_value is None:
        return DEFAULT_GNMI_PORT

    # Accept both integer and numeric string configuration.
    try:
        port_int = int(port_value)
    except (TypeError, ValueError):
        logger.warning(
            "Invalid TELEMETRY|gnmi|port=%r in config; falling back to default gNMI port %s",
            port_value,
            DEFAULT_GNMI_PORT,
        )
        return DEFAULT_GNMI_PORT

    # Optionally guard against clearly invalid port ranges while keeping behavior predictable.
    if not 0 < port_int < 65536:
        logger.warning(
            "Out-of-range TELEMETRY|gnmi|port=%d in config; falling back to default gNMI port %s",
            port_int,
            DEFAULT_GNMI_PORT,
        )
        return DEFAULT_GNMI_PORT

    return str(port_int)
```
</issue_to_address>

### Comment 2
<location path="osism/tasks/conductor/sonic/config_generator.py" line_range="2322-2327" />
<code_context>
+    logs a warning and emits nothing rather than failing the whole config
+    generation.
+    """
+    network = ipaddress.ip_network(f"{oob_ip}/{prefix_len}", strict=False)
+    if network.version != 4:
+        logger.warning(
+            f"OOB IP {oob_ip}/{prefix_len} is not IPv4; skipping control-plane ACLs"
+        )
+        return
+
+    accept_from_oob = {
</code_context>
<issue_to_address>
**issue:** Guard against invalid OOB IPs raising ValueError during ip_network() construction.

If `oob_ip` or `prefix_len` are malformed, `ipaddress.ip_network` will raise `ValueError` and abort config generation. Consider wrapping this call in a `try/except ValueError`, logging a warning (similar to the non-IPv4 case), and returning early so bad input only skips ACL generation rather than breaking the whole config.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +2289 to +2295
def _get_gnmi_port(config):
"""Return the gNMI server port for the GNMI_ONLY control-plane ACL rule.

The telemetry/gNMI container reads its listen port from
TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
follows the same lookup against the (pass-through) TELEMETRY table.
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Consider validating that the gNMI port is numeric before using it as L4_DST_PORT.

If TELEMETRY|gnmi|port is set to a non-numeric value, L4_DST_PORT will be invalid and caclmgrd may ignore or mis-handle the rule. Please validate that the port is numeric (int or numeric string) and either fall back to DEFAULT_GNMI_PORT or log a warning when it isn’t, so ACL behavior remains predictable under bad TELEMETRY config.

Suggested change
def _get_gnmi_port(config):
"""Return the gNMI server port for the GNMI_ONLY control-plane ACL rule.
The telemetry/gNMI container reads its listen port from
TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
follows the same lookup against the (pass-through) TELEMETRY table.
"""
def _get_gnmi_port(config):
"""Return the gNMI server port for the GNMI_ONLY control-plane ACL rule.
The telemetry/gNMI container reads its listen port from
TELEMETRY|gnmi|port and falls back to 8080 when unset, so the ACL rule
follows the same lookup against the (pass-through) TELEMETRY table.
If TELEMETRY|gnmi|port is set to a non-numeric value, fall back to
DEFAULT_GNMI_PORT and log a warning so ACL behavior remains predictable.
"""
gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {})
port_value = gnmi_config.get("port")
# No port configured: use the default.
if port_value is None:
return DEFAULT_GNMI_PORT
# Accept both integer and numeric string configuration.
try:
port_int = int(port_value)
except (TypeError, ValueError):
logger.warning(
"Invalid TELEMETRY|gnmi|port=%r in config; falling back to default gNMI port %s",
port_value,
DEFAULT_GNMI_PORT,
)
return DEFAULT_GNMI_PORT
# Optionally guard against clearly invalid port ranges while keeping behavior predictable.
if not 0 < port_int < 65536:
logger.warning(
"Out-of-range TELEMETRY|gnmi|port=%d in config; falling back to default gNMI port %s",
port_int,
DEFAULT_GNMI_PORT,
)
return DEFAULT_GNMI_PORT
return str(port_int)

Comment on lines +2322 to +2327
network = ipaddress.ip_network(f"{oob_ip}/{prefix_len}", strict=False)
if network.version != 4:
logger.warning(
f"OOB IP {oob_ip}/{prefix_len} is not IPv4; skipping control-plane ACLs"
)
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Guard against invalid OOB IPs raising ValueError during ip_network() construction.

If oob_ip or prefix_len are malformed, ipaddress.ip_network will raise ValueError and abort config generation. Consider wrapping this call in a try/except ValueError, logging a warning (similar to the non-IPv4 case), and returning early so bad input only skips ACL generation rather than breaking the whole config.

berendt added 2 commits June 15, 2026 09:17
The generated ConfigDB binds the SNMP agent to the OOB IP in the mgmt
VRF but never restricts which source networks can reach it, and the
gNMI/telemetry server from the static TELEMETRY|gnmi base entry
listens on all interfaces with no restriction at all.

Emit per-device control-plane ACLs handled by caclmgrd: ACL_TABLE
entries of type CTRLPLANE bound to the SNMP and EXTERNAL_CLIENT
(gNMI) services, each with an ACL_RULE accepting only the device's
network-normalised OOB management subnet. Binding a service in a
CTRLPLANE table makes caclmgrd install an implicit default-drop for
it, so all other sources lose access.

caclmgrd's EXTERNAL_CLIENT service carries no built-in destination
port (verified against the ACL_SERVICES map in sonic-host-services
202211 through master): the rule must provide it via L4_DST_PORT, or
caclmgrd skips the whole table with only a syslog warning. The
GNMI_ONLY rule therefore sets L4_DST_PORT from TELEMETRY|gnmi|port,
falling back to the telemetry container default 8080.

ACL_TABLE and ACL_RULE become generator-owned on-demand tables: they
are dropped from the base config up front and rebuilt only when the
device has an IPv4 OOB IP, so stale entries cannot survive a regen
and devices without an OOB IP get no ACL tables at all.

The new tables are not yet covered by the generated YANG schemas, so
the validator reports them as warnings (validation skipped), not
errors.

Closes #2330

AI-assisted: Claude Code
Signed-off-by: Christian Berendt <berendt@osism.tech>
Patch _add_ctrlplane_acls in the orchestrator helper fixture like the
other section helpers, so the orchestrator glue stays exercised in
isolation, and pin the wiring for #2330: the helper runs exactly once
with the OOB IP and prefix when one exists, and is not invoked when
the device has no OOB IP — in which case the owned ACL_TABLE and
ACL_RULE tables stay absent even if the base config carried stale
entries.

AI-assisted: Claude Code
Signed-off-by: Christian Berendt <berendt@osism.tech>
@berendt berendt force-pushed the implement/issue-2330-snmp-gnmi-ctrlplane-acls branch from 1b5e04e to 5f30a86 Compare June 15, 2026 07:18
@berendt

berendt commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

Fix Report (iteration 1)

Per check

  • osism/check → python-osism-unit-tests
    • Category: test
    • Root cause: Two static ownership-guard tests in test_config_generator_ownership.py failed against the new code in commit 21b9e6f:
      1. TestStaticTableReferenceGuard::test_every_referenced_table_is_classified — the new _get_gnmi_port reads config.get("TELEMETRY", …), but IMAGE_CONSUMED_TABLE_KEYS was still (), leaving TELEMETRY unclassified.
      2. TestMultiOwnerTableGuard::test_no_wholesale_reassignment_of_multi_owner_tables_add_ctrlplane_acls did config["ACL_TABLE"] = {…} / config["ACL_RULE"] = {…}, rebinding tables listed in the pre-existing MULTI_OWNER_OWNED_TABLE_KEYS guard (armed on main specifically to force the per-key merge pattern so the future SSH helper (Ensure that the SONiC SSH service is accessible only on the management network #2329) composes).
    • Fix: osism/tasks/conductor/sonic/config_generator.py — (1) IMAGE_CONSUMED_TABLE_KEYS = ("TELEMETRY",) + comment/docstring sync ((pass-through)(image-consumed)); (2) _add_ctrlplane_acls now merges per key (config.setdefault("ACL_TABLE", {}) then config["ACL_TABLE"]["SNMP_ONLY"] = …, etc.) instead of wholesale reassignment, with a docstring note on the multi-owner contract. tests/.../test_config_generator_ctrlplane_acls.py — rewrote the now-outdated …replaces_preexisting_entries_wholesale test (which assumed the helper itself drops pre-existing entries) into …merges_into_co_owned_tables_per_key, verifying a sibling helper's SSH_ONLY entries survive while the helper adds its own — the contract the guard enforces (stale removal is the central drop's job in generate_sonic_config).
    • Local verification: Full pytest not runnable here (pbr/pytest not installed; environment isn't set up for the suite, per the project's let-CI-run-tests workflow). Verified instead by (a) replicating the two guards' exact AST detectors against the patched source — unclassified referenced tables: [] and wholesale offenders: none; image-consumed disjoint from owned and inherited; mutation backstop reports no offenders; and (b) executing the patched _get_gnmi_port/_add_ctrlplane_acls in isolation against every scenario in the test file — all assertions pass. py_compile OK.
    • Regression test: Covered by the existing ownership guards (already green after the fix) and the rewritten per-key merge test in test_config_generator_ctrlplane_acls.py.

Diff summary

  • Files: osism/tasks/conductor/sonic/config_generator.py, tests/unit/tasks/conductor/sonic/test_config_generator_ctrlplane_acls.py
  • Approx lines added/removed: +45 / -31
  • Commit strategy: folded into 21b9e6f Restrict SONiC SNMP and gNMI access to the OOB network (now 92a2a14) via git commit --fixup + git rebase --autosquash onto the merge-base; both edited files were introduced by that commit. Branch published with --force-with-lease.

Status

STATUS: DONE


Fix report generated by planwerk-review fix with Claude Code

rather than failing the whole config generation.
"""
network = ipaddress.ip_network(f"{oob_ip}/{prefix_len}", strict=False)
if network.version != 4:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IPv6 path leaves SNMP and gNMI open — is that the intended behaviour?

For a device with an IPv6 OOB address this returns without emitting any SNMP/gNMI CTRLPLANE table, so both services stay reachable from every source while generate_sonic_config reports success. The docstring here (and the matching test at line 100) show this is deliberate — emit nothing rather than raise, to avoid failing config generation. So this isn't a question of whether you considered IPv6, but whether unrestricted is the outcome you want for those devices.

The reason to double-check: the only signal that hardening was skipped is a warning log, which an operator running a sync may not notice — so an IPv6-managed switch can silently end up unhardened. The alternatives are to fail loudly (loud enough that the operator actually sees it) or to implement an IPv6 filter. I'd flagged this fail-open pattern in #2340; that was a reviewer note, not a binding decision, so if you've weighed it and want this behaviour, just say so — it seemed worth surfacing rather than letting it land silently.

follows the same lookup against the (image-consumed) TELEMETRY table.
"""
gnmi_config = config.get("TELEMETRY", {}).get("gnmi", {})
return str(gnmi_config.get("port", DEFAULT_GNMI_PORT))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_gnmi_port returns "None" for a present-but-null port, and the resulting rule silently leaves gNMI open.

.get("port", DEFAULT_GNMI_PORT) only substitutes the default when the key is absent. With TELEMETRY.gnmi.port: null the key is present with value None, so this returns str(None)"None"; "" stays "", and non-numeric or out-of-range values ("not-a-port", 70000) pass straight through into GNMI_ONLY|RULE_1.L4_DST_PORT (line 2405). caclmgrd can't bind that as a port, so it skips the whole EXTERNAL_CLIENT table and gNMI ends up reachable from everywhere — again with the run reporting success.

That part is just a bug: the value should be validated where it's computed (the rule is present but malformed, so a downstream "is the rule there" check wouldn't catch it). What's a judgement call is the remedy — substitute the default 8080, or fail loudly. Substituting could mismatch the actual service port; failing loudly is only right if gNMI is actually enabled, so a device not running gNMI isn't rejected over an unused port. Which way do you want it?

# disjoint from OWNED_TABLE_KEYS -- an image-consumed table dropped up front
# would always read back empty.
IMAGE_CONSUMED_TABLE_KEYS = ()
IMAGE_CONSUMED_TABLE_KEYS = ("TELEMETRY",)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes the ownership docstring stale, and any replacement that names the members will rot the same way.

Adding TELEMETRY here means the image-consumed tier is no longer empty, but the generate_sonic_config ownership docstring (~line 203) still reads "Empty for now." — now contradicting the code. Rather than swap in "Currently TELEMETRY" (which goes stale the next time a table joins the tier), the docstring should describe the category and leave the membership to this constant, so there's one source of truth: drop "Empty for now." and, if anything, point at IMAGE_CONSUMED_TABLE_KEYS for the current members. (The constant's own comment above has the same problem — "and so far only" is a count that rots — but that's a separate cleanup.)

assert "L4_DST_PORT" not in config["ACL_RULE"]["SNMP_ONLY|RULE_1"]


def test_add_ctrlplane_acls_gnmi_port_follows_telemetry_table():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add negative-port coverage — this only exercises the two cases that already work.

The test covers a valid integer override (50051) and the absent-key default. Neither touches the inputs that break _get_gnmi_port: port: null, "", "not-a-port", 70000. Once the port-handling decision is made, add those as cases asserting the chosen behaviour (a raise, or substitution of the default). As-is the suite masks the defaulting bug rather than catching it.

}


def test_add_ctrlplane_acls_non_ipv4_oob_ip_emits_nothing():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test documents the IPv6 fail-open as intended — flagging it alongside the helper.

Asserting the tables are absent for an IPv6 OOB address pins the "emit nothing" choice as contract, which is what tells me it's deliberate rather than an oversight. No separate action here: if the helper's IPv6 behaviour changes (see the comment there), this test changes with it; if it stays, this is where the decision is recorded. Just tying the two together.

_add_loopback_configuration=patch("_add_loopback_configuration"),
_add_log_server_configuration=patch("_add_log_server_configuration"),
_add_snmp_configuration=patch("_add_snmp_configuration"),
_add_ctrlplane_acls=patch("_add_ctrlplane_acls"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add one unmocked end-to-end test — the wiring tests patch the helper, so nothing proves ACLs reach the output.

These orchestrator tests mock _add_ctrlplane_acls and prove only that it's called with the right args. There's no single test where a base config sets TELEMETRY.gnmi.port, the real generate_sonic_config runs, and the final output is asserted to contain ACL_RULE["GNMI_ONLY|RULE_1"] with that port — i.e. the configured port surviving the central owned-table drop end-to-end. That integration is currently covered only compositionally (helper unit test + drop logic separately), so a regression between them would pass. (No need to re-assert TELEMETRY preservation — the disjointness guard in test_config_generator_ownership.py already covers that; the value here is the port-through-drop flow.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Ready

Development

Successfully merging this pull request may close these issues.

Ensure that the SONiC SNMP/GNMI services are accessible only on the management network

3 participants