Skip to content

feat: add POST /v1/databases/{id}/upgrade endpoint#418

Open
tsivaprasad wants to merge 1 commit into
PLAT-602-check-available-image-upgrades-for-a-databasefrom
PLAT-603-apply-image-upgrade-to-a-database
Open

feat: add POST /v1/databases/{id}/upgrade endpoint#418
tsivaprasad wants to merge 1 commit into
PLAT-602-check-available-image-upgrades-for-a-databasefrom
PLAT-603-apply-image-upgrade-to-a-database

Conversation

@tsivaprasad

@tsivaprasad tsivaprasad commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR adds POST /v1/databases/{id}/upgrade, a dedicated endpoint for applying minor-version image upgrades to a database. Validation ensures the target image is a stable manifest entry, belongs to the same PostgreSQL major and Spock major version bucket, and is strictly newer than the database's current version.

Changes

  • API: new apply-upgrade method with ApplyUpgradeRequest /
    ApplyUpgradeResponse Goa types; regenerated OpenAPI spec and transport
    layer
  • Task type: added TypeUpgrade for audit-trail distinction from
    TypeUpdate
  • Orchestrator interface: FindUpgrade(current, targetImage) method +
    ErrUpgradeNotAvailable sentinel error
  • Swarm: Versions.FindByImage reverse-lookup + FindUpgrade
    validation (same PG major, same Spock major, strictly newer, stable
    stability only)
  • Systemd: FindUpgrade returns ErrUpgradeNotAvailable directing
    users to the OS package manager
  • Database service: ApplyUpgrade validates, clears per-node version
    overrides, sets state to modifying, and commits atomically to etcd
  • Workflow service: UpgradeDatabase creates a TypeUpgrade task and
    reuses the existing UpdateDatabase workflow
  • HTTP handler: ApplyUpgrade in PostInitHandlers with prevState
    rollback if workflow trigger fails; pre-init stub returns
    ErrUninitialized

Testing

Verification:

  1. Created a cluster
  2. Create DB with create_db_lower_image.json
cp1-req create-database < ../demo/Images/create_db_lower_image.json | cp-follow-task
{
  "database": {
    "created_at": "2026-06-29T13:55:07Z",
    "id": "storefront",
    "spec": {
      "database_name": "storefront",
      "database_users": [
        {
          "attributes": [
            "SUPERUSER",
            "LOGIN"
          ],
          "db_owner": true,
          "username": "admin"
        }
      ],
      "nodes": [
        {
          "host_ids": [
            "host-1",
            "host-2",
            "host-3"
          ],
          "name": "n1"
        }
      ],
      "postgres_version": "17.9",
      "spock_version": "5"
    },
    "state": "creating",
    "updated_at": "2026-06-29T13:55:07Z"
  },
  "task": {
    "created_at": "2026-06-29T13:55:07Z",
    "database_id": "storefront",
    "entity_id": "storefront",
    "scope": "database",
    "status": "pending",
    "task_id": "019f13a9-905d-755e-80bc-fa9c8014eb4c",
    "type": "create"
  }
}
[2026-06-29T13:55:08Z] refreshing current state
[2026-06-29T13:55:09Z] finished refreshing current state (took 214.188791ms)
[2026-06-29T13:55:11Z] creating resource swarm.patroni_cluster::n1
[2026-06-29T13:55:11Z] finished creating resource swarm.patroni_cluster::n1 (took 30.208µs)
[2026-06-29T13:55:11Z] creating resource swarm.network::storefront-database
[2026-06-29T13:55:11Z] finished creating resource swarm.network::storefront-database (took 6.122542ms)
[2026-06-29T13:55:11Z] creating resource filesystem.dir::storefront-n1-689qacsi-instance
[2026-06-29T13:55:11Z] finished creating resource filesystem.dir::storefront-n1-689qacsi-instance (took 1.785959ms)
[2026-06-29T13:55:11Z] creating resource swarm.patroni_member::storefront-n1-689qacsi
[2026-06-29T13:55:11Z] finished creating resource swarm.patroni_member::storefront-n1-689qacsi (took 32.084µs)
[2026-06-29T13:55:11Z] creating resource filesystem.dir::storefront-n1-689qacsi-configs
[2026-06-29T13:55:11Z] finished creating resource filesystem.dir::storefront-n1-689qacsi-configs (took 1.357708ms)
[2026-06-29T13:55:11Z] creating resource filesystem.dir::storefront-n1-689qacsi-certificates
[2026-06-29T13:55:11Z] finished creating resource filesystem.dir::storefront-n1-689qacsi-certificates (took 1.168042ms)
[2026-06-29T13:55:11Z] creating resource filesystem.dir::storefront-n1-689qacsi-data
[2026-06-29T13:55:11Z] finished creating resource filesystem.dir::storefront-n1-689qacsi-data (took 864.959µs)
[2026-06-29T13:55:12Z] creating resource swarm.etcd_creds::storefront-n1-689qacsi
[2026-06-29T13:55:12Z] finished creating resource swarm.etcd_creds::storefront-n1-689qacsi (took 82.51ms)
[2026-06-29T13:55:12Z] creating resource swarm.postgres_certs::storefront-n1-689qacsi
[2026-06-29T13:55:12Z] finished creating resource swarm.postgres_certs::storefront-n1-689qacsi (took 6.868916ms)
[2026-06-29T13:55:12Z] creating resource swarm.patroni_config::storefront-n1-689qacsi
[2026-06-29T13:55:12Z] finished creating resource swarm.patroni_config::storefront-n1-689qacsi (took 2.16825ms)
[2026-06-29T13:55:13Z] creating resource swarm.postgres_service_spec::storefront-n1-689qacsi
[2026-06-29T13:55:13Z] finished creating resource swarm.postgres_service_spec::storefront-n1-689qacsi (took 184.5µs)
[2026-06-29T13:55:13Z] creating resource swarm.check_will_restart::storefront-n1-689qacsi
[2026-06-29T13:55:13Z] finished creating resource swarm.check_will_restart::storefront-n1-689qacsi (took 2.26775ms)
[2026-06-29T13:55:13Z] creating resource swarm.switchover::storefront-n1-689qacsi
[2026-06-29T13:55:13Z] finished creating resource swarm.switchover::storefront-n1-689qacsi (took 17.834µs)
[2026-06-29T13:55:13Z] creating resource swarm.postgres_service::storefront-n1-689qacsi
[2026-06-29T13:55:23Z] finished creating resource swarm.postgres_service::storefront-n1-689qacsi (took 10.015476588s)
[2026-06-29T13:55:24Z] creating resource database.instance::storefront-n1-689qacsi
[2026-06-29T13:55:34Z] finished creating resource database.instance::storefront-n1-689qacsi (took 10.091672005s)
[2026-06-29T13:55:34Z] creating resource filesystem.dir::storefront-n1-9ptayhma-instance
[2026-06-29T13:55:34Z] finished creating resource filesystem.dir::storefront-n1-9ptayhma-instance (took 753.875µs)
[2026-06-29T13:55:34Z] creating resource filesystem.dir::storefront-n1-ant97dj4-instance
[2026-06-29T13:55:34Z] finished creating resource filesystem.dir::storefront-n1-ant97dj4-instance (took 1.1615ms)
[2026-06-29T13:55:35Z] creating resource swarm.patroni_member::storefront-n1-ant97dj4
[2026-06-29T13:55:35Z] finished creating resource swarm.patroni_member::storefront-n1-ant97dj4 (took 10µs)
[2026-06-29T13:55:35Z] creating resource swarm.patroni_member::storefront-n1-9ptayhma
[2026-06-29T13:55:35Z] finished creating resource swarm.patroni_member::storefront-n1-9ptayhma (took 9.041µs)
[2026-06-29T13:55:35Z] creating resource filesystem.dir::storefront-n1-9ptayhma-certificates
[2026-06-29T13:55:35Z] finished creating resource filesystem.dir::storefront-n1-9ptayhma-certificates (took 1.4505ms)
[2026-06-29T13:55:35Z] creating resource filesystem.dir::storefront-n1-9ptayhma-data
[2026-06-29T13:55:35Z] finished creating resource filesystem.dir::storefront-n1-9ptayhma-data (took 467.417µs)
[2026-06-29T13:55:35Z] creating resource filesystem.dir::storefront-n1-9ptayhma-configs
[2026-06-29T13:55:35Z] finished creating resource filesystem.dir::storefront-n1-9ptayhma-configs (took 281.583µs)
[2026-06-29T13:55:35Z] creating resource filesystem.dir::storefront-n1-ant97dj4-configs
[2026-06-29T13:55:35Z] finished creating resource filesystem.dir::storefront-n1-ant97dj4-configs (took 612.583µs)
[2026-06-29T13:55:35Z] creating resource filesystem.dir::storefront-n1-ant97dj4-certificates
[2026-06-29T13:55:35Z] finished creating resource filesystem.dir::storefront-n1-ant97dj4-certificates (took 567.333µs)
[2026-06-29T13:55:35Z] creating resource filesystem.dir::storefront-n1-ant97dj4-data
[2026-06-29T13:55:35Z] finished creating resource filesystem.dir::storefront-n1-ant97dj4-data (took 419.375µs)
[2026-06-29T13:55:35Z] creating resource swarm.postgres_certs::storefront-n1-9ptayhma
[2026-06-29T13:55:35Z] creating resource swarm.etcd_creds::storefront-n1-9ptayhma
[2026-06-29T13:55:35Z] finished creating resource swarm.postgres_certs::storefront-n1-9ptayhma (took 16.2815ms)
[2026-06-29T13:55:35Z] finished creating resource swarm.etcd_creds::storefront-n1-9ptayhma (took 70.431042ms)
[2026-06-29T13:55:35Z] creating resource swarm.postgres_certs::storefront-n1-ant97dj4
[2026-06-29T13:55:35Z] creating resource swarm.etcd_creds::storefront-n1-ant97dj4
[2026-06-29T13:55:35Z] finished creating resource swarm.postgres_certs::storefront-n1-ant97dj4 (took 11.57575ms)
[2026-06-29T13:55:35Z] finished creating resource swarm.etcd_creds::storefront-n1-ant97dj4 (took 68.555458ms)
[2026-06-29T13:55:35Z] creating resource swarm.patroni_config::storefront-n1-9ptayhma
[2026-06-29T13:55:35Z] finished creating resource swarm.patroni_config::storefront-n1-9ptayhma (took 1.947875ms)
[2026-06-29T13:55:36Z] creating resource swarm.patroni_config::storefront-n1-ant97dj4
[2026-06-29T13:55:36Z] finished creating resource swarm.patroni_config::storefront-n1-ant97dj4 (took 3.315208ms)
[2026-06-29T13:55:36Z] creating resource swarm.postgres_service_spec::storefront-n1-9ptayhma
[2026-06-29T13:55:36Z] finished creating resource swarm.postgres_service_spec::storefront-n1-9ptayhma (took 129.292µs)
[2026-06-29T13:55:36Z] creating resource swarm.postgres_service_spec::storefront-n1-ant97dj4
[2026-06-29T13:55:36Z] finished creating resource swarm.postgres_service_spec::storefront-n1-ant97dj4 (took 132.375µs)
[2026-06-29T13:55:36Z] creating resource swarm.check_will_restart::storefront-n1-9ptayhma
[2026-06-29T13:55:36Z] finished creating resource swarm.check_will_restart::storefront-n1-9ptayhma (took 2.09825ms)
[2026-06-29T13:55:36Z] creating resource swarm.check_will_restart::storefront-n1-ant97dj4
[2026-06-29T13:55:36Z] finished creating resource swarm.check_will_restart::storefront-n1-ant97dj4 (took 1.333167ms)
[2026-06-29T13:55:36Z] creating resource swarm.switchover::storefront-n1-9ptayhma
[2026-06-29T13:55:36Z] finished creating resource swarm.switchover::storefront-n1-9ptayhma (took 19.208µs)
[2026-06-29T13:55:37Z] creating resource swarm.switchover::storefront-n1-ant97dj4
[2026-06-29T13:55:37Z] finished creating resource swarm.switchover::storefront-n1-ant97dj4 (took 11.375µs)
[2026-06-29T13:55:37Z] creating resource swarm.postgres_service::storefront-n1-ant97dj4
[2026-06-29T13:55:37Z] creating resource swarm.postgres_service::storefront-n1-9ptayhma
[2026-06-29T13:55:47Z] finished creating resource swarm.postgres_service::storefront-n1-ant97dj4 (took 10.011749254s)
[2026-06-29T13:55:47Z] finished creating resource swarm.postgres_service::storefront-n1-9ptayhma (took 10.011595795s)
[2026-06-29T13:55:47Z] creating resource database.instance::storefront-n1-9ptayhma
[2026-06-29T13:55:47Z] creating resource database.instance::storefront-n1-ant97dj4
[2026-06-29T13:55:57Z] finished creating resource database.instance::storefront-n1-9ptayhma (took 10.028190422s)
[2026-06-29T13:55:57Z] finished creating resource database.instance::storefront-n1-ant97dj4 (took 10.019370463s)
[2026-06-29T13:55:57Z] creating resource database.node::n1
[2026-06-29T13:55:57Z] finished creating resource database.node::n1 (took 129.042µs)
[2026-06-29T13:55:58Z] creating resource monitor.instance::storefront-n1-9ptayhma
[2026-06-29T13:55:58Z] finished creating resource monitor.instance::storefront-n1-9ptayhma (took 12.994ms)
[2026-06-29T13:55:58Z] creating resource monitor.instance::storefront-n1-689qacsi
[2026-06-29T13:55:58Z] creating resource monitor.instance::storefront-n1-ant97dj4
[2026-06-29T13:55:58Z] finished creating resource monitor.instance::storefront-n1-689qacsi (took 14.330083ms)
[2026-06-29T13:55:58Z] finished creating resource monitor.instance::storefront-n1-ant97dj4 (took 12.415958ms)
[2026-06-29T13:55:58Z] creating resource database.postgres_database::n1:storefront
[2026-06-29T13:55:59Z] finished creating resource database.postgres_database::n1:storefront (took 324.43375ms)

database entity storefront task 019f13a9-905d-755e-80bc-fa9c8014eb4c completed
  1. Verify current state
cp1-req get-database storefront | jq '{state, version: .spec.postgres_version}'
{
  "state": "available",
  "version": "17.9"
}
  1. 'apply-upgrade'
cp1-req apply-upgrade storefront \                                                  
  image: "ghcr.io/pgedge/pgedge-postgres:17.10-spock5.0.8-standard-1" | cp-follow-task
{
  "database": {
    "created_at": "2026-06-29T13:55:07Z",
    "id": "storefront",
    "instances": [
      {
        "created_at": "2026-06-29T13:55:10Z",
        "host_id": "host-1",
        "id": "storefront-n1-689qacsi",
        "node_name": "n1",
        "postgres": {
          "patroni_state": "running",
          "role": "primary",
          "version": "17.9"
        },
        "spock": {
          "read_only": "off",
          "version": "5.0.6"
        },
        "state": "available",
        "status_updated_at": "2026-06-29T13:56:43Z",
        "updated_at": "2026-06-29T13:55:34Z"
      },
      {
        "created_at": "2026-06-29T13:55:34Z",
        "host_id": "host-2",
        "id": "storefront-n1-9ptayhma",
        "node_name": "n1",
        "postgres": {
          "patroni_state": "running",
          "role": "replica",
          "version": "17.9"
        },
        "state": "available",
        "status_updated_at": "2026-06-29T13:56:43Z",
        "updated_at": "2026-06-29T13:55:57Z"
      },
      {
        "created_at": "2026-06-29T13:55:34Z",
        "host_id": "host-3",
        "id": "storefront-n1-ant97dj4",
        "node_name": "n1",
        "postgres": {
          "patroni_state": "running",
          "role": "replica",
          "version": "17.9"
        },
        "state": "available",
        "status_updated_at": "2026-06-29T13:56:43Z",
        "updated_at": "2026-06-29T13:55:57Z"
      }
    ],
    "spec": {
      "database_name": "storefront",
      "database_users": [
        {
          "attributes": [
            "SUPERUSER",
            "LOGIN"
          ],
          "db_owner": true,
          "username": "admin"
        }
      ],
      "nodes": [
        {
          "host_ids": [
            "host-1",
            "host-2",
            "host-3"
          ],
          "name": "n1"
        }
      ],
      "postgres_version": "17.10",
      "spock_version": "5"
    },
    "state": "modifying",
    "updated_at": "2026-06-29T13:56:47Z"
  },
  "task": {
    "created_at": "2026-06-29T13:56:47Z",
    "database_id": "storefront",
    "entity_id": "storefront",
    "scope": "database",
    "status": "pending",
    "task_id": "019f13ab-1665-7dbb-9bc3-47e21a2e934f",
    "type": "upgrade"
  }
}
[2026-06-29T13:56:49Z] refreshing current state
[2026-06-29T13:56:58Z] finished refreshing current state (took 9.004388712s)
[2026-06-29T13:57:00Z] updating resource swarm.patroni_config::storefront-n1-9ptayhma
[2026-06-29T13:57:10Z] finished updating resource swarm.patroni_config::storefront-n1-9ptayhma (took 10.014619005s)
[2026-06-29T13:57:10Z] updating resource swarm.postgres_service_spec::storefront-n1-9ptayhma
[2026-06-29T13:57:10Z] finished updating resource swarm.postgres_service_spec::storefront-n1-9ptayhma (took 117.792µs)
[2026-06-29T13:57:10Z] updating resource swarm.check_will_restart::storefront-n1-9ptayhma
[2026-06-29T13:57:10Z] finished updating resource swarm.check_will_restart::storefront-n1-9ptayhma (took 3.937959ms)
[2026-06-29T13:57:10Z] updating resource swarm.switchover::storefront-n1-9ptayhma
[2026-06-29T13:57:10Z] finished updating resource swarm.switchover::storefront-n1-9ptayhma (took 1.323458ms)
[2026-06-29T13:57:11Z] updating resource swarm.postgres_service::storefront-n1-9ptayhma
[2026-06-29T13:57:26Z] finished updating resource swarm.postgres_service::storefront-n1-9ptayhma (took 15.014658465s)
[2026-06-29T13:57:26Z] updating resource database.instance::storefront-n1-9ptayhma
[2026-06-29T13:57:36Z] finished updating resource database.instance::storefront-n1-9ptayhma (took 10.017048046s)
[2026-06-29T13:57:36Z] updating resource database.node::n1
[2026-06-29T13:57:36Z] finished updating resource database.node::n1 (took 194.125µs)
[2026-06-29T13:57:36Z] updating resource monitor.instance::storefront-n1-9ptayhma
[2026-06-29T13:57:36Z] finished updating resource monitor.instance::storefront-n1-9ptayhma (took 13.191792ms)
[2026-06-29T13:57:36Z] updating resource database.postgres_database::n1:storefront
[2026-06-29T13:57:36Z] finished updating resource database.postgres_database::n1:storefront (took 24.611083ms)
[2026-06-29T13:57:37Z] updating resource swarm.patroni_config::storefront-n1-ant97dj4
[2026-06-29T13:57:47Z] finished updating resource swarm.patroni_config::storefront-n1-ant97dj4 (took 10.012310296s)
[2026-06-29T13:57:47Z] updating resource swarm.postgres_service_spec::storefront-n1-ant97dj4
[2026-06-29T13:57:47Z] finished updating resource swarm.postgres_service_spec::storefront-n1-ant97dj4 (took 109.291µs)
[2026-06-29T13:57:48Z] updating resource swarm.check_will_restart::storefront-n1-ant97dj4
[2026-06-29T13:57:48Z] finished updating resource swarm.check_will_restart::storefront-n1-ant97dj4 (took 4.215417ms)
[2026-06-29T13:57:48Z] updating resource swarm.switchover::storefront-n1-ant97dj4
[2026-06-29T13:57:48Z] finished updating resource swarm.switchover::storefront-n1-ant97dj4 (took 1.361708ms)
[2026-06-29T13:57:48Z] updating resource swarm.postgres_service::storefront-n1-ant97dj4
[2026-06-29T13:58:03Z] finished updating resource swarm.postgres_service::storefront-n1-ant97dj4 (took 15.015888382s)
[2026-06-29T13:58:04Z] updating resource database.instance::storefront-n1-ant97dj4
[2026-06-29T13:58:14Z] finished updating resource database.instance::storefront-n1-ant97dj4 (took 10.012700338s)
[2026-06-29T13:58:14Z] updating resource database.node::n1
[2026-06-29T13:58:14Z] finished updating resource database.node::n1 (took 101.875µs)
[2026-06-29T13:58:14Z] updating resource monitor.instance::storefront-n1-ant97dj4
[2026-06-29T13:58:14Z] finished updating resource monitor.instance::storefront-n1-ant97dj4 (took 13.670042ms)
[2026-06-29T13:58:14Z] updating resource database.postgres_database::n1:storefront
[2026-06-29T13:58:14Z] finished updating resource database.postgres_database::n1:storefront (took 25.158291ms)
[2026-06-29T13:58:15Z] updating resource swarm.patroni_config::storefront-n1-689qacsi
[2026-06-29T13:58:25Z] finished updating resource swarm.patroni_config::storefront-n1-689qacsi (took 10.01192788s)
[2026-06-29T13:58:25Z] updating resource swarm.postgres_service_spec::storefront-n1-689qacsi
[2026-06-29T13:58:25Z] finished updating resource swarm.postgres_service_spec::storefront-n1-689qacsi (took 129.708µs)
[2026-06-29T13:58:26Z] updating resource swarm.check_will_restart::storefront-n1-689qacsi
[2026-06-29T13:58:26Z] finished updating resource swarm.check_will_restart::storefront-n1-689qacsi (took 4.023666ms)
[2026-06-29T13:58:26Z] updating resource swarm.switchover::storefront-n1-689qacsi
[2026-06-29T13:58:38Z] finished updating resource swarm.switchover::storefront-n1-689qacsi (took 12.06183659s)
[2026-06-29T13:58:38Z] updating resource swarm.postgres_service::storefront-n1-689qacsi
[2026-06-29T13:58:53Z] finished updating resource swarm.postgres_service::storefront-n1-689qacsi (took 15.015342839s)
[2026-06-29T13:58:54Z] updating resource database.instance::storefront-n1-689qacsi
[2026-06-29T13:59:04Z] finished updating resource database.instance::storefront-n1-689qacsi (took 10.012717755s)
[2026-06-29T13:59:04Z] updating resource database.node::n1
[2026-06-29T13:59:04Z] finished updating resource database.node::n1 (took 105.917µs)
[2026-06-29T13:59:04Z] updating resource monitor.instance::storefront-n1-689qacsi
[2026-06-29T13:59:04Z] finished updating resource monitor.instance::storefront-n1-689qacsi (took 13.903083ms)
[2026-06-29T13:59:04Z] updating resource database.postgres_database::n1:storefront
[2026-06-29T13:59:05Z] finished updating resource database.postgres_database::n1:storefront (took 31.230083ms)
[2026-06-29T13:59:05Z] creating resource database.switchover::storefront-n1-689qacsi
[2026-06-29T13:59:21Z] finished creating resource database.switchover::storefront-n1-689qacsi (took 16.066724674s)

database entity storefront task 019f13ab-1665-7dbb-9bc3-47e21a2e934f completed
  1. Verify state after 'apply-upgrade'
cp1-req get-database storefront| jq '{state, version: .spec.postgres_version}'
{
  "state": "available",
  "version": "17.10"
}

  1. Confirm servives
docker service ls
ID             NAME                              MODE         REPLICAS   IMAGE                                                        PORTS
dxe5ufbgx0ym   postgres-storefront-n1-9ptayhma   replicated   1/1        ghcr.io/pgedge/pgedge-postgres:17.10-spock5.0.8-standard-1   
axos7gw8jhlx   postgres-storefront-n1-689qacsi   replicated   1/1        ghcr.io/pgedge/pgedge-postgres:17.10-spock5.0.8-standard-1   
m7ebdleki555   postgres-storefront-n1-ant97dj4   replicated   1/1        ghcr.io/pgedge/pgedge-postgres:17.10-spock5.0.8-standard-1
  1. Error scenarios
 cp1-req apply-upgrade storefront \
  image: "ghcr.io/pgedge/pgedge-postgres:99.99-fake"
HTTP/1.1 400 Bad Request
Content-Length: 131
Content-Type: application/json
Date: Wed, 24 Jun 2026 06:17:57 GMT
Goa-Error: invalid_input

{
  message: "upgrade not available: image not found in manifest: ghcr.io/pgedge/pgedge-postgres:99.99-fake"
  name: "invalid_input"
}
cp1-req apply-upgrade storefront \
  image: "ghcr.io/pgedge/pgedge-postgres:17.10-spock5.0.8-standard-1"
HTTP/1.1 400 Bad Request
Content-Length: 113
Content-Type: application/json
Date: Wed, 24 Jun 2026 06:18:30 GMT
Goa-Error: invalid_input

{
  message: "upgrade not available: target version 17.10 is not newer than current 17.10"
  name: "invalid_input"
}

Checklist

  • Tests added

PLAT-603

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f64c3deb-7525-49ed-9634-6841059fa7db

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch PLAT-603-apply-image-upgrade-to-a-database

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codacy-production

codacy-production Bot commented Jun 24, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 2 medium

Results:
2 new issues

Category Results
Complexity 2 medium

View in Codacy

🟢 Metrics 49 complexity · 13 duplication

Metric Results
Complexity 49
Duplication 13

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@tsivaprasad

Copy link
Copy Markdown
Contributor Author

@coderabbitai pls review this PR

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get PR details and file list
gh pr view 418 --repo pgEdge/control-plane --json files,additions,deletions,changedFiles 2>/dev/null || echo "Could not fetch PR details"

# Get the list of changed files
gh pr diff 418 --repo pgEdge/control-plane --name-only 2>/dev/null || echo "Could not fetch file list"

Length of output: 5025


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get full diff
gh pr diff 418 --repo pgEdge/control-plane 2>/dev/null | head -3000

Length of output: 122777


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Fetch the key non-generated source files
echo "=== server/internal/database/service.go ==="
cat server/internal/database/service.go | grep -A 100 "ApplyUpgrade"

echo "=== server/internal/database/orchestrator.go ==="
cat server/internal/database/orchestrator.go

echo "=== server/internal/orchestrator/swarm/orchestrator.go ==="
cat server/internal/orchestrator/swarm/orchestrator.go | grep -A 80 "FindUpgrade"

echo "=== server/internal/orchestrator/swarm/images.go ==="
cat server/internal/orchestrator/swarm/images.go

echo "=== server/internal/orchestrator/systemd/orchestrator.go ==="
cat server/internal/orchestrator/systemd/orchestrator.go | tail -30

echo "=== server/internal/task/task.go ==="
cat server/internal/task/task.go | grep -B2 -A5 "TypeUpgrade\|TypeUpdate"

echo "=== server/internal/workflows/service.go ==="
cat server/internal/workflows/service.go | grep -A 40 "UpgradeDatabase"

echo "=== server/internal/api/apiv1/post_init_handlers.go ==="
cat server/internal/api/apiv1/post_init_handlers.go | grep -A 60 "ApplyUpgrade"

echo "=== server/internal/api/apiv1/pre_init_handlers.go ==="
cat server/internal/api/apiv1/pre_init_handlers.go | tail -20

echo "=== server/internal/api/apiv1/errors.go ==="
cat server/internal/api/apiv1/errors.go | tail -20

echo "=== server/internal/orchestrator/swarm/find_upgrade_test.go ==="
cat server/internal/orchestrator/swarm/find_upgrade_test.go

Length of output: 34975


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== errors.go full new section ==="
gh pr diff 418 --repo pgEdge/control-plane -- server/internal/api/apiv1/errors.go

echo "=== pre_init_handlers.go ApplyUpgrade stub ==="
gh pr diff 418 --repo pgEdge/control-plane -- server/internal/api/apiv1/pre_init_handlers.go

echo "=== systemd orchestrator FindUpgrade ==="
gh pr diff 418 --repo pgEdge/control-plane -- server/internal/orchestrator/systemd/orchestrator.go

echo "=== database service.go full ApplyUpgrade context (broader) ==="
grep -n "UpdateDatabaseState\|ErrUpgradeNotAvailable\|prevState\|rollback" server/internal/database/service.go || true
grep -n "UpdateDatabaseState\|ErrUpgradeNotAvailable\|prevState\|rollback" server/internal/api/apiv1/post_init_handlers.go || true

Length of output: 1181


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== errors.go (full) ==="
cat server/internal/api/apiv1/errors.go

echo "=== pre_init_handlers.go ApplyUpgrade grep ==="
grep -n "ApplyUpgrade\|ErrUninitialized" server/internal/api/apiv1/pre_init_handlers.go || true

echo "=== systemd orchestrator.go FindUpgrade + AvailableUpgrades ==="
grep -n "FindUpgrade\|AvailableUpgrades\|ErrUpgradeNotAvailable" server/internal/orchestrator/systemd/orchestrator.go || true

echo "=== post_init_handlers.go state rollback lines ==="
grep -n "prevState\|UpdateDatabaseState\|rollback\|restorationErr" server/internal/api/apiv1/post_init_handlers.go || true

Length of output: 7906



@tsivaprasad: Here is a thorough review of PR #418.


Summary

The feature is well-structured: clean API design, correct Goa type generation, solid swarm-level validation logic, good test coverage for FindUpgrade, and consistent error-mapping. A few issues are worth addressing before merge.


🔴 High: Spec update is not rolled back on workflow trigger failure

File: server/internal/api/apiv1/post_init_handlers.go

ApplyUpgrade in the service layer atomically persists both the new PostgresVersion in the spec and the modifying state in a single etcd transaction. If UpgradeDatabase subsequently fails (e.g., transient etcd hiccup when creating the task), the handler rolls back the state only:

// ApplyUpgrade atomically writes: spec.PostgresVersion=17.10 + state=modifying
db, err := s.dbSvc.ApplyUpgrade(ctx, databaseID, req.Request.Image)

t, err := s.workflowSvc.UpgradeDatabase(ctx, db)
if err != nil {
    // ✅ state rolled back: modifying → available
    // ❌ spec NOT rolled back: spec still says PostgresVersion=17.10
    s.dbSvc.UpdateDatabaseState(ctx, databaseID, database.DatabaseStateModifying, prevState)
    return nil, apiErr(err)
}

Impact: After this failure path, the database appears available (rolled-back state) but its spec already claims postgres_version: "17.10". Because FindUpgrade enforces strictly newer, the user cannot retry apply-upgrade with the same 17.10 image — they're stuck unless they find a 17.11 or manually call update-database.

Suggested fix options:

  • Option A (simplest): Also restore the spec in the rollback — store prevSpec before calling ApplyUpgrade and write both the old spec and old state back atomically on failure.
  • Option B: Restructure so ApplyUpgrade only updates state to modifying (not the spec), and let the workflow itself update the spec as its first step. This makes the workflow the single source of truth for spec changes.

🟡 Medium: Pre-flight GetDatabase creates a TOCTOU window for prevState

File: server/internal/api/apiv1/post_init_handlers.go

existing, err := s.dbSvc.GetDatabase(ctx, databaseID)  // read `#1`
prevState := existing.State

db, err := s.dbSvc.ApplyUpgrade(ctx, databaseID, ...)  // read `#2` + `#3` internally

The state captured by GetDatabase is already stale by the time ApplyUpgrade validates and commits. In the rare case another operation changes the state between these two calls, the rollback would restore the wrong prevState. A cleaner API would be to have ApplyUpgrade return the previous state, eliminating the pre-flight read.


🟡 Medium: No service-layer or handler-layer tests for ApplyUpgrade

The find_upgrade_test.go tests the swarm orchestrator in isolation, but there are no tests for:

  • database.Service.ApplyUpgrade (spec mutation, per-node version clearing, etcd atomicity)
  • PostInitHandlers.ApplyUpgrade (rollback logic, happy path wiring)

Given the rollback complexity highlighted above, service-level tests would be especially valuable.


🟠 Minor: Dead variable in test

File: server/internal/orchestrator/swarm/find_upgrade_test.go

In the "rejects older version" subtest, older and olderImg are allocated but olderImg is only assigned to _:

t.Run("rejects older version", func(t *testing.T) {
    older := ds.MustParsePgEdgeVersion("17.10", "5")   // used only to fetch image
    olderImg, err2 := v.GetImages(older)
    require.NoError(t, err2)
    // ...
    _ = olderImg   // dead — olderImg is never asserted
})

The intent of the test seems clear but the test name and the variable setup are slightly misleading. Consider either asserting on olderImg or simplifying:

t.Run("rejects older version (downgrade 17.10 → 17.9)", func(t *testing.T) {
    current1710 := ds.MustParsePgEdgeVersion("17.10", "5")
    img179, err := v.GetImages(ds.MustParsePgEdgeVersion("17.9", "5"))
    require.NoError(t, err)
    _, err = o.FindUpgrade(current1710, img179.PgEdgeImage)
    require.Error(t, err)
    assert.True(t, errors.Is(err, database.ErrUpgradeNotAvailable))
})

🟠 Minor: "stable stability is accepted" test name is misleading

File: server/internal/orchestrator/swarm/find_upgrade_test.go

The subtest named "stable stability is accepted" actually verifies the same stable version is rejected (not newer), which is correct behavior but the name implies acceptance. Consider renaming to "stable same-version is rejected".


🟢 Positive observations

  • FindUpgrade validation chain is thorough: manifest lookup → stability filter → PG major → Spock major → strictly newer. All error paths correctly wrap ErrUpgradeNotAvailable for errors.Is unwrapping.
  • ErrUpgradeNotAvailableinvalid_input (400) mapping in errors.go is correct and consistent with the tested error scenarios in the PR description.
  • Per-node version override clearing in ApplyUpgrade is well-reasoned — nodes with a matching version override get cleared, heterogeneous nodes are left intentionally untouched.
  • Systemd stub (FindUpgrade returning ErrUpgradeNotAvailable with a package-manager hint) is a clean, user-friendly direction.
  • TypeUpgrade task type provides the audit-trail distinction as intended.
  • Rollback of state on workflow failure follows the established pattern in DeleteDatabase and UpdateDatabase handlers.
  • All generated Goa/OpenAPI files look correct and consistent.

@moizpgedge moizpgedge self-requested a review June 24, 2026 12:43

@moizpgedge moizpgedge left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this end to end on a dev cluster. The endpoint works: upgrade returned 200, all four containers moved to 17.10 and came back healthy, and the error cases were correct (empty image 400, unknown db 404, fake image 400).

But the upgrade finished in a failed state, twice, with the same error:

failed to update subscription on node n1:
ERROR: cannot assign TransactionIds during recovery (SQLSTATE XX000)
the upgrade reuses the UpdateDatabase workflow, which restarts each node and then reconciles Spock subscriptions. A just restarted node is briefly in recovery (read only), so the subscription write is rejected and the whole upgrade is marked failed, even though every container already upgraded and is healthy. Nothing waits for the node to leave recovery and the step does not retry.

The logs show the trigger clearly: the upgrade does a switchover on each node, and the subscription write then runs against the just promoted primary while it is still finishing recovery.

Result of my two runs:

author log my run 1 my run 2
containers on 17.10 yes yes yes
final db state available failed failed

@tsivaprasad tsivaprasad requested a review from moizpgedge June 26, 2026 16:26
@tsivaprasad

Copy link
Copy Markdown
Contributor Author

I ran this end to end on a dev cluster. The endpoint works: upgrade returned 200, all four containers moved to 17.10 and came back healthy, and the error cases were correct (empty image 400, unknown db 404, fake image 400).

But the upgrade finished in a failed state, twice, with the same error:

failed to update subscription on node n1: ERROR: cannot assign TransactionIds during recovery (SQLSTATE XX000) the upgrade reuses the UpdateDatabase workflow, which restarts each node and then reconciles Spock subscriptions. A just restarted node is briefly in recovery (read only), so the subscription write is rejected and the whole upgrade is marked failed, even though every container already upgraded and is healthy. Nothing waits for the node to leave recovery and the step does not retry.

The logs show the trigger clearly: the upgrade does a switchover on each node, and the subscription write then runs against the just promoted primary while it is still finishing recovery.

Result of my two runs:

author log my run 1 my run 2
containers on 17.10 yes yes yes
final db state available failed failed

Thanks @moizpgedge! That issue is now resolved. Please check now.

@tsivaprasad tsivaprasad marked this pull request as draft June 26, 2026 17:10
@tsivaprasad tsivaprasad force-pushed the PLAT-603-apply-image-upgrade-to-a-database branch 2 times, most recently from c11eb50 to 6286ea4 Compare June 27, 2026 06:18
@tsivaprasad tsivaprasad marked this pull request as ready for review June 27, 2026 07:12

@moizpgedge moizpgedge left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.It looks good to me

Comment thread server/internal/postgres/errors.go Outdated
//
// Both conditions are transient and clear within seconds of promotion
// completing. Callers should retry until this function returns false.
func IsRecoveryError(err error) bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use of IsRecoveryError? I could not find any callers in the code or tests

@tsivaprasad tsivaprasad force-pushed the PLAT-603-apply-image-upgrade-to-a-database branch 2 times, most recently from a010c48 to b75d238 Compare June 29, 2026 14:04
@tsivaprasad tsivaprasad force-pushed the PLAT-603-apply-image-upgrade-to-a-database branch from b75d238 to 12f219a Compare June 29, 2026 14:06
@tsivaprasad

tsivaprasad commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

@moizpgedge Pls reverify this PR and the issue we saw will be addressed as part of PLAT-665

@moizpgedge moizpgedge self-requested a review June 29, 2026 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants