Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race by CRZbulabula · Pull Request #17878 · apache/iotdb

CRZbulabula · 2026-06-09T07:47:12Z

Description

This PR hardens ConfigNode RegionGroup / partition handling and clarifies the RemoveDataNode error, in two independent areas.

RemoveDataNode (single replica)

Reject RemoveDataNode early when data_replication_factor is 1 or an existing DataRegion has only one replica (hasSingleDataRegionReplica), so the request fails fast with a clear, user-facing reason instead of a generic "failed to remove" message.
Reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): the message now states the cause and ends with "Removing DataNodes is not supported with single replica.", dropping the previous misleading data-loss / "increase data_replication_factor" tail.
Add IT coverage (IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL) asserting the new single-replica error message.

RegionGroup creation robustness + race diagnostics

Propagate RegionGroup persistence failures: ConfigNodeProcedureEnv.persistRegionGroup now returns a TSStatus, and CreateRegionGroupsProcedure fails the procedure (setFailure) instead of continuing to activate RegionGroups after a failed consensus write.
Add a single, failure-path-only diagnostic for the intermittent "no available SchemaRegionGroups" race seen in CI (e.g. IoTDBRawQueryWithoutValueFilterWithDeletionIT). Right before getSortedRegionGroupSlotsCounter throws NoAvailableRegionGroupException, it now logs (WARN, once) every RegionGroup visible in PartitionInfo for the Database together with its LoadCache status. An empty set tells us PartitionInfo has not exposed the new RegionGroup yet; a non-empty all-Disabled set tells us the RegionGroups exist but LoadCache still marks them unavailable. This pinpoints the root cause on the next reproduction without flooding the log (it only fires when allocation already failed).

Note: an earlier revision of this PR added a fixed-timeout busy-wait (waitForRegionGroupsVisible) in PartitionManager to wait for newly created RegionGroups to become visible. That has been removed: on the observed code path it only masks the race probabilistically rather than fixing it, so this PR ships the targeted diagnostic instead and leaves the actual root-cause fix to a follow-up once the next repro confirms which scenario occurs.

Tests

mvn compile -pl iotdb-core/confignode
mvn verify -DskipUTs -Drat.skip=true -Dit.test=IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -am -PClusterIT -P with-integration-tests

🤖 Generated with Claude Code

codecov · 2026-06-09T08:39:27Z

Codecov Report

❌ Patch coverage is 0% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.74%. Comparing base (c3e74a2) to head (047c0a5).
⚠️ Report is 6 commits behind head on master.

Files with missing lines	Patch %	Lines
...onfignode/procedure/env/RemoveDataNodeHandler.java	0.00%	22 Missing ⚠️
...confignode/manager/partition/PartitionManager.java	0.00%	6 Missing ⚠️
...nfignode/procedure/env/ConfigNodeProcedureEnv.java	0.00%	4 Missing ⚠️
...edure/impl/region/CreateRegionGroupsProcedure.java	0.00%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17878      +/-   ##
============================================
+ Coverage     40.54%   40.74%   +0.19%     
+ Complexity     2622     2621       -1     
============================================
  Files          5244     5244              
  Lines        362367   362716     +349     
  Branches      46651    46687      +36     
============================================
+ Hits         146938   147786     +848     
+ Misses       215429   214930     -499

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…clarify RemoveDataNode message Drop the 10s busy-wait waitForRegionGroupsVisible loop in PartitionManager: the schema-region create/activate/allocate path is unchanged from the PR baseline, so the poll only masks the intermittent "no available RegionGroup" race probabilistically rather than fixing it. Instead, log a single WARN on the failure path (right before throwing NoAvailableRegionGroupException) that dumps every RegionGroup visible in PartitionInfo for the Database and its LoadCache status. This pinpoints, on the next CI repro, whether PartitionInfo has no RegionGroup yet or has some that are all Disabled — without flooding the log, since it only fires when allocation already failed. Also reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): drop the misleading data-loss / increase-factor tail and end with "Removing DataNodes is not supported with single replica."

The previous guard rejected removing any DataNode whenever the cluster kept a single replica (data_replication_factor == 1), which broke IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest (2C2D, single replica, removing 1 of 2 DataNodes is legal and must return SUCCESS). Drop the blanket hasSingleDataRegionReplica() guard and reuse the existing capacity check: removal is rejected only when it would leave fewer than NodeInfo.getMinimumDataNode() DataNodes. Under a true single replica (MINIMUM_DATANODE == max(schema, data) == 1) that means only the last remaining DataNode cannot be removed, with a dedicated message. Updated the failing-path IT accordingly.

…node removal The previous test set up 1C2D and removed both DataNodes via `remove datanode 2, 3`, but the REMOVE DATANODE grammar accepts only a single INTEGER_LITERAL, so the SQL failed to parse (`mismatched input ','`) instead of hitting the capacity check. The test then asserted on the wrong exception and failed in CI (Cluster IT - 1C3D / Simple). Set up 1C1D and remove the only DataNode, producing a valid single-id `remove datanode <id>`. Removing it leaves 0 < MINIMUM_DATANODE (1), so checkRegionReplication rejects it with the single-replica message, matching the assertions. Verified locally with -PClusterIT: Tests run: 1, Failures: 0.

The previous code special-cased only the single-replica (MINIMUM_DATANODE == 1) "last DataNode" case and otherwise emitted a generic message whose label "maxReplicaFactor" was misleading and which never showed how many DataNodes the request actually tried to remove. Replace the if/else with one parameterized message that always reports the gap: how many DataNodes are being removed, how many are available, the minimum that must remain (max(schema, data) replication factor, with both factors spelled out) and how many would be left. When the minimum is 1, append a single-replica hint explaining at least one DataNode must always remain. Rename the i18n constant to FAILED_TO_REMOVE_DATA_NODE_WOULD_LEAVE_TOO_FEW (now a format string) and add FAILED_TO_REMOVE_DATA_NODE_SINGLE_REPLICA_HINT, both en and zh. Strengthened the failing-path IT to assert the unified message ("Cannot remove" + "single replica"). Verified with -PClusterIT: Tests run: 1, Failures: 0; ConfigNode log shows the rendered message.

@category

failWhenRemovingLastSingleReplicaDataNodeUseSQL only needs a 1C1D cluster, but it lived in IoTDBRemoveDataNodeNormalIT which is @category({ClusterIT.class}) and therefore ran in the 1C3D suite alongside the multi-DataNode removal tests. Extract it into a new IoTDBRemoveLastDataNodeIT annotated @category({LocalStandaloneIT.class}) so it runs in the 1C1D (Simple) suite, isolated from the cluster tests. Verified with the default with-integration-tests profile (no -PClusterIT): Tests run: 1, Failures: 0 on a 1C1D cluster.

sonarqubecloud · 2026-06-10T09:30:08Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

CRZbulabula changed the title ~~Fix SchemaRegion visibility race and clarify RemoveDataNode error~~ Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race Jun 9, 2026

CRZbulabula added 2 commits June 9, 2026 19:21

Fix schema region visibility race and remove datanode message

96b3c05

CRZbulabula force-pushed the fix-schema-region-partition-race-remove-message branch from 977ebb7 to 9c46d01 Compare June 9, 2026 11:30

CRZbulabula added 4 commits June 10, 2026 13:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race#17878

Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race#17878
CRZbulabula wants to merge 6 commits into
masterfrom
fix-schema-region-partition-race-remove-message

CRZbulabula commented Jun 9, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

sonarqubecloud Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

CRZbulabula commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests

Uh oh!

codecov Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sonarqubecloud Bot commented Jun 10, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

CRZbulabula commented Jun 9, 2026 •

edited

Loading

codecov Bot commented Jun 9, 2026 •

edited

Loading