Clarify RemoveDataNode single-replica error and add diagnostics for the no-available-RegionGroup race#17878
Open
CRZbulabula wants to merge 6 commits into
Open
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17878 +/- ##
============================================
+ Coverage 40.54% 40.74% +0.19%
+ Complexity 2622 2621 -1
============================================
Files 5244 5244
Lines 362367 362716 +349
Branches 46651 46687 +36
============================================
+ Hits 146938 147786 +848
+ Misses 215429 214930 -499 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…clarify RemoveDataNode message Drop the 10s busy-wait waitForRegionGroupsVisible loop in PartitionManager: the schema-region create/activate/allocate path is unchanged from the PR baseline, so the poll only masks the intermittent "no available RegionGroup" race probabilistically rather than fixing it. Instead, log a single WARN on the failure path (right before throwing NoAvailableRegionGroupException) that dumps every RegionGroup visible in PartitionInfo for the Database and its LoadCache status. This pinpoints, on the next CI repro, whether PartitionInfo has no RegionGroup yet or has some that are all Disabled — without flooding the log, since it only fires when allocation already failed. Also reword FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE (en + zh): drop the misleading data-loss / increase-factor tail and end with "Removing DataNodes is not supported with single replica."
977ebb7 to
9c46d01
Compare
The previous guard rejected removing any DataNode whenever the cluster kept a single replica (data_replication_factor == 1), which broke IoTDBClusterNodeGetterIT.queryAndRemoveDataNodeTest (2C2D, single replica, removing 1 of 2 DataNodes is legal and must return SUCCESS). Drop the blanket hasSingleDataRegionReplica() guard and reuse the existing capacity check: removal is rejected only when it would leave fewer than NodeInfo.getMinimumDataNode() DataNodes. Under a true single replica (MINIMUM_DATANODE == max(schema, data) == 1) that means only the last remaining DataNode cannot be removed, with a dedicated message. Updated the failing-path IT accordingly.
…node removal The previous test set up 1C2D and removed both DataNodes via `remove datanode 2, 3`, but the REMOVE DATANODE grammar accepts only a single INTEGER_LITERAL, so the SQL failed to parse (`mismatched input ','`) instead of hitting the capacity check. The test then asserted on the wrong exception and failed in CI (Cluster IT - 1C3D / Simple). Set up 1C1D and remove the only DataNode, producing a valid single-id `remove datanode <id>`. Removing it leaves 0 < MINIMUM_DATANODE (1), so checkRegionReplication rejects it with the single-replica message, matching the assertions. Verified locally with -PClusterIT: Tests run: 1, Failures: 0.
The previous code special-cased only the single-replica (MINIMUM_DATANODE
== 1) "last DataNode" case and otherwise emitted a generic message whose
label "maxReplicaFactor" was misleading and which never showed how many
DataNodes the request actually tried to remove.
Replace the if/else with one parameterized message that always reports
the gap: how many DataNodes are being removed, how many are available,
the minimum that must remain (max(schema, data) replication factor, with
both factors spelled out) and how many would be left. When the minimum
is 1, append a single-replica hint explaining at least one DataNode must
always remain. Rename the i18n constant to
FAILED_TO_REMOVE_DATA_NODE_WOULD_LEAVE_TOO_FEW (now a format string) and
add FAILED_TO_REMOVE_DATA_NODE_SINGLE_REPLICA_HINT, both en and zh.
Strengthened the failing-path IT to assert the unified message ("Cannot
remove" + "single replica"). Verified with -PClusterIT:
Tests run: 1, Failures: 0; ConfigNode log shows the rendered message.
failWhenRemovingLastSingleReplicaDataNodeUseSQL only needs a 1C1D cluster, but it lived in IoTDBRemoveDataNodeNormalIT which is @category({ClusterIT.class}) and therefore ran in the 1C3D suite alongside the multi-DataNode removal tests. Extract it into a new IoTDBRemoveLastDataNodeIT annotated @category({LocalStandaloneIT.class}) so it runs in the 1C1D (Simple) suite, isolated from the cluster tests. Verified with the default with-integration-tests profile (no -PClusterIT): Tests run: 1, Failures: 0 on a 1C1D cluster.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Description
This PR hardens ConfigNode RegionGroup / partition handling and clarifies the RemoveDataNode error, in two independent areas.
RemoveDataNode (single replica)
RemoveDataNodeearly whendata_replication_factoris 1 or an existing DataRegion has only one replica (hasSingleDataRegionReplica), so the request fails fast with a clear, user-facing reason instead of a generic "failed to remove" message.FAILED_TO_REMOVE_DATA_NODE_BECAUSE_DATA_REPLICATION_FACTOR_IS_ONE(en + zh): the message now states the cause and ends with "Removing DataNodes is not supported with single replica.", dropping the previous misleading data-loss / "increase data_replication_factor" tail.IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL) asserting the new single-replica error message.RegionGroup creation robustness + race diagnostics
ConfigNodeProcedureEnv.persistRegionGroupnow returns aTSStatus, andCreateRegionGroupsProcedurefails the procedure (setFailure) instead of continuing to activate RegionGroups after a failed consensus write.IoTDBRawQueryWithoutValueFilterWithDeletionIT). Right beforegetSortedRegionGroupSlotsCounterthrowsNoAvailableRegionGroupException, it now logs (WARN, once) every RegionGroup visible inPartitionInfofor the Database together with its LoadCache status. An empty set tells us PartitionInfo has not exposed the new RegionGroup yet; a non-empty all-Disabledset tells us the RegionGroups exist but LoadCache still marks them unavailable. This pinpoints the root cause on the next reproduction without flooding the log (it only fires when allocation already failed).Note: an earlier revision of this PR added a fixed-timeout busy-wait (
waitForRegionGroupsVisible) inPartitionManagerto wait for newly created RegionGroups to become visible. That has been removed: on the observed code path it only masks the race probabilistically rather than fixing it, so this PR ships the targeted diagnostic instead and leaves the actual root-cause fix to a follow-up once the next repro confirms which scenario occurs.Tests
mvn compile -pl iotdb-core/confignodemvn verify -DskipUTs -Drat.skip=true -Dit.test=IoTDBRemoveDataNodeNormalIT#failWhenDataReplicationFactorIsOneUseSQL -DfailIfNoTests=false -Dfailsafe.failIfNoSpecifiedTests=false -pl integration-test -am -PClusterIT -P with-integration-tests🤖 Generated with Claude Code