DAOS-19028 test: REBUILD29 use more precise timing#18546

Open
kccain wants to merge 1 commit into master from kccain/daos_19028_testfix

@kccain kccain commented Jun 25, 2026

Contributor

Before this change, rebuild_kill_PS_leader_during_rebuild() killed
a non-leader engine and then immediately tried to inject the fault
DAOS_REBUILD_TGT_SCAN_HANG on "all engines". The fault injection
itself suffered RPC timeouts because of the killed engine. Those
timeouts further affected the test's overall timing, defeating the
goal of killing the PS leader engine during the first rebuild.

With this change, the test no longer uses fault injection. Instead, it
waits for the first rebuild to start and to show evidence of
scanning activity (rs_toberb_obj_nr > 0). This is done by a
new common function, test_rebuild_wait_to_scanning_next(), which waits
for both rs_version (the pool map version) to increment and the
number of to-be-rebuilt objects to become nonzero.
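
The two-part wait condition described above can be illustrated with a small polling loop. This is only a sketch in Python, not the code in this PR: the `RebuildStatus` class, `wait_to_scanning_next()`, `make_fake_query()`, and the timeout and poll parameters are all hypothetical names chosen for illustration. Only the field names rs_version and rs_toberb_obj_nr come from the description.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RebuildStatus:
    """Hypothetical stand-in holding the two status fields named in the PR text."""
    rs_version: int        # pool map version associated with the rebuild
    rs_toberb_obj_nr: int  # number of objects still to be rebuilt


def wait_to_scanning_next(query: Callable[[], RebuildStatus],
                          prev_version: int,
                          timeout_s: float = 30.0,
                          poll_s: float = 0.5,
                          sleep: Callable[[float], None] = time.sleep) -> RebuildStatus:
    """Poll until rs_version has advanced past prev_version AND scanning
    has found objects to rebuild (rs_toberb_obj_nr > 0)."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = query()
        # Both conditions must hold: a new rebuild has started, and it
        # shows evidence of scanning activity.
        if status.rs_version > prev_version and status.rs_toberb_obj_nr > 0:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("rebuild did not reach scanning in time")
        sleep(poll_s)


def make_fake_query(samples: List[RebuildStatus]) -> Callable[[], RebuildStatus]:
    """Assumed test double: returns each sample in turn, then repeats the last."""
    it = iter(samples)
    last = samples[-1]
    return lambda: next(it, last)


if __name__ == "__main__":
    # Simulated progression: old rebuild, new version with no objects yet,
    # then new version with objects found by scanning.
    samples = [RebuildStatus(5, 0), RebuildStatus(6, 0), RebuildStatus(6, 12)]
    status = wait_to_scanning_next(make_fake_query(samples), prev_version=5,
                                   poll_s=0.0, sleep=lambda _s: None)
    print(status)
```

Requiring both conditions, rather than the version increment alone, is what lets a caller act only once the rebuild is demonstrably in progress instead of relying on an injected hang.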

Test-repeat: 10
Test-tag: test_rebuild_29
Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-test-rpms: true

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@kccain kccain added the forced-landing label ("The PR has known failures or has intentionally reduced testing, but should still be landed.") on Jun 25, 2026
@github-actions

Ticket title is 'daos_test/rebuild.py:DaosCoreTestRebuild.test_rebuild_29 - pool reintegrate failed'
Status is 'In Progress'
Labels: '2.6.5rc3,pr_test,scrubbed_2.8,tcp_provider,test_2.6.5rc1'
https://daosio.atlassian.net/browse/DAOS-19028

@kccain

kccain commented Jun 28, 2026

Contributor Author

New version of test passed 10x repeats
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18546/1/testReport/FTEST_daos_test/DaosCoreTestRebuild-DAOS_Rebuild/

Execution time looks roughly comparable to the existing test, so there is probably no need to adjust the overall test timeout. The worst execution time across the 10 repeats with this PR's change was 4m 17s, versus a worst time of 4m 11s (in build 354) across a few recent passing REBUILD29 runs from the master daily builds.

@kccain kccain marked this pull request as ready for review June 28, 2026 09:53
@kccain kccain requested review from liuxuezhao and wangshilong June 28, 2026 09:54
1 participant