DAOS-17301 dfuse: reduce EQ busy polling #18549

Open
wangshilong wants to merge 3 commits into master from shilongw/DAOS-17301-dfuse

@wangshilong

Contributor

The dfuse progress thread currently polls the DAOS event queue (EQ) with NOWAIT while there are outstanding events. If the server is down or the events make no progress, this turns into busy polling.

Track consecutive empty EQ polls and switch to a small bounded poll timeout after repeated empty polls. The timeout backs off from 50us and is capped at 5ms, while active completion traffic resets the counter and keeps the existing NOWAIT behavior.

This reduces CPU usage during stalled/no-progress periods without changing the normal async submit path.
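
A minimal sketch of the timeout selection described above, not the code from this PR. The constant names mirror the `DFUSE_EQ_BACKOFF_*` defines quoted in the review thread; the helper `eq_poll_timeout_us()`, the reading of `DFUSE_EQ_BACKOFF_START` as an empty-poll threshold, and the "double once per further 64 empty polls" policy are assumptions made for illustration. A return value of 0 stands in for the NOWAIT case.

```c
#include <stdint.h>

/* Values as quoted in the review thread; their exact use here is assumed. */
#define DFUSE_EQ_BACKOFF_START  64   /* empty polls tolerated before backing off */
#define DFUSE_EQ_BACKOFF_MIN_US 50   /* first bounded timeout, in microseconds */
#define DFUSE_EQ_BACKOFF_MAX_US 5000 /* upper bound, in microseconds */

/*
 * Hypothetical helper: map the number of consecutive empty polls to a poll
 * timeout in microseconds. 0 means "do not wait" (the NOWAIT behaviour).
 */
static int64_t
eq_poll_timeout_us(uint32_t empty_polls)
{
	int64_t  timeout = DFUSE_EQ_BACKOFF_MIN_US;
	uint32_t steps;

	/* Active or recently active queue: keep polling without waiting. */
	if (empty_polls < DFUSE_EQ_BACKOFF_START)
		return 0;

	/* Assumed policy: one doubling per additional START empty polls. */
	steps = empty_polls / DFUSE_EQ_BACKOFF_START - 1;
	while (steps-- > 0 && timeout < DFUSE_EQ_BACKOFF_MAX_US)
		timeout *= 2;

	/* Clamp so the result never exceeds the declared cap. */
	if (timeout > DFUSE_EQ_BACKOFF_MAX_US)
		timeout = DFUSE_EQ_BACKOFF_MAX_US;

	return timeout;
}
```

In this sketch a progress loop would pass the returned value as its poll timeout and reset the empty-poll counter whenever a poll returns at least one completion, so normal completion traffic never leaves the NOWAIT path.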

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong requested review from a team as code owners June 26, 2026 09:17
@github-actions

Ticket title is 'Enable client reconnect after server reboot'
Status is 'In Progress'
Labels: 'scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-17301

@daosbuild3

Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18549/1/execution/node/1329/log

@wangshilong wangshilong requested review from knard38 and mchaarawi June 30, 2026 02:09
Comment thread src/client/dfuse/dfuse_core.c Outdated
Comment on lines +93 to +94
eqt->de_empty_polls = 0;

Contributor

One concern I have here is that the counter is reset to 0 based only on this condition.
As I understand it, a stalled event (one that takes a long time to complete) can therefore cause newly submitted events to inherit the accumulated backoff.
So maybe we should also reset it when we add new events to the EQ?
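
A rough sketch of the suggestion, under stated assumptions: only the field name `de_empty_polls` comes from the diff. The struct `eq_thread_state` and the helpers `on_event_submitted()` and `on_poll_result()` are hypothetical, and the counter is made atomic here only because this sketch assumes submission and polling run on different threads; the thread does not show how the real field is declared.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the per-EQ thread state. */
struct eq_thread_state {
	atomic_uint de_empty_polls; /* consecutive polls that returned nothing */
};

/* Submit path: a new event should not inherit backoff from a stalled one. */
static void
on_event_submitted(struct eq_thread_state *eqt)
{
	atomic_store(&eqt->de_empty_polls, 0);
}

/*
 * Progress path: record the outcome of one poll and return the updated
 * count of consecutive empty polls.
 */
static unsigned int
on_poll_result(struct eq_thread_state *eqt, int completed)
{
	if (completed > 0) {
		/* Completions arrived: go back to the no-wait behaviour. */
		atomic_store(&eqt->de_empty_polls, 0);
		return 0;
	}
	return atomic_fetch_add(&eqt->de_empty_polls, 1) + 1;
}
```

With both reset points in place, the backoff only grows while the queue is idle on both the submit and the completion side.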

Contributor Author

Sure, fixed.


#define DFUSE_EQ_BACKOFF_START 64
#define DFUSE_EQ_BACKOFF_MIN_US 50
#define DFUSE_EQ_BACKOFF_MAX_US 5000

Contributor

The effective maximum based on the loop is 3200, not 5000, so that declaration is misleading.
It should be 3200 or 6400.
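
The arithmetic behind this remark can be checked with a small sketch. The loop in the PR is not shown in this thread, so the shape below (keep doubling only while the doubled value still fits under the cap) is an assumption, and `effective_backoff_max_us()` is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest value reachable by repeatedly doubling
 * min_us without ever exceeding cap_us.
 */
static int64_t
effective_backoff_max_us(int64_t min_us, int64_t cap_us)
{
	int64_t t = min_us;

	/* Guard against a non-positive start, which would never grow. */
	if (min_us <= 0)
		return 0;

	/* Sequence from 50: 50, 100, 200, 400, 800, 1600, 3200, 6400, ... */
	while (t * 2 <= cap_us)
		t *= 2;

	return t;
}
```

Under that assumption, a start of 50 with a cap of 5000 tops out at 3200 because the next step would be 6400, while a cap of 3200 or 6400 is reached exactly.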

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong requested a review from mchaarawi July 1, 2026 02:28

@mchaarawi mchaarawi left a comment

Contributor

It would be great to have a regression test for this that checks for busy polling when the server is stalled or even down, but that is probably not easy with CI.
