DAOS-17301 dfuse: reduce EQ busy polling #18549
The dfuse progress thread currently polls DAOS EQ with NOWAIT while there are outstanding events. If the server is down or events make no progress, this can turn into busy polling.

Track consecutive empty EQ polls and switch to a small bounded poll timeout after repeated empty polls. The timeout backs off from 50us and is capped at 5ms, while active completion traffic resets the counter and keeps the existing NOWAIT behavior.

This reduces CPU usage during stalled/no-progress periods without changing the normal async submit path.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
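The backoff described above can be sketched roughly as follows. This is an illustration only, not the code in the patch: `struct sketch_eqt`, `sketch_record_poll` and `sketch_poll_timeout_us` are hypothetical names, the three constants and the `de_empty_polls` field are the only parts taken from the diff, and a return value of 0 is assumed to stand for NOWAIT. The sketch clamps the doubled value to the cap, which may differ from how the real loop terminates.

```c
#include <stdint.h>

/* Values taken from the diff under review. */
#define DFUSE_EQ_BACKOFF_START  64
#define DFUSE_EQ_BACKOFF_MIN_US 50
#define DFUSE_EQ_BACKOFF_MAX_US 5000

/* Hypothetical stand-in for the progress-thread state. */
struct sketch_eqt {
	uint32_t de_empty_polls; /* consecutive polls that returned no events */
};

/* Record the outcome of one poll: any completion resets the streak. */
static void
sketch_record_poll(struct sketch_eqt *eqt, int nr_completed)
{
	if (nr_completed > 0)
		eqt->de_empty_polls = 0;
	else if (eqt->de_empty_polls < UINT32_MAX)
		eqt->de_empty_polls++;
}

/* Timeout in microseconds for the next poll; 0 stands for NOWAIT here. */
static int64_t
sketch_poll_timeout_us(const struct sketch_eqt *eqt)
{
	int64_t  timeout = DFUSE_EQ_BACKOFF_MIN_US;
	uint32_t steps;
	uint32_t i;

	/* Below the threshold, keep the existing non-blocking behavior. */
	if (eqt->de_empty_polls < DFUSE_EQ_BACKOFF_START)
		return 0;

	/* Double once per empty poll past the threshold, then clamp. */
	steps = eqt->de_empty_polls - DFUSE_EQ_BACKOFF_START;
	for (i = 0; i < steps && timeout < DFUSE_EQ_BACKOFF_MAX_US; i++)
		timeout *= 2;
	if (timeout > DFUSE_EQ_BACKOFF_MAX_US)
		timeout = DFUSE_EQ_BACKOFF_MAX_US;

	return timeout;
}
```

In this sketch the first 64 empty polls stay non-blocking, so short gaps between completions never pay a timeout; only a sustained run of empty polls starts to sleep.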
Ticket title is 'Enable client reconnect after server reboot'
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18549/1/execution/node/1329/log
eqt->de_empty_polls = 0;
One concern I have here is that the counter is reset to 0 only under this condition.
As I understand it, a stalled event (one that takes a long time to complete) can therefore cause newly added events to inherit the accumulated backoff.
Maybe we should also reset the counter when we add new events to the EQ?
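The suggestion above could look roughly like the following sketch. All names here (`struct sketch_eq_state`, `sketch_on_submit`, `sketch_on_poll`) are hypothetical and do not come from the patch; `empty_polls` plays the role of `de_empty_polls` in the diff.

```c
#include <stdint.h>

/* Hypothetical per-thread state for illustration only. */
struct sketch_eq_state {
	uint32_t empty_polls; /* consecutive empty polls */
	uint32_t outstanding; /* events submitted but not yet completed */
};

/* Called when a new event is queued: drop any inherited backoff. */
static void
sketch_on_submit(struct sketch_eq_state *st)
{
	st->outstanding++;
	st->empty_polls = 0;
}

/* Called after each poll with the number of completed events. */
static void
sketch_on_poll(struct sketch_eq_state *st, uint32_t nr_completed)
{
	if (nr_completed > 0) {
		/* Guard against underflow if the caller over-reports. */
		if (nr_completed > st->outstanding)
			nr_completed = st->outstanding;
		st->outstanding -= nr_completed;
		st->empty_polls = 0;
	} else if (st->empty_polls < UINT32_MAX) {
		st->empty_polls++;
	}
}
```

One possible trade-off, stated as an assumption rather than a measured result: if submissions keep arriving while nothing completes, resetting on every submit would keep the thread in the non-blocking path and give back some of the CPU savings.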
#define DFUSE_EQ_BACKOFF_START 64
#define DFUSE_EQ_BACKOFF_MIN_US 50
#define DFUSE_EQ_BACKOFF_MAX_US 5000
The effective maximum produced by the loop is 3200us, not 5000us, so this definition is misleading.
It should be either 3200 or 6400.
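The arithmetic behind this comment can be checked with a small sketch. It assumes the loop doubles the timeout only while the doubled value still fits under the cap; that loop shape is a guess based on the reviewer's numbers, and `sketch_max_by_doubling` is a hypothetical helper, not a function in dfuse.

```c
#include <stdint.h>

/*
 * Largest value reachable by repeatedly doubling start_us without
 * exceeding cap_us. Comparing against cap_us / 2 avoids overflow.
 */
static uint32_t
sketch_max_by_doubling(uint32_t start_us, uint32_t cap_us)
{
	uint32_t t = start_us;

	/* Doubling zero never progresses, so return early. */
	if (t == 0)
		return 0;

	while (t <= cap_us / 2)
		t *= 2;

	return t;
}
```

Starting from 50us the sequence is 50, 100, 200, 400, 800, 1600, 3200, and the next step (6400) would exceed 5000, so a cap of 5000 is never actually reached under this assumption. A cap of 3200 or 6400 would match a value the sequence actually produces.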
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
mchaarawi left a comment
It would be great to have a regression test for this that checks for busy polling when the server is stalled or even down, but that is probably not easy to do in CI.