DAOS-19148 cart: Fix crt_swim_rank_check SUSPECT leaks #18547

Open
liw wants to merge 1 commit into master from liw/swim-suspect-new-inc

liw commented Jun 26, 2026
We've observed the following events:

05:10:12.442962 swim_member_suspect() 6: member 0 2747111301814681600 is SUSPECT from 2
05:10:28.810595 ds_mgmt_group_update() updated group: 120 -> 122: 10 ranks
05:10:33.344560 swim_member_dead() 6: member 0 2747126810886799360 is DEAD from 6 (self)

PG version 122 includes a rejoined member 0, whose incarnation changed from 2747111301814681600 to 2747126810886799360. Since the suspicion timeout was 20 s, how could the newer incarnation of member 0 be declared DEAD less than 5 s after the PG update? The problem is that crt_swim_rank_check forgets to delete the existing suspicion on member 0. The first event suggests this: it happened about one suspicion timeout before the DEAD event.
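
The timing argument can be checked directly from the three log timestamps. This is only a sanity-check sketch in Python, not DAOS code; the variable names are made up for illustration, and the 20 s value is the suspicion timeout quoted above.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"

# Timestamps copied from the three log lines above.
suspect_at = datetime.strptime("05:10:12.442962", FMT)    # SUSPECT, old incarnation
pg_update_at = datetime.strptime("05:10:28.810595", FMT)  # PG 120 -> 122
dead_at = datetime.strptime("05:10:33.344560", FMT)       # DEAD, new incarnation

SUSPICION_TIMEOUT = timedelta(seconds=20)

since_update = dead_at - pg_update_at   # about 4.5 s: too short for a fresh suspicion
since_suspect = dead_at - suspect_at    # about 20.9 s: one timeout after the old SUSPECT

print(f"DEAD came {since_update.total_seconds():.3f} s after the PG update")
print(f"DEAD came {since_suspect.total_seconds():.3f} s after the SUSPECT event")
```

The DEAD event lines up with the expiry of the suspicion raised against the old incarnation, not with anything that happened after the rejoin.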

Fundamentally, such suspicions (and perhaps other swim_item objects) should be about specific incarnations, not just member IDs. But that might require a larger change. In this patch, we offer a quick fix, leaving the larger change to a future patch.
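
To illustrate the difference between the two approaches, here is a toy Python model. It is not the SWIM implementation in DAOS and not the change made in this patch; `Suspicion`, `expire_by_id`, and `expire_by_incarnation` are hypothetical names, and the deadline is just the SUSPECT time plus the 20 s timeout, expressed in seconds within the minute 05:10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suspicion:
    member_id: int
    incarnation: int   # incarnation that was suspected
    deadline: float    # time at which the suspicion expires


def expire_by_id(suspicions, members, now):
    """Match on member ID only: an expired suspicion hits whatever
    incarnation currently holds that ID."""
    return [(s.member_id, members[s.member_id])
            for s in suspicions
            if s.member_id in members and now >= s.deadline]


def expire_by_incarnation(suspicions, members, now):
    """Match on (member ID, incarnation): a suspicion about an old
    incarnation cannot affect a rejoined member."""
    return [(s.member_id, s.incarnation)
            for s in suspicions
            if now >= s.deadline and members.get(s.member_id) == s.incarnation]


OLD, NEW = 2747111301814681600, 2747126810886799360
suspicions = [Suspicion(0, OLD, deadline=12.44 + 20)]  # leaked across the rejoin
members = {0: NEW}                                     # membership after PG version 122
now = 33.34                                            # time of the DEAD event

stale = expire_by_id(suspicions, members, now)
scoped = expire_by_incarnation(suspicions, members, now)
print(stale)   # the new incarnation is wrongly reported
print(scoped)  # nothing is reported
```

With ID-only matching, the leaked suspicion takes down the new incarnation. Deleting the suspicion on rejoin (the quick fix) or scoping it to an incarnation (the larger change) both avoid that outcome in this model.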

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is '[2.8.0 rc1] Soak harasser: reintegrate failed - DER_OOG(-1019): Out of group or member list'
Status is 'Open'
Labels: 'soak'
https://daosio.atlassian.net/browse/DAOS-19148

Signed-off-by: Li Wei <liwei@hpe.com>
liw force-pushed the liw/swim-suspect-new-inc branch from c123444 to 61d7afb June 26, 2026 03:24
liw marked this pull request as ready for review June 29, 2026 02:55
liw requested review from a team as code owners June 29, 2026 02:55
liw requested review from frostedcmos and jgmoore-or June 29, 2026 02:55

daltonbohning left a comment

ftest LGTM

liw commented Jul 1, 2026

@frostedcmos, @jgmoore-or, ping... (Sorry, but I think this should go into the upcoming 2.8.0.)

@daltonbohning

> @frostedcmos, @jgmoore-or, ping... (Sorry, but I think this should go into the upcoming 2.8.0.)

If you think this should be in 2.8.0 then I would suggest trying to get approval for the ticket ASAP because there is little time left.
