DAOS-19148 cart: Fix crt_swim_rank_check SUSPECT leaks #18547
Open
liw wants to merge 1 commit into
Ticket title is '[2.8.0 rc1] Soak harasser: reintegrate failed - DER_OOG(-1019): Out of group or member list'
We've observed the following events:
05:10:12.442962 swim_member_suspect() 6: member 0 2747111301814681600 is SUSPECT from 2
05:10:28.810595 ds_mgmt_group_update() updated group: 120 -> 122: 10 ranks
05:10:33.344560 swim_member_dead() 6: member 0 2747126810886799360 is DEAD from 6 (self)
PG version 122 includes a rejoined member 0, whose incarnation changed
from 2747111301814681600 to 2747126810886799360. Since the suspicion
timeout was 20 s, how could there be a DEAD event for the newer
incarnation of member 0 less than 5 s after the PG update? The
problem is that crt_swim_rank_check does not delete the existing
suspicion on member 0, as suggested by the first event, which happened
about 20.9 s (roughly one suspicion timeout) before the DEAD event.
Fundamentally, such suspicions (and perhaps other swim_item objects)
should be about specific incarnations, not just member IDs. But that
might require a larger change. In this patch, we offer a quick fix,
leaving the larger change to a future patch.
Signed-off-by: Li Wei <liwei@hpe.com>
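The failure mode described above can be illustrated with a small, self-contained C sketch. It is not the CaRT code and not the patch itself: `toy_member`, `toy_suspect`, `toy_rank_check`, `toy_tick`, and `toy_replay` are hypothetical names, and the model assumes a suspicion is stored per member ID with a 20 s deadline. Timestamps are the logged times expressed as microseconds past 05:10:00.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_MEMBERS        8
#define SUSPECT_TIMEOUT_US 20000000ULL /* 20 s, as stated in the report */

/* Hypothetical stand-in; not the real CaRT/SWIM structures. */
struct toy_member {
	uint64_t incarnation;
	bool     dead;
	bool     suspected;   /* suspicion keyed by member ID only */
	uint64_t deadline_us; /* when the suspicion turns into DEAD */
};

static struct toy_member members[MAX_MEMBERS];

/* Record a suspicion on a member, expiring one timeout from now. */
static void toy_suspect(int id, uint64_t now_us)
{
	assert(id >= 0 && id < MAX_MEMBERS);
	members[id].suspected = true;
	members[id].deadline_us = now_us + SUSPECT_TIMEOUT_US;
}

/* Group update: adopt a new incarnation; optionally drop the stale suspicion. */
static void toy_rank_check(int id, uint64_t incarnation, bool clear_suspicion)
{
	assert(id >= 0 && id < MAX_MEMBERS);
	if (members[id].incarnation == incarnation)
		return;
	members[id].incarnation = incarnation;
	members[id].dead = false;
	if (clear_suspicion)
		members[id].suspected = false;
}

/* Expire every suspicion whose deadline has passed. */
static void toy_tick(uint64_t now_us)
{
	for (int id = 0; id < MAX_MEMBERS; id++) {
		if (members[id].suspected && now_us >= members[id].deadline_us) {
			members[id].suspected = false;
			members[id].dead = true;
		}
	}
}

/* Replay the three logged events; returns whether member 0 ends up DEAD. */
bool toy_replay(bool clear_suspicion)
{
	memset(members, 0, sizeof(members));
	members[0].incarnation = 2747111301814681600ULL;
	toy_suspect(0, 12442962ULL);                                /* 05:10:12.442962 */
	toy_rank_check(0, 2747126810886799360ULL, clear_suspicion); /* 05:10:28.810595 */
	toy_tick(33344560ULL);                                      /* 05:10:33.344560 */
	return members[0].dead;
}
```

In this model, `toy_replay(false)` leaves the old suspicion in place, so it expires against the new incarnation and member 0 is declared DEAD; `toy_replay(true)` drops the suspicion when the incarnation changes and member 0 stays alive. Keying suspicions by incarnation rather than by member ID, as the message suggests for a future patch, would make the stale entry harmless without an explicit delete.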
c123444 to 61d7afb
Contributor
Author
@frostedcmos, @jgmoore-or, ping... (Sorry, but I think this should go into the upcoming 2.8.0.)
Contributor
If you think this should be in 2.8.0 then I would suggest trying to get approval for the ticket ASAP because there is little time left.