DAOS-19195 container: handle container open race #18541
Draft
Nasf-Fan wants to merge 1 commit into
Ticket title is "Aurora MDonSSD with 2.8.0-rc1 dtx_cont_deregister() Assertion '!dtx_cont_opened(cont)' failed"
Force-pushed from d67c60d to 629ab5a
There are potential race conditions among container open, close and destroy in the original container logic. For example:

1. ULT_a is trying to open the container (triggered by some asynchronous IV message, which may be delayed or stale). The related @hdl is created, but ULT_a then blocks on ds_cont_child::sc_open_mutex before increasing the container open counter ds_cont_child::sc_open. At that time, the number of container open handles does not match ds_cont_child::sc_open.

2. ULT_b is trying to close the container. It removes the above @hdl from the handle list but, before decreasing ds_cont_child::sc_open, is also blocked on ds_cont_child::sc_open_mutex. So the assertion check "D_ASSERT(cont_child->sc_open > 0);" in cont_close_hdl() may not be triggered.

3. ULT_c is trying to destroy the container. It calls cont_child_stop() to close all open handles against the container. Because ULT_b has already removed the incomplete handle (from ULT_a), cont_child_stop() moves on to dtx_cont_deregister(). At that point, nobody should have the container open (cont_child->sc_open should be zero). Unfortunately, ULT_a has not increased cont_child->sc_open yet, so the related check "D_ASSERT(!dtx_cont_opened(cont));" fails.

This is just one possible corner case and may not cover all of them. The root issue is that there may be a CPU yield between adding the handle to the container open handle list and increasing the open counter. The solution in this patch extends the range protected by ds_cont_child::sc_open_mutex to cover adding the open handle to the list. Then, at any time, the count of open handles always matches the open counter.

It also cleans up the code a bit.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
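The locking pattern described in the commit message can be sketched in isolation as follows. This is only an illustration, not the DAOS code: struct toy_cont, struct toy_hdl and all toy_* functions are hypothetical stand-ins, and a pthread mutex is used in place of whatever lock type backs ds_cont_child::sc_open_mutex. The close path taking the same lock around both the unlink and the decrement is an assumption made for the sketch; the commit message only states that the open path is covered.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a container open handle. */
struct toy_hdl {
	struct toy_hdl *next;
};

/* Hypothetical stand-in for ds_cont_child. */
struct toy_cont {
	pthread_mutex_t  open_mutex;  /* plays the role of sc_open_mutex */
	int              open_count;  /* plays the role of sc_open */
	struct toy_hdl  *hdl_list;    /* open handle list */
};

void toy_cont_init(struct toy_cont *cont)
{
	pthread_mutex_init(&cont->open_mutex, NULL);
	cont->open_count = 0;
	cont->hdl_list = NULL;
}

/*
 * Open: link the handle and bump the counter inside one critical
 * section, so no other thread can observe the list and the counter
 * out of step.
 */
void toy_cont_open(struct toy_cont *cont, struct toy_hdl *hdl)
{
	pthread_mutex_lock(&cont->open_mutex);
	hdl->next = cont->hdl_list;
	cont->hdl_list = hdl;
	cont->open_count++;
	pthread_mutex_unlock(&cont->open_mutex);
}

/*
 * Close (assumed symmetric): unlink and decrement under the same lock.
 * Returns false if the handle is not on the list.
 */
bool toy_cont_close(struct toy_cont *cont, struct toy_hdl *hdl)
{
	struct toy_hdl **pp;
	bool found = false;

	pthread_mutex_lock(&cont->open_mutex);
	for (pp = &cont->hdl_list; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == hdl) {
			*pp = hdl->next;
			hdl->next = NULL;
			cont->open_count--;
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&cont->open_mutex);
	return found;
}

/* Invariant: the number of linked handles equals the open counter. */
bool toy_cont_consistent(struct toy_cont *cont)
{
	struct toy_hdl *h;
	int n = 0;
	bool ok;

	pthread_mutex_lock(&cont->open_mutex);
	for (h = cont->hdl_list; h != NULL; h = h->next)
		n++;
	ok = (n == cont->open_count);
	pthread_mutex_unlock(&cont->open_mutex);
	return ok;
}
```

With this shape, a destroy path that takes the same lock before checking the counter would see either a fully opened handle (linked and counted) or none at all, which is the property the patch aims for.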
Force-pushed from 629ab5a to 0d3da37