DAOS-19195 container: handle container open race - b28 #18542

Open
Nasf-Fan wants to merge 1 commit into release/2.8 from Nasf-Fan/DAOS-19195_b28
Conversation

@Nasf-Fan Nasf-Fan commented Jun 25, 2026

Contributor

The original container logic has potential race conditions among container open, close, and destroy. For example:

  1. ULT_a tries to open the container (triggered by an asynchronous IV message that may be delayed or stale). The related @hdl is created, but ULT_a then blocks on ds_cont_child::sc_open_mutex before incrementing the container open counter ds_cont_child::sc_open. At this point, the open handle list does not match ds_cont_child::sc_open.

  2. ULT_b tries to close the container and removes the above @hdl from the handle list, but it also blocks on ds_cont_child::sc_open_mutex before decrementing ds_cont_child::sc_open. As a result, the assertion "D_ASSERT(cont_child->sc_open > 0);" in cont_close_hdl() may not be triggered.

  3. ULT_a is woken up on ds_cont_child::sc_open_mutex and increments the open counter. At this point, ds_cont_child::sc_open is non-zero even though the related open handle has already been removed from the open handle list.

  4. ULT_c tries to destroy the container. It calls cont_child_stop() to close all open handles against the container. Because ULT_b has already removed the handle created by ULT_a, cont_child_stop() proceeds to dtx_cont_deregister(). At that point, nobody should have the container open (cont_child->sc_open should be zero), but ULT_a's increment in step 3 breaks that assumption, so the check "D_ASSERT(!dtx_cont_opened(cont));" fails.
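
The interleaving in steps 1 to 4 can be replayed deterministically with a small single-threaded model. This is only an illustrative sketch: `race_model`, its fields, and the helper functions are hypothetical stand-ins, not the DAOS structures or functions, and the handle list is reduced to a simple count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for ds_cont_child (not the DAOS struct). */
struct race_model {
	int nr_hdls; /* number of entries on the open handle list */
	int sc_open; /* open counter, modeled after ds_cont_child::sc_open */
};

/* Open, first half: the handle is added to the list (outside the mutex). */
static void race_open_add_hdl(struct race_model *c)
{
	c->nr_hdls++;
}

/* Open, second half: the counter is incremented once the mutex is acquired. */
static void race_open_inc_counter(struct race_model *c)
{
	c->sc_open++;
}

/* Close, first half: the handle is removed from the list (outside the mutex). */
static void race_close_remove_hdl(struct race_model *c)
{
	assert(c->nr_hdls > 0);
	c->nr_hdls--;
}

/*
 * Replays the interleaving from steps 1-4 and returns true if the destroy
 * path would see an empty handle list together with a non-zero open counter.
 */
bool replay_race(void)
{
	struct race_model c = { 0, 0 };

	race_open_add_hdl(&c);     /* step 1: ULT_a adds @hdl, then blocks on the mutex */
	race_close_remove_hdl(&c); /* step 2: ULT_b removes @hdl, then blocks on the mutex */
	race_open_inc_counter(&c); /* step 3: ULT_a wakes up and increments the counter */

	/* step 4: ULT_c finds no handle left to close, yet the counter is non-zero */
	return c.nr_hdls == 0 && c.sc_open != 0;
}
```

In this model, `replay_race()` returning true corresponds to the state in which the "D_ASSERT(!dtx_cont_opened(cont));" check would fail.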

This is just one possible corner case and may not cover all of them. The root issue is that the ULT may yield the CPU between adding the handle to the container's open handle list and incrementing the open counter. This patch extends the range protected by ds_cont_child::sc_open_mutex to cover adding the open handle to the list, so that the number of open handles always matches the open counter.
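
A minimal sketch of the idea behind the fix, under stated assumptions: a pthread mutex stands in for ds_cont_child::sc_open_mutex, the handle list is reduced to a count, and all `fixed_model*` names are hypothetical rather than taken from the patch. The close path is also placed under the same lock here only to keep the model consistent; this description does not say how the patch treats the close path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model; not the DAOS ds_cont_child definition. */
struct fixed_model {
	pthread_mutex_t open_mutex; /* stands in for sc_open_mutex */
	int             nr_hdls;    /* entries on the open handle list */
	int             sc_open;    /* open counter */
};

static void fixed_model_init(struct fixed_model *c)
{
	pthread_mutex_init(&c->open_mutex, NULL);
	c->nr_hdls = 0;
	c->sc_open = 0;
}

/* Open: list insertion and counter increment share one critical section. */
static void fixed_model_open(struct fixed_model *c)
{
	pthread_mutex_lock(&c->open_mutex);
	c->nr_hdls++;
	c->sc_open++;
	pthread_mutex_unlock(&c->open_mutex);
}

/* Close: removal and decrement also happen together (an assumption of this sketch). */
static bool fixed_model_close(struct fixed_model *c)
{
	bool removed = false;

	pthread_mutex_lock(&c->open_mutex);
	if (c->nr_hdls > 0) {
		assert(c->sc_open > 0);
		c->nr_hdls--;
		c->sc_open--;
		removed = true;
	}
	pthread_mutex_unlock(&c->open_mutex);
	return removed;
}

/* Destroy-time check: no handles on the list and a zero open counter. */
static bool fixed_model_quiesced(struct fixed_model *c)
{
	bool ok;

	pthread_mutex_lock(&c->open_mutex);
	ok = (c->nr_hdls == 0 && c->sc_open == 0);
	pthread_mutex_unlock(&c->open_mutex);
	return ok;
}

/* Open, close, then run the destroy-time check; returns true if it passes. */
bool replay_fixed(void)
{
	struct fixed_model c;
	bool               ok;

	fixed_model_init(&c);
	fixed_model_open(&c);
	ok = fixed_model_close(&c) && fixed_model_quiesced(&c);
	pthread_mutex_destroy(&c.open_mutex);
	return ok;
}
```

Because both fields change only while the lock is held, any observer that takes the lock sees `nr_hdls == sc_open`, which is the invariant the description above relies on.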

It also cleans up the code a bit.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 25, 2026


Ticket title is 'Aurora MDonSSD with 2.8.0-rc1 dtx_cont_deregister() Assertion '!dtx_cont_opened(cont)' failed'
Status is 'In Review'
Labels: '2.8.0rc1,md_on_ssd,request_for_2.8,scrubbed_2.8,test_2.8.0rc'
https://daosio.atlassian.net/browse/DAOS-19195

@Nasf-Fan Nasf-Fan marked this pull request as ready for review June 29, 2026 04:27
@Nasf-Fan Nasf-Fan requested review from a team as code owners June 29, 2026 04:27
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19195_b28 branch from 73563fc to 62bc723 Compare June 30, 2026 05:43
Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19195_b28 branch from 62bc723 to af40f74 Compare June 30, 2026 10:21