DAOS-19195 container: handle container open race#18541

Draft
Nasf-Fan wants to merge 1 commit into master from Nasf-Fan/DAOS-19195

@Nasf-Fan Nasf-Fan commented Jun 25, 2026

Contributor

The original container logic has potential race conditions among container open, close, and destroy. For example:

  1. ULT_a tries to open the container (triggered by an asynchronous IV message, which may be delayed or stale). The related @hdl is created and added to the open handle list, but ULT_a then blocks on ds_cont_child::sc_open_mutex before increasing the container open counter ds_cont_child::sc_open. At that point, the number of open handles does not match ds_cont_child::sc_open.

  2. ULT_b tries to close the container. It removes the above @hdl from the handle list but, before decreasing ds_cont_child::sc_open, it also blocks on ds_cont_child::sc_open_mutex. As a result, the assertion check "D_ASSERT(cont_child->sc_open > 0);" in cont_close_hdl() may not be triggered.

  3. ULT_c tries to destroy the container. It calls cont_child_stop() to close all open handles against the container. Because ULT_b has already removed the incomplete handle (from ULT_a), cont_child_stop() moves on to dtx_cont_deregister(). At that point, nobody should have the container open (cont_child->sc_open should be zero). Unfortunately, ULT_a has not increased cont_child->sc_open yet, so the related check "D_ASSERT(!dtx_cont_opened(cont));" fails.
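
The interleaving in steps 1 and 2 can be replayed deterministically in a small single-threaded model. This is only a sketch: `toy_cont` and the `toy_*` helpers are hypothetical stand-ins, not DAOS code, and they keep just two pieces of state, the number of handles on the list and the open counter. Each half of open and close is a separate function so the yield point between them is explicit. The order in which ULT_a and ULT_b resume at the end is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ds_cont_child: only the state involved in the race. */
struct toy_cont {
	int hdl_count; /* handles currently on the open handle list */
	int sc_open;   /* open counter, modeled after ds_cont_child::sc_open */
};

/* Open and close are each split in two, because a yield can happen in between. */
static void toy_open_link_hdl(struct toy_cont *c)    { c->hdl_count++; }
static void toy_open_bump_count(struct toy_cont *c)  { c->sc_open++; }
static void toy_close_unlink_hdl(struct toy_cont *c) { c->hdl_count--; }
static void toy_close_drop_count(struct toy_cont *c) { c->sc_open--; }

/* The invariant the fix is meant to restore: list length equals the counter. */
static bool toy_in_sync(const struct toy_cont *c)
{
	return c->hdl_count == c->sc_open;
}

/* Replays steps 1 and 2; returns true if list and counter ever disagreed. */
static bool toy_replay_race(struct toy_cont *c)
{
	bool diverged = false;

	toy_open_link_hdl(c);        /* ULT_a: handle linked, then blocks on the mutex */
	diverged |= !toy_in_sync(c); /* one handle on the list, counter still 0 */

	toy_close_unlink_hdl(c);     /* ULT_b: handle unlinked, then blocks on the mutex */
	/* The list is empty here, so a destroy path walking it finds nothing to close. */

	toy_open_bump_count(c);      /* ULT_a resumes (assumed order) */
	diverged |= !toy_in_sync(c); /* no handles on the list, counter is 1 */

	toy_close_drop_count(c);     /* ULT_b resumes */
	return diverged;
}
```

Calling `toy_replay_race()` on a zero-initialized `struct toy_cont` reports a divergence even though both fields end at zero, which is why the mismatch is only visible to a ULT that runs in the middle of the sequence.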

This is just one possible corner case, and there may be others. The root issue is that a CPU yield may occur between adding the handle to the container open handle list and increasing the open counter. This patch extends the range protected by ds_cont_child::sc_open_mutex to cover adding the open handle to the container open handle list, so the number of open handles always matches the open counter.
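
A minimal sketch of the locking pattern described above, not the actual patch: `locked_cont` and its helpers are hypothetical names, and a POSIX mutex stands in for the ULT-level ds_cont_child::sc_open_mutex. The list update and the counter update sit in one critical section, so no other thread can observe one without the other. Covering the close path the same way is an assumption of this sketch; the description only mentions the open path.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model: a handle count stands in for the open handle list. */
struct locked_cont {
	pthread_mutex_t open_mutex; /* stand-in for sc_open_mutex */
	int             hdl_count;  /* handles on the open handle list */
	int             sc_open;    /* open counter */
};

static void locked_cont_init(struct locked_cont *c)
{
	pthread_mutex_init(&c->open_mutex, NULL);
	c->hdl_count = 0;
	c->sc_open   = 0;
}

/* Link the handle and bump the counter inside the same critical section. */
static void locked_open(struct locked_cont *c)
{
	pthread_mutex_lock(&c->open_mutex);
	c->hdl_count++;
	c->sc_open++;
	pthread_mutex_unlock(&c->open_mutex);
}

/* Assumed symmetric close: unlink and decrement under the same mutex. */
static void locked_close(struct locked_cont *c)
{
	pthread_mutex_lock(&c->open_mutex);
	assert(c->sc_open > 0);
	c->hdl_count--;
	c->sc_open--;
	pthread_mutex_unlock(&c->open_mutex);
}

/* Returns nonzero when the list length matches the open counter. */
static int locked_in_sync(struct locked_cont *c)
{
	int ok;

	pthread_mutex_lock(&c->open_mutex);
	ok = (c->hdl_count == c->sc_open);
	pthread_mutex_unlock(&c->open_mutex);
	return ok;
}
```

With this shape, any thread that takes the mutex, as `locked_in_sync()` does, sees the two values agree, because neither `locked_open()` nor `locked_close()` releases the mutex between its two updates.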

It also cleans up the code a bit.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 25, 2026

Ticket title is 'Aurora MDonSSD with 2.8.0-rc1 dtx_cont_deregister() Assertion '!dtx_cont_opened(cont)' failed'
Status is 'In Progress'
Labels: '2.8.0rc1,md_on_ssd,request_for_2.8,test_2.8.0rc'
https://daosio.atlassian.net/browse/DAOS-19195

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19195 branch 2 times, most recently from d67c60d to 629ab5a June 25, 2026 04:57
There are potential race conditions among container open, close and
destroy in the original container logic. For example:

1. ULT_a tries to open the container (triggered by an asynchronous IV
   message, which may be delayed or stale). The related @hdl is
   created and added to the open handle list, but ULT_a then blocks
   on ds_cont_child::sc_open_mutex before increasing the container
   open counter ds_cont_child::sc_open. At that point, the number of
   open handles does not match ds_cont_child::sc_open.

2. ULT_b tries to close the container. It removes the above @hdl from
   the handle list but, before decreasing ds_cont_child::sc_open, it
   also blocks on ds_cont_child::sc_open_mutex. As a result, the
   assertion check "D_ASSERT(cont_child->sc_open > 0);" in
   cont_close_hdl() may not be triggered.

3. ULT_c tries to destroy the container. It calls cont_child_stop()
   to close all open handles against the container. Because ULT_b
   has already removed the incomplete handle (from ULT_a),
   cont_child_stop() moves on to dtx_cont_deregister(). At that
   point, nobody should have the container open (cont_child->sc_open
   should be zero). Unfortunately, ULT_a has not increased
   cont_child->sc_open yet, so the related check
   "D_ASSERT(!dtx_cont_opened(cont));" fails.

This is just one possible corner case, and there may be others. The
root issue is that a CPU yield may occur between adding the handle to
the container open handle list and increasing the open counter. This
patch extends the range protected by ds_cont_child::sc_open_mutex to
cover adding the open handle to the list, so the number of open
handles always matches the open counter.

It also cleans up the code a bit.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19195 branch from 629ab5a to 0d3da37 June 25, 2026 05:00