DAOS-17321 ddb: Add checksum dump command to ddb C API #18543

Open: knard38 wants to merge 4 commits into master from ckochhof/dev/master/daos-17321/patch-003

@knard38 knard38 commented Jun 25, 2026

Contributor

Context

Third patch in the DAOS-17321 series:

  1. [DAOS-17321 ddb: Add checksum dump function to ddb vos API #18293] Added VOS_OF_FETCH_CSUM to vos_fetch_begin() to retrieve per-extent checksum metadata without fetching data, with unit tests VOS400.1, VOS401.1, VOS401.2.
  2. [DAOS-17321 ddb: Add checksum dump function to ddb VOS API #18444] Added dv_dump_csum() to the ddb VOS API: fetches checksum metadata via VOS_OF_FETCH_CSUM and delivers it to the caller through a dv_dump_csum_cb callback.
  3. This patch — adds the ddb C command API entry point on top of dv_dump_csum(), extends the VOS layer to expose the actual stored epoch of a fetched single value, and provides comprehensive unit and integration tests.

Changes

ddb_run_csum_dump() — new ddb C API command (ddb.h, ddb_commands.c)

ddb_run_csum_dump(struct ddb_ctx *ctx, struct csum_dump_options *opt) is the top-level command function. The option struct exposes three fields:

  • path — VOS tree path to the akey (required)
  • epoch — epoch for the fetch (DAOS_EPOCH_MAX for the latest visible version)
  • dst — optional output file; if set, raw checksum bytes are written to disk rather than printed

The function resolves the path and dispatches to one of four internal callbacks depending on the akey type (single-value vs array) and the presence of dst:

Akey type      Print to terminal    Write to file
Single value   print_csum_sv        write_file_csum_sv
Array          print_csum_recx      write_file_csum_recx
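
The dispatch in the table above can be sketched as a small selection function. This is only an illustration: the enum names and `select_csum_handler()` are hypothetical and do not appear in the patch, and treating a NULL `dst` as "no output file" is an assumption about how the option is represented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real ddb types are not shown in this description. */
enum akey_kind { AKEY_SINGLE_VALUE, AKEY_ARRAY };

/* One value per callback named in the table. */
enum csum_handler {
	PRINT_CSUM_SV,
	WRITE_FILE_CSUM_SV,
	PRINT_CSUM_RECX,
	WRITE_FILE_CSUM_RECX
};

/* Pick a handler from the akey type and the presence of an output file.
 * Assumption: dst == NULL means "print to terminal". */
static enum csum_handler
select_csum_handler(enum akey_kind kind, const char *dst)
{
	int to_file = (dst != NULL);

	assert(kind == AKEY_SINGLE_VALUE || kind == AKEY_ARRAY);
	if (kind == AKEY_SINGLE_VALUE)
		return to_file ? WRITE_FILE_CSUM_SV : PRINT_CSUM_SV;
	return to_file ? WRITE_FILE_CSUM_RECX : PRINT_CSUM_RECX;
}
```

With these stand-ins, a single-value akey with no `dst` selects `PRINT_CSUM_SV`, and an array akey with a `dst` selects `WRITE_FILE_CSUM_RECX`.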

Single-value output — type, length, actual stored epoch, and checksum value:

Type: crc64, Length: 8, Epoch: 42, Value: 0xdeadbeef01020304

Array output — per extent: index range, record size, epoch (from the recx/epoch list), and one checksum value per chunk:

Checksum Type: crc16, Checksum Length: 2, Chunk Size: 64, Record Extent(s):
- Record Indexes: {0-127}, Record Size: 1, Epoch: 1, Checksum Value(s): 0x1234, 0x5678
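
For readers who want to reproduce the single-value line format shown above, here is a minimal sketch. `format_csum_sv()` is a hypothetical helper, not the patch's `print_csum_sv`, and it assumes the checksum bytes are printed in stored order as two lowercase hex digits each.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one single-value checksum line into buf.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small.
 */
static int
format_csum_sv(char *buf, size_t buf_len, const char *type,
	       const uint8_t *csum, size_t csum_len, uint64_t epoch)
{
	int    n;
	size_t off;
	size_t i;

	assert(buf != NULL && type != NULL && csum != NULL);

	n = snprintf(buf, buf_len,
		     "Type: %s, Length: %zu, Epoch: %" PRIu64 ", Value: 0x",
		     type, csum_len, epoch);
	if (n < 0 || (size_t)n >= buf_len)
		return -1;
	off = (size_t)n;

	for (i = 0; i < csum_len; i++) {
		/* Each byte needs two hex digits plus room for the NUL. */
		if (buf_len - off < 3)
			return -1;
		snprintf(buf + off, buf_len - off, "%02x", csum[i]);
		off += 2;
	}
	return (int)off;
}
```

Called with type `"crc64"`, epoch 42, and the eight bytes `de ad be ef 01 02 03 04`, this produces the same line as the single-value example above.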

Updated dv_dump_csum_cb callback (ddb_vos.h, ddb_vos.c)

The callback signature now uses two typed parameters that clearly distinguish the per-akey-type context:

typedef int (*dv_dump_csum_cb)(void *cb_arg,
                               struct daos_recx_ep_list *recx_rel,
                               daos_epoch_t              sv_epoch,
                               struct dcs_ci_list       *cil);

  • Array akeys: recx_rel non-NULL (the existing recx/epoch list), sv_epoch 0.
  • Single-value akeys: recx_rel NULL, sv_epoch the actual stored epoch (see below).
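
A callback matching this signature could tell the two cases apart as sketched below. The type declarations are placeholders so the sketch compiles on its own; the real definitions come from the DAOS headers. `struct csum_cb_result` and `example_dump_csum_cb()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder declarations; the real DAOS types have more fields. */
typedef uint64_t daos_epoch_t;
struct daos_recx_ep_list { int placeholder; };
struct dcs_ci_list;

/* Hypothetical summary filled in by the callback through cb_arg. */
struct csum_cb_result {
	int          is_array;
	daos_epoch_t epoch;
};

static int
example_dump_csum_cb(void *cb_arg, struct daos_recx_ep_list *recx_rel,
		     daos_epoch_t sv_epoch, struct dcs_ci_list *cil)
{
	struct csum_cb_result *res = cb_arg;

	(void)cil; /* checksum list is not inspected in this sketch */
	assert(res != NULL);

	if (recx_rel != NULL) {
		/* Array akey: epochs come from the recx/epoch list,
		 * so sv_epoch is expected to be 0. */
		if (sv_epoch != 0)
			return -1;
		res->is_array = 1;
		res->epoch    = 0;
	} else {
		/* Single-value akey: sv_epoch is the stored epoch. */
		res->is_array = 0;
		res->epoch    = sv_epoch;
	}
	return 0;
}
```

Rejecting a non-zero `sv_epoch` for array akeys mirrors the `sv_epoch == 0` assertion that the PR describes for the RECX test callbacks; a production callback might choose to ignore the value instead.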

VOS: actual stored SV epoch in the fetch handle (vos_io.c, vos.h)

vos_fetch_begin() with VOS_OF_FETCH_CSUM now records the epoch of the single value found during the B-tree walk, so the caller knows which version was retrieved.

  • ic_sv_epoch (daos_epoch_t) added to vos_io_context, zero-initialized by calloc. It is set in akey_fetch_single() when a real SV is found within the valid epoch range. Holes, DER_NONEXIST, and uncertainty violations leave it at 0, consistent with the DAOS convention that 0 is the "not set" epoch sentinel. For array akeys, akey_fetch_single() is never entered, so ic_sv_epoch stays 0.
  • vos_ioh2sv_epoch(daos_handle_t ioh) public accessor added to vos.h/vos_io.c.

Tests

VOS layer (vts_io.c):

  • VOS400.1 (pre-existing, extended): write epoch changed to 42 so the asserted value is unambiguous; assert_int_equal(vos_ioh2sv_epoch(ioh), 42) added alongside the existing csum data checks.
  • VOS400.2 (new): "vos_fetch_begin records the stored SV epoch in the fetch handle" — two SV versions at epochs 10 and 20, four fetch scenarios: DAOS_EPOCH_MAX → 20, epoch 15 → 10 (LE probe), epoch 10 → 10 (exact match), epoch 9 → 0 (key not yet visible).
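
The four VOS400.2 scenarios follow from a "greatest stored epoch less than or equal to the fetch epoch" rule. The toy model below illustrates that rule only. It is not the VOS B-tree probe, it ignores holes and uncertainty checks, and `toy_visible_sv_epoch()` and `TOY_EPOCH_MAX` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EPOCH_MAX UINT64_MAX /* stand-in for DAOS_EPOCH_MAX */

/*
 * Return the greatest stored epoch <= fetch_epoch,
 * or 0 (the "not set" sentinel) if no version is visible.
 */
static uint64_t
toy_visible_sv_epoch(const uint64_t *stored, size_t nr, uint64_t fetch_epoch)
{
	uint64_t best = 0;
	size_t   i;

	assert(stored != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (stored[i] <= fetch_epoch && stored[i] > best)
			best = stored[i];
	}
	return best;
}
```

With versions stored at epochs 10 and 20, this model returns 20 for `TOY_EPOCH_MAX`, 10 for epochs 15 and 10, and 0 for epoch 9, matching the values listed for the test.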

ddb VOS API (ddb_vos_tests.c): all five dv_dump_csum_cb test callbacks updated to the new signature; SV callbacks assert the expected sv_epoch values (1 and 2); RECX callbacks assert sv_epoch == 0.

ddb C API (ddb_commands_tests.c): five new tests under a dedicated csum suite with its own VOS pool setup/teardown:

  • csum_dump_error_tests — invalid path, incomplete path, invalid container
  • print_csum_sv_tests — no-csum case, EPOCH_MAX (epoch 2), and epoch 1; checks epoch value, csum type, and csum bytes
  • write_csum_sv_tests — "Dumping checksum" log line and raw bytes written to a mock file
  • print_csum_recx_tests — multi-extent output with per-extent epoch and csum values
  • write_csum_recx_tests — multi-extent file write

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@knard38 knard38 self-assigned this Jun 25, 2026
@knard38 knard38 added the CR Catastrophic Recovery Feature label Jun 25, 2026
@github-actions

Ticket title is 'Checksum management with ddb'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-17321

Add ddb_run_csum_dump() to the ddb C API to dump checksum information
for a given VOS path.

Extend vos_fetch_begin() to expose the actual stored epoch of the single
value found during a VOS_OF_FETCH_CSUM fetch:
- Add ic_sv_epoch to vos_io_context, populated in akey_fetch_single()
  when a real SV is found within the valid epoch range (0 otherwise).
- Add vos_ioh2sv_epoch() public accessor.

The dv_dump_csum_cb callback takes two typed parameters:
  struct daos_recx_ep_list *recx_rel  — non-NULL for array akeys
  daos_epoch_t              sv_epoch  — actual stored epoch for SV akeys

The print_csum_sv / write_file_csum_sv functions display the stored
epoch alongside the checksum type, length, and value.

Tests:
- VOS400.2: vos_fetch_begin records the stored SV epoch in the fetch
  handle — four sub-cases: EPOCH_MAX, LE probe, exact match, not found.
- Updated ddb VOS-level callbacks to assert sv_epoch values.
- Updated ddb command-level tests to verify epoch in printed output.

Features: recovery
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
@knard38 knard38 force-pushed the ckochhof/dev/master/daos-17321/patch-003 branch from 25cd7f6 to edd5452 Compare June 25, 2026 09:28
@knard38 knard38 marked this pull request as ready for review June 25, 2026 09:28
@knard38 knard38 requested review from a team as code owners June 25, 2026 09:28
@knard38 knard38 requested review from Nasf-Fan and NiuYawei June 25, 2026 09:29
@daosbuild3

Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18543/3/execution/node/1446/log

@daosbuild3

Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18543/3/execution/node/1436/log

@daosbuild3

Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18543/4/execution/node/1641/log

@daosbuild3

Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18543/4/execution/node/1600/log

@daosbuild3

Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18543/4/execution/node/1690/log

Labels

CR Catastrophic Recovery Feature

4 participants