Skip to content

fix(staged_insert): converge metadata shape with ObjectCodec.encode#1465

Open
dimitri-yatsenko wants to merge 1 commit into
masterfrom
fix/staged-insert-spec-compliance
Open

fix(staged_insert): converge metadata shape with ObjectCodec.encode#1465
dimitri-yatsenko wants to merge 1 commit into
masterfrom
fix/staged-insert-spec-compliance

Conversation

@dimitri-yatsenko

Copy link
Copy Markdown
Member

Summary

Bring `staged_insert1` into compliance with the Staged Insert Specification (datajoint-docs#177). The spec defines a normative metadata-dict shape; today's implementation diverges from it in two ways that are observable in the database.

Slated for the DataJoint 2.3 release.

What changes

1. Metadata shape now matches ObjectCodec.encode exactly.

The same content stored via the two insert paths used to yield different column dicts:

Path Dict shape
ObjectCodec.encode (builtin_codecs/object.py:166-174) {path, store, size, ext, is_dir, item_count, timestamp}
Staged, directory (old) {path, size, hash, ext, is_dir, item_count, timestamp}
Staged, single file (old) {path, size, hash, ext, is_dir, timestamp, mime_type?}

After this PR, _compute_metadata returns the encode shape in both cases:

  • drops hash: None (never carried information)
  • drops mime_type (not in encode shape)
  • file case adds item_count: None
  • both cases add store: <resolved store name>

2. Per-field store resolution.

Today's staged_insert resolves the backend from stores.default regardless of the field's type spec. A field declared <object@local> would write through stores.default, and the store key recorded in the row's metadata column would name the wrong store (or be missing entirely). Now we resolve store_name from attr.store via resolve_dtype() and use that for both path/backend resolution and the metadata's store field — matching what ordinary insert1 does through table.py:1342.

3. Drops the .manifest.json sidecar.

The staged path was writing a per-object manifest file that the encode path doesn't produce. The metadata dict already records total size and item_count; per-file listings are recoverable by walking the canonical directory. Removing the sidecar eliminates a small storage-layout divergence between the two paths.

4. Docstring fix.

Module docstring and the staged_insert1 function docstring both showed staged.rec['raw_data'] = z. Per the spec, the caller does not assign anything to staged.rec[<staged field>] — the framework computes the metadata dict. The implementation already overwrote the user's assignment in _finalize, so this was harmless but misleading. Examples updated.

Test added

tests/integration/test_object.py::TestStagedInsert::test_staged_insert_metadata_shape_matches_encode — exercises both the single-file and directory paths, captures the metadata dict assigned to the row, and asserts:

  • Keys match ObjectCodec.encode's output exactly for equivalent content
  • Per-key values agree (store, ext, is_dir, item_count)

Test plan

  • All 6 staged-insert tests pass (including the new one)
  • All 44 tests in tests/integration/test_object.py pass
  • CI green
  • Re-review by codeowners against the spec

Related

Bring the staged-insert metadata dict into structural equality with the
ordinary insert1 path, per the Staged Insert Specification
(datajoint-docs#177). Without this change the same content stored via
the two paths yields different column dicts:

  ObjectCodec.encode  -> {path, store, size, ext, is_dir, item_count, timestamp}
  staged (directory)  -> {path,        size, hash, ext, is_dir, item_count, timestamp}
  staged (single file)-> {path,        size, hash, ext, is_dir,             timestamp, mime_type?}

The staged path is now the canonical shape. Drops:
- hash: None              (never carried information)
- mime_type (file case)   (not in encode shape)
The file case now also carries item_count: None (matching encode).

Also fix the store_name divergence: staged_insert resolved the backend
from stores.default regardless of the field's type spec. A field declared
<object@local> would write through stores.default — and the store key
recorded in the metadata column would point at the wrong store. Now
resolve store_name from attr.store via resolve_dtype() and use that for
both path/backend resolution and the metadata's store field.

Drop the .manifest.json sidecar that the staged path wrote and the
encode path didn't. The metadata dict already records total size and
item_count; per-file listings are recoverable by walking the canonical
directory if ever needed.

Fix docstrings that showed `staged.rec['raw_data'] = z` — the framework
computes the metadata; the caller does not assign anything to the
staged field on staged.rec.

Add tests/integration/test_object.py::TestStagedInsert::
test_staged_insert_metadata_shape_matches_encode covering both the
single-file and directory cases against ObjectCodec.encode for
equivalent content.

Slated for DataJoint 2.3.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant