feat: add KB mutation locks and atomic state writes by gwokhou · Pull Request #86 · VectifyAI/OpenKB

gwokhou · 2026-06-03T10:41:05Z

Summary

This PR adds cooperative KB-level locking and atomic state/config writes.

It is both a reliability hardening change for the current CLI and a foundation for future concurrent ingestion / compilation work.

Why

Several OpenKB commands mutate shared KB files such as .openkb/hashes.json, .openkb/config.yaml, wiki/, raw/, and generated reports. If two OpenKB processes modify the same
KB at the same time, or if a process is interrupted while writing state, the KB can be left in a partially written or inconsistent state.

This matters today for commands such as openkb add, openkb remove, openkb recompile, openkb watch, and openkb lint --fix.

It also matters for planned concurrency improvements. Before OpenKB can safely run more ingestion or compilation work in parallel, it needs a clear synchronization boundary around
KB-level mutations and a safer persistence primitive for state files. This PR introduces that boundary without changing the public CLI interface.

I did not find an existing upstream issue that directly tracks this race condition, so this is a preventive hardening / enabling change.

Changes

Add openkb.locks with KB-level advisory locks:
- exclusive locks for mutation commands
- shared locks for read-only commands
- reentrant same-thread lock handling
- explicit rejection of read-to-write lock upgrades
Apply exclusive KB locks to mutating CLI commands:
- add
- remove
- recompile
- watch
- lint
Apply shared KB locks to read-only CLI commands:
- list
- status
Add atomic write helpers for text, JSON, and binary writes.
Use atomic writes for:
- KB config writes
- global config writes
- hash registry persistence
Add focused tests for lock reentrancy, read/write upgrade behavior, nested read locks, and atomic writes.

Notes

The lock is advisory and cooperative: it protects OpenKB commands that use these helpers. It does not attempt to make arbitrary external edits to the KB directory safe.

This PR intentionally keeps the scope narrow. It does not introduce parallel ingestion yet; it adds the synchronization and persistence primitives needed before that work can be
done safely.

Testing

Added tests/test_locks.py covering lock behavior and atomic write helpers.

KylinMountain

Review: KB mutation locks & atomic state writes

The lock + atomic-write primitives are well built (clean reentrancy accounting, correct os.replace pattern, good tests). My concern is mostly where the locks are applied — a couple of high-impact gaps make the protection ineffective or actively disruptive in the most common usage. Findings ranked by severity.

🔴 1. `watch` holds the exclusive lock for its entire lifetime

watch is decorated @_with_kb_lock(exclusive=True), but watch_directory() blocks indefinitely (while observer.is_alive(): observer.join(...)). So the exclusive lock is held for the whole watch session — every other command, including read-only list/status (shared lock, blocked by the exclusive holder), is locked out until watch is killed.
→ Take the lock per-ingestion inside on_new_files (around each add_single_file), not around the whole command.

🔴 2. `chat` bypasses the lock protocol entirely

chat has no _with_kb_lock decorator, yet its REPL mutates the KB via /add (add_single_file, chat.py:495/500) and /lint (run_lint, chat.py:752). Those functions hold no internal lock — locking lives only in the CLI decorator. So a /add inside an open chat session races freely against a concurrent openkb add (which did take the exclusive lock): the two ingests interleave and one's concept-page / hashes.json writes silently clobber the other's. The exclusive lock on CLI add is worthless against this path.
→ Either lock the chat mutation operations, or push locking down into add_single_file/run_lint.

🟠 3. `register_kb` global-config read-modify-write is unguarded (wrong lock altitude)

register_kb does load_global_config() → append → save_global_config() on the shared ~/.config/openkb/global.yaml. The new locks are per-KB (.openkb/ingest.lock), and init/use aren't decorated at all. atomic_write_text makes each write un-torn but does not serialize the load→append→save sequence. Two concurrent init/use in different KBs → both read the same known_kbs, each appends its own path, last writer wins → the other registration (and default_kb) is silently lost. This is exactly the lost-update the PR title implies it prevents, but the lock is scoped to the wrong file. Consider a global-config lock (or pushing the lock inside register_kb).

🟠 4. `lint` holds the exclusive lock across a multi-minute LLM run

lint is exclusive, but without --fix it's read-only, and run_lint invokes an LLM knowledge-lint agent that can run for minutes — all while holding the write lock. Every list/status from another process blocks for that whole window. Use a shared lock for read-only lint, or take the write lock only around the --fix link rewrite and release before the LLM phase.

🟡 5. `atomic_write_*` never fsyncs the parent directory after `os.replace`

The temp file is fsync'd, but the directory entry (the rename) is not. On ext4/xfs a crash right after os.replace can lose the rename while the data survives — a reader after reboot sees the old content even though the call "succeeded". Atomicity holds; durability of the commit doesn't. Add an os.open(parent, O_DIRECTORY) + os.fsync after replace. (atomic_write_binary has the same gap.)

🟡 6. `mkstemp` silently tightens file permissions to 0600

tempfile.mkstemp creates the temp at 0600 and os.replace carries that onto the destination, ignoring umask. A config.yaml that was 0644 becomes owner-only after the first save_config/save_global_config rewrite, and every new file is 0600 regardless of umask — a behavior change from the old path.open("w"). chmod the temp to match the existing file's mode (or to 0o666 & ~umask) before replace.

🟡 7. Shared read lock is serialized within a process

The per-path RLock is held for the whole lock section, so two threads in the same process can never hold LOCK_SH concurrently — first-acquire from a second thread blocks on the RLock before reaching the OS shared lock. Cross-process readers are fine, so impact is limited today, but the "shared read lock" doesn't deliver shared semantics in-process (relevant if chat/agent ever reads concurrently across threads).

🟡 8. `fcntl.flock` is silently ineffective on network filesystems

KBs are often kept on NFS / Dropbox / iCloud Drive. flock may be a no-op there; the code records the lock as held and proceeds with no detection, so two hosts can mutate the same KB concurrently and corrupt state. Worth documenting the limitation, or detecting non-local backing stores.

⚪ 9. Dead code + divergent KB-detection predicate

maybe_kb_ingest_lock, maybe_kb_read_lock, and atomic_write_binary are defined but referenced nowhere. The decorator reimplements "lock only if a KB exists" inline using a different predicate (_find_kb_dir(...) is None) than maybe_* ((kb_dir/'.openkb').is_dir()). A future caller reaching for the conveniently-named maybe_* would get different behavior (no cwd-walk / global-default resolution) and could lock the wrong directory. Suggest deleting the unused helpers and keeping one source of truth.

⚪ 10. `hashes.json` format flip on first write

atomic_write_json appends a trailing newline, so _persist writes {...}\n, but init seeds the file via json.dumps({}) with no newline. The byte format flips on the first add() (spurious diff / checksum mismatch). Route the init seed through atomic_write_json too. (Minor related note: failed writes under SIGKILL/ENOSPC can leave hidden .<name>.*.tmp orphans in .openkb/ that nothing sweeps.)

#1 and #2 are the ones I'd block on — they make the locking ineffective (chat) or actively disruptive (watch) in normal use. The rest are incremental hardening.

gwokhou added 3 commits June 3, 2026 18:32

feat: add KB lock primitives

287f48d

chore: make KB state writes atomic

3ff40d8

feat: apply KB locks to CLI commands

5c4cfbc

KylinMountain reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add KB mutation locks and atomic state writes#86

feat: add KB mutation locks and atomic state writes#86
gwokhou wants to merge 3 commits into
VectifyAI:mainfrom
gwokhou:pr/kb-mutation-locking

gwokhou commented Jun 3, 2026

Uh oh!

KylinMountain left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gwokhou commented Jun 3, 2026

Summary

Why

Changes

Notes

Testing

Uh oh!

KylinMountain left a comment

Choose a reason for hiding this comment

Review: KB mutation locks & atomic state writes

🔴 1. watch holds the exclusive lock for its entire lifetime

🔴 2. chat bypasses the lock protocol entirely

🟠 3. register_kb global-config read-modify-write is unguarded (wrong lock altitude)

🟠 4. lint holds the exclusive lock across a multi-minute LLM run

🟡 5. atomic_write_* never fsyncs the parent directory after os.replace

🟡 6. mkstemp silently tightens file permissions to 0600

🟡 7. Shared read lock is serialized within a process

🟡 8. fcntl.flock is silently ineffective on network filesystems

⚪ 9. Dead code + divergent KB-detection predicate

⚪ 10. hashes.json format flip on first write

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🔴 1. `watch` holds the exclusive lock for its entire lifetime

🔴 2. `chat` bypasses the lock protocol entirely

🟠 3. `register_kb` global-config read-modify-write is unguarded (wrong lock altitude)

🟠 4. `lint` holds the exclusive lock across a multi-minute LLM run

🟡 5. `atomic_write_*` never fsyncs the parent directory after `os.replace`

🟡 6. `mkstemp` silently tightens file permissions to 0600

🟡 8. `fcntl.flock` is silently ineffective on network filesystems

⚪ 10. `hashes.json` format flip on first write