feat: add KB mutation locks and atomic state writes#86
Conversation
KylinMountain
left a comment
There was a problem hiding this comment.
Review: KB mutation locks & atomic state writes
The lock + atomic-write primitives are well built (clean reentrancy accounting, correct os.replace pattern, good tests). My concern is mostly where the locks are applied — a couple of high-impact gaps make the protection ineffective or actively disruptive in the most common usage. Findings ranked by severity.
🔴 1. watch holds the exclusive lock for its entire lifetime
watch is decorated @_with_kb_lock(exclusive=True), but watch_directory() blocks indefinitely (while observer.is_alive(): observer.join(...)). So the exclusive lock is held for the whole watch session — every other command, including read-only list/status (shared lock, blocked by the exclusive holder), is locked out until watch is killed.
→ Take the lock per-ingestion inside on_new_files (around each add_single_file), not around the whole command.
🔴 2. chat bypasses the lock protocol entirely
chat has no _with_kb_lock decorator, yet its REPL mutates the KB via /add (add_single_file, chat.py:495/500) and /lint (run_lint, chat.py:752). Those functions hold no internal lock — locking lives only in the CLI decorator. So a /add inside an open chat session races freely against a concurrent openkb add (which did take the exclusive lock): the two ingests interleave and one's concept-page / hashes.json writes silently clobber the other's. The exclusive lock on CLI add is worthless against this path.
→ Either lock the chat mutation operations, or push locking down into add_single_file/run_lint.
🟠 3. register_kb global-config read-modify-write is unguarded (wrong lock altitude)
register_kb does load_global_config() → append → save_global_config() on the shared ~/.config/openkb/global.yaml. The new locks are per-KB (.openkb/ingest.lock), and init/use aren't decorated at all. atomic_write_text makes each write un-torn but does not serialize the load→append→save sequence. Two concurrent init/use in different KBs → both read the same known_kbs, each appends its own path, last writer wins → the other registration (and default_kb) is silently lost. This is exactly the lost-update the PR title implies it prevents, but the lock is scoped to the wrong file. Consider a global-config lock (or pushing the lock inside register_kb).
🟠 4. lint holds the exclusive lock across a multi-minute LLM run
lint is exclusive, but without --fix it's read-only, and run_lint invokes an LLM knowledge-lint agent that can run for minutes — all while holding the write lock. Every list/status from another process blocks for that whole window. Use a shared lock for read-only lint, or take the write lock only around the --fix link rewrite and release before the LLM phase.
🟡 5. atomic_write_* never fsyncs the parent directory after os.replace
The temp file is fsync'd, but the directory entry (the rename) is not. On ext4/xfs a crash right after os.replace can lose the rename while the data survives — a reader after reboot sees the old content even though the call "succeeded". Atomicity holds; durability of the commit doesn't. Add an os.open(parent, O_DIRECTORY) + os.fsync after replace. (atomic_write_binary has the same gap.)
🟡 6. mkstemp silently tightens file permissions to 0600
tempfile.mkstemp creates the temp at 0600 and os.replace carries that onto the destination, ignoring umask. A config.yaml that was 0644 becomes owner-only after the first save_config/save_global_config rewrite, and every new file is 0600 regardless of umask — a behavior change from the old path.open("w"). chmod the temp to match the existing file's mode (or to 0o666 & ~umask) before replace.
🟡 7. Shared read lock is serialized within a process
The per-path RLock is held for the whole lock section, so two threads in the same process can never hold LOCK_SH concurrently — first-acquire from a second thread blocks on the RLock before reaching the OS shared lock. Cross-process readers are fine, so impact is limited today, but the "shared read lock" doesn't deliver shared semantics in-process (relevant if chat/agent ever reads concurrently across threads).
🟡 8. fcntl.flock is silently ineffective on network filesystems
KBs are often kept on NFS / Dropbox / iCloud Drive. flock may be a no-op there; the code records the lock as held and proceeds with no detection, so two hosts can mutate the same KB concurrently and corrupt state. Worth documenting the limitation, or detecting non-local backing stores.
⚪ 9. Dead code + divergent KB-detection predicate
maybe_kb_ingest_lock, maybe_kb_read_lock, and atomic_write_binary are defined but referenced nowhere. The decorator reimplements "lock only if a KB exists" inline using a different predicate (_find_kb_dir(...) is None) than maybe_* ((kb_dir/'.openkb').is_dir()). A future caller reaching for the conveniently-named maybe_* would get different behavior (no cwd-walk / global-default resolution) and could lock the wrong directory. Suggest deleting the unused helpers and keeping one source of truth.
⚪ 10. hashes.json format flip on first write
atomic_write_json appends a trailing newline, so _persist writes {...}\n, but init seeds the file via json.dumps({}) with no newline. The byte format flips on the first add() (spurious diff / checksum mismatch). Route the init seed through atomic_write_json too. (Minor related note: failed writes under SIGKILL/ENOSPC can leave hidden .<name>.*.tmp orphans in .openkb/ that nothing sweeps.)
#1 and #2 are the ones I'd block on — they make the locking ineffective (chat) or actively disruptive (watch) in normal use. The rest are incremental hardening.
Summary
This PR adds cooperative KB-level locking and atomic state/config writes.
It is both a reliability hardening change for the current CLI and a foundation for future concurrent ingestion / compilation work.
Why
Several OpenKB commands mutate shared KB files such as
.openkb/hashes.json,.openkb/config.yaml,wiki/,raw/, and generated reports. If two OpenKB processes modify the sameKB at the same time, or if a process is interrupted while writing state, the KB can be left in a partially written or inconsistent state.
This matters today for commands such as
openkb add,openkb remove,openkb recompile,openkb watch, andopenkb lint --fix.It also matters for planned concurrency improvements. Before OpenKB can safely run more ingestion or compilation work in parallel, it needs a clear synchronization boundary around
KB-level mutations and a safer persistence primitive for state files. This PR introduces that boundary without changing the public CLI interface.
I did not find an existing upstream issue that directly tracks this race condition, so this is a preventive hardening / enabling change.
Changes
openkb.lockswith KB-level advisory locks:addremoverecompilewatchlintliststatusNotes
The lock is advisory and cooperative: it protects OpenKB commands that use these helpers. It does not attempt to make arbitrary external edits to the KB directory safe.
This PR intentionally keeps the scope narrow. It does not introduce parallel ingestion yet; it adds the synchronization and persistence primitives needed before that work can be
done safely.
Testing
tests/test_locks.pycovering lock behavior and atomic write helpers.