Duplicate entry hardening#2096
Conversation
|
/submit |
|
Submitted as pull.2096.git.1776731171.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
This branch is now known as |
|
This patch series was integrated into seen via git@5ceb988. |
|
This patch series was integrated into seen via git@b2979e0. |
|
This patch series was integrated into seen via git@5a8e1e0. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Needs review. source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@62d334b. |
|
This patch series was integrated into seen via git@6099d30. |
|
This patch series was integrated into seen via git@1d317d6. |
|
This patch series was integrated into seen via git@a1e3c69. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Needs review. source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@b7807b6. |
|
This patch series was integrated into seen via git@8cb15cc. |
|
This patch series was integrated into seen via git@a090bdb. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Needs review. source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@64f0aa6. |
|
This patch series was integrated into seen via git@1300a5a. |
|
This patch series was integrated into seen via git@b3f3685. |
|
This patch series was integrated into seen via git@fc35baf. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Needs review. source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@5edc804. |
|
This patch series was integrated into seen via git@a510b62. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Needs review. source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@3924c8d. |
|
This patch series was integrated into seen via git@4944388. |
|
This patch series was integrated into seen via git@c7d9b18. |
|
This patch series was integrated into seen via git@798f6ed. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Needs review. source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> We had some corrupt trees with duplicate entries in real world repositories,
> which triggered an assertion failure in merge-ort. Further, the corrupt tree
> creation in the third party tool would have been avoided had verify_cache()
> correctly checked for D/F conflicts. Provide fixes for both issues,
> including 3 preparatory changes for the merge-ort fix.
>
> Elijah Newren (5):
> merge-ort: propagate callback errors from traverse_trees_wrapper()
> merge-ort: drop unnecessary show_all_errors from collect_merge_info()
> merge-ort: free diff pairs queue in clear_or_reinit_internal_opts()
> merge-ort: abort merge when trees have duplicate entries
> cache-tree: fix verify_cache() to catch non-adjacent D/F conflicts
This is a fix to an important corner of our system, but somehow left
in "Needs review" state for much longer than I would have liked, so
even though I am officially on vacation ;-), I took some time to
read these through (by the way it was a pleasant read, thank you).
I wonder if we create a rule like
Those of you who have more than 30 commits in our project are
expected to review one topic (or more) from other contributors
for every three patches you send and ask for reviews by others.
it would help balance the patch vs review ratio, perhaps?
|
|
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Mon, Jun 01, 2026 at 09:33:10PM +0900, Junio C Hamano wrote:
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > We had some corrupt trees with duplicate entries in real world repositories,
> > which triggered an assertion failure in merge-ort. Further, the corrupt tree
> > creation in the third party tool would have been avoided had verify_cache()
> > correctly checked for D/F conflicts. Provide fixes for both issues,
> > including 3 preparatory changes for the merge-ort fix.
> >
> > Elijah Newren (5):
> > merge-ort: propagate callback errors from traverse_trees_wrapper()
> > merge-ort: drop unnecessary show_all_errors from collect_merge_info()
> > merge-ort: free diff pairs queue in clear_or_reinit_internal_opts()
> > merge-ort: abort merge when trees have duplicate entries
> > cache-tree: fix verify_cache() to catch non-adjacent D/F conflicts
>
> This is a fix to an important corner of our system, but somehow left
> in "Needs review" state for much longer than I would have liked, so
> even though I am officially on vacation ;-), I took some time to
> read these through (by the way it was a pleasant read, thank you).
Honestly, I always shy away from the merge-related subsystems. It has a
lot of subtleties that I don't have any experience with, so I never
really consider my input to be helpful here.
> I wonder if we create a rule like
>
> Those of you who have more than 30 commits in our project are
> expected to review one topic (or more) from other contributors
> for every three patches you send and ask for reviews by others.
Heh, that would make me condense patch series into fewer patches ;)
> it would help balance the patch vs review ratio, perhaps?
It's a good question. I typically try to aim for reviewing series on the
mailing list at least every second day, and I always encourage other
folks in my team to do the same. But recently I (well, rather we)
haven't really been able to due to the current situation at GitLab,
which forces us to put almost all of our focus towards a different
project for a while.
Overall I agree that everyone who is a core contributor should also make
reviews part of their regular worflow. At least for corporate
contributors that might also make it easier to communicate this to their
respective employers. Regardless of that, my expectation is that there
will be times where it works well, and other times where it works less
well.
Patrick |
|
User |
|
This patch series was integrated into seen via git@db3b688. |
|
This patch series was integrated into seen via git@1587470. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Waiting for response(s) to review comment(s). cf. <xmqqldcy4f07.fsf@gitster.g> source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series was integrated into seen via git@86c54bb. |
|
This patch series was integrated into seen via git@ecefd6e. |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Waiting for response(s) to review comment(s). cf. <xmqqldcy4f07.fsf@gitster.g> source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Waiting for response(s) to review comment(s). cf. <xmqqldcy4f07.fsf@gitster.g> source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Waiting for response(s) to review comment(s). cf. <xmqqldcy4f07.fsf@gitster.g> source: <pull.2096.git.1776731171.gitgitgadget@gmail.com> |
|
This patch series is no longer integrated into seen. |
|
This patch series was integrated into seen via git@3565a90. |
|
Christian Couder wrote on the Git mailing list (how to reply to this email): On Tue, Jun 2, 2026 at 8:16 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jun 01, 2026 at 09:33:10PM +0900, Junio C Hamano wrote:
> > This is a fix to an important corner of our system, but somehow left
> > in "Needs review" state for much longer than I would have liked, so
> > even though I am officially on vacation ;-), I took some time to
> > read these through (by the way it was a pleasant read, thank you).
>
> Honestly, I always shy away from the merge-related subsystems. It has a
> lot of subtleties that I don't have any experience with, so I never
> really consider my input to be helpful here.
>
> > I wonder if we create a rule like
> >
> > Those of you who have more than 30 commits in our project are
> > expected to review one topic (or more) from other contributors
> > for every three patches you send and ask for reviews by others.
>
> Heh, that would make me condense patch series into fewer patches ;)
>
> > it would help balance the patch vs review ratio, perhaps?
>
> It's a good question. I typically try to aim for reviewing series on the
> mailing list at least every second day, and I always encourage other
> folks in my team to do the same. But recently I (well, rather we)
> haven't really been able to due to the current situation at GitLab,
> which forces us to put almost all of our focus towards a different
> project for a while.
>
> Overall I agree that everyone who is a core contributor should also make
> reviews part of their regular worflow. At least for corporate
> contributors that might also make it easier to communicate this to their
> respective employers. Regardless of that, my expectation is that there
> will be times where it works well, and other times where it works less
> well.
Sashiko (https://github.com/sashiko-dev/sashiko) is used these days by
Linux kernel developers and seems to work well for them.
At GitLab and probably in other companies, some of us also use AI to
review our work before sending it to the mailing list. And yeah, it
helps find issues before our patches reach the mailing list.
In the same way as we require that patches must pass CI, do we want to
require that patches "pass" an AI review before they get accepted?
The benefit would be that it would hopefully catch a lot of trivial
things like indentation, typos/grammos, etc, and a lot of things a bit
more difficult to spot like memory issues. Perhaps with some amount of
prompting/configuration (for example pointing it at our
CodingGuidelines and SubmittingPatches) it could also catch issues
like style issues, commits that do too many things, refactoring
opportunities, etc.
We would likely still require at least one human review (by someone
who is not the maintainer) to validate architectural decisions, to
make sure it goes in the same direction as other efforts, and perhaps
also to make sure that AI suggestions were properly handled by the
patch author.
If we decide to require it, then there are a lot of questions that we
will have to answer.
Do we want to have our own system somehow managed by us or would we be
happy to use existing systems already in place in some companies as
long as we can still tweak them in some ways, like the current CI
systems we use?
If we use existing systems likely at GitLab and GitHub, it might be
more difficult to get coherent results as they might use different
LLMs, but maybe it could help tighten our docs to make sure everyone
is aligned, and we could get better reviews by using multiple systems
because an LLM might find an issue that the other LLM missed.
Do we want an AI review right after a patch is posted or only if there
is no human review in the next X days?
Also what if the AI makes a long concrete suggestion to improve on the
patches? Could that be incompatible with our AI policy to apply it?
Should we try to prevent the AI from making such a suggestion in the
first place?
I haven't looked at how Sashiko is used for the kernel, but maybe
there will need to be some kinds of restrictions/authentications to
avoid potential abuse. |
|
User |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): Christian Couder <christian.couder@gmail.com> writes:
> On Tue, Jun 2, 2026 at 8:16 AM Patrick Steinhardt <ps@pks.im> wrote:
>>
>> Overall I agree that everyone who is a core contributor should also make
>> reviews part of their regular worflow. At least for corporate
>> contributors that might also make it easier to communicate this to their
>> respective employers. Regardless of that, my expectation is that there
>> will be times where it works well, and other times where it works less
>> well.
>
> Sashiko (https://github.com/sashiko-dev/sashiko) is used these days by
> Linux kernel developers and seems to work well for them.
>
> At GitLab and probably in other companies, some of us also use AI to
> review our work before sending it to the mailing list. And yeah, it
> helps find issues before our patches reach the mailing list.
>
> In the same way as we require that patches must pass CI, do we want to
> require that patches "pass" an AI review before they get accepted?
I do not think so. You (figuratively, not limited to Christian
Couder) are welcome to use whatever tool available to you to help
you polish your submission, and the higher quality your patches are
(e.g., fewer typos and jumps in logic flow that interferes the
thought process of human reviewers), the more helpful you are being
to the community. The use of GitHub PR initiated CI run falls into
the same category, I think, in that we do not require you to have an
account and trigger the CI there, but you are doing a good service
if you made sure you caught breakages on macOS you do not have
access to otherwise before sending your patches to the list.
But I do not think we should require you to bring your own token
budget to be able to contribute.
> The benefit would be that it would hopefully catch a lot of trivial
> things like indentation, typos/grammos, etc, and a lot of things a bit
> more difficult to spot like memory issues. Perhaps with some amount of
> prompting/configuration (for example pointing it at our
> CodingGuidelines and SubmittingPatches) it could also catch issues
> like style issues, commits that do too many things, refactoring
> opportunities, etc.
Yes.
Similarly, you are welcome to use tools including AI tools to help
you review others' patches, or help sanity check your reviews of
others' patches before you send them out. The reason why such an
effort is valuable to the community is the same.
But I personally consider that the use of the tools (not limited to
AI tools) is up to each developer. What counts a lot more is the
quality of the output. Just like PR driven CI at GitHub is offered
to everybody who wants to participate and is willing to have an
account there, it may help those aspiring developers if automated
review services are made easily available, but it is a different
story to _require_ use of such service.
|
traverse_trees_wrapper() saves entries from a first pass through traverse_trees() and then replays them through the real callback (collect_merge_info_callback). However, the replay loop silently discards the callback return value. This is not a deferred error; it is an ignored error. Today the only originator of a negative return in this entire call graph is traverse_trees()'s "exceeded maximum allowed tree depth" check; everything else (collect_merge_info_callback, traverse_trees_wrapper, the inner traverse_trees recursion) only relays that. So in current Git, the visible effect of dropping the replay callback's return value is narrow but bad: a tree nested past core.maxTreeDepth has its -1 swallowed, the subtree below the limit is silently pruned, and the merge completes as if that were the correct result. A later patch in this series will teach collect_merge_info_callback() to return -1 on an additional path -- detecting duplicate entries in malformed trees -- which is similarly handled today by just ignoring the problem (resulting in mostly a "last one wins" rule, though the non-last entry can mutate various state flags). Capture the return value, stop the loop on negative returns, and propagate the error to the caller. The callback returns a positive mask value on success, so normalize non-negative returns to 0 for the caller. Signed-off-by: Elijah Newren <newren@gmail.com>
collect_merge_info() has set info.show_all_errors = 1 since d2bc199 (merge-ort: implement a very basic collect_merge_info(), 2020-12-13). This setting was copied from unpack-trees.c where it controls batching of error messages for porcelain display, but merge-ort has no such error-batching logic and never needed it. With show_all_errors set, traverse_trees() captures a negative callback return but continues processing remaining entries rather than stopping immediately. Removing the setting restores the default behavior where a negative return from collect_merge_info_callback() breaks out of the traversal loop right away, allowing a future commit to exit early when a corrupt tree is detected. Signed-off-by: Elijah Newren <newren@gmail.com>
clear_or_reinit_internal_opts() is responsible for cleaning up the various data structures in merge_options_internal. It already handles many renames-related structures (dirs_removed, dir_renames, relevant_sources, cached_pairs, deferred, etc.) but does not free renames->pairs[].queue. In the normal code path, resolve_and_process_renames() frees pairs[s].queue and reinitializes it with diff_queue_init() before clear_or_reinit_internal_opts() runs, so the omission is harmless. However, if collect_merge_info() encounters an error and returns early (before resolve_and_process_renames() is ever called), any diff pairs already queued by collect_rename_info()/add_pair() will have their backing array leaked. Fix this by freeing renames->pairs[].queue in the cleanup function. In the normal path the pointer is already NULL (from the earlier diff_queue_init() in resolve_and_process_renames()), so free(NULL) is a safe no-op. Signed-off-by: Elijah Newren <newren@gmail.com>
Trees with duplicate entries are malformed; fsck reports "contains duplicate file entries" for them. merge-ort has from the beginning assumed that we would never hit such trees. It was written with the assumption that traverse_trees() calls collect_merge_info_callback() at most once per path. The "sanity checks" in that callback (added in d2bc199 (merge-ort: implement a very basic collect_merge_info(), 2020-12-13)) verify properties of each individual call but not that invariant. The strmap_put() in setup_path_info() silently overwrites the entry from any prior call for the same path, because it assumed there would be no other path. Unfortunately, supplemental data structures for various optimizations could still be tweaked before the extra paths were overwritten, and those data structures not matching expected state could trip various assertions. Change the return type of setup_path_info() from void to int to allow us to detect this case, and abort the merge with a clear error message when it occurs. Signed-off-by: Elijah Newren <newren@gmail.com>
verify_cache() checks that the index does not contain both "path" and
"path/file" before writing a tree. It does this by comparing only
adjacent entries, relying on the assumption that "path/file" would
immediately follow "path" in sorted order. Unfortunately, this
assumption does not always hold. For example:
docs <-- submodule entry
docs-internal/README.md <-- intervening entry
docs/requirements.txt <-- D/F conflict, NOT adjacent to "docs"
When this happens, verify_cache() silently misses the D/F conflict and
write-tree produces a corrupt tree object containing duplicate entries
(one for the submodule "docs" and one for the tree "docs").
I could not find any caller in current git that both allows the index to
get into this state and then tries to write it out without doing other
checks beyond the verify_cache() call in cache_tree_update(), but
verify_cache() is documented as a safety net for preventing corrupt
trees and should actually provide that guarantee. A downstream consumer
that relied solely on cache_tree_update()'s internal checking via
verify_cache() to prevent duplicate tree entries was bitten by the gap.
Add a test that constructs a corrupt index directly (bypassing the D/F
checks in add_index_entry) and verifies that write-tree now rejects it.
Signed-off-by: Elijah Newren <newren@gmail.com>
a87bbaa to
cf50f1a
Compare
|
User |
|
/submit |
|
Submitted as pull.2096.v2.git.1781419047.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
| @@ -161,6 +161,54 @@ void cache_tree_invalidate_path(struct index_state *istate, const char *path) | |||
| istate->cache_changed |= CACHE_TREE_CHANGED; | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> diff --git a/cache-tree.c b/cache-tree.c
> index 7881b42aa2..4d2669b312 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -161,6 +161,54 @@ void cache_tree_invalidate_path(struct index_state *istate, const char *path)
> istate->cache_changed |= CACHE_TREE_CHANGED;
> }
>
> +/*
> + * Check whether this_ce and the next entry in the index form a D/F
> + * conflict ("path" vs "path/file"). Returns the conflicting "path/..."
> + * name when one is found, or NULL otherwise.
> + *
> + * The cache is sorted, so "path/file" sorts after "path" and the
> + * conflict is usually visible as adjacent entries. But other entries
> + * can sort between them -- e.g. "path-internal" sits between "path"
> + * and "path/file" because '-' (0x2D) precedes '/' (0x2F) -- so when
> + * the immediately following entry shares our prefix but starts with a
> + * character that sorts before '/', binary search for "path/" instead.
> + */
> +static const char *find_df_conflict(struct index_state *istate,
> + const struct cache_entry *this_ce,
> + const struct cache_entry *next_ce)
> +{
> + const char *this_name = this_ce->name;
> + const char *next_name = next_ce->name;
> + int this_len = ce_namelen(this_ce);
> + const struct cache_entry *other;
> + struct strbuf probe = STRBUF_INIT;
> + int pos;
> + if (this_len >= ce_namelen(next_ce) ||
> + next_name[this_len] > '/' ||
> + strncmp(this_name, next_name, this_len))
> + return NULL;
First we reject an abvious "cannot conflict" case. If the current
one is not strictly shorter than the next one, it cannot be an
overlapping with one of the leading directories in the next one and
the next one cannot be "hiding" an overlapping one in the "docs,
docs-i, docs/r" relationship described in the log message (where
next=="docs-i' hides the fact that this=="docs" conflicts with
"docs/r"). Of course, the current one cannot be such a confliciting
entry, if an earlier part of the next name does not entirely match
the current name. In addition, if the byte after the matching
leading part in the next name is larger than '/', any entry that
comes later than the next name cannot have '/' at that byte position.
Makes sense.
> + if (next_name[this_len] == '/')
> + return next_name;
And if we see '/' there, we definitely have conflict right there.
So these trivial cases after us, we need to see if an entry that
comes after next has a name whose leading part is identical to this
name, plus '/'. How would we compute it?
> + strbuf_add(&probe, this_name, this_len);
> + strbuf_addch(&probe, '/');
> + pos = index_name_pos_sparse(istate, probe.buf, probe.len);
> + strbuf_release(&probe);
We see where the first of such a name (i.e. this name followed by '/')
may appear. Normal cache entries in the index would never have a
trailing slash in their names, so I expect that this would either
find an directory that is sparsed out and returns a non-negative
index into the .cache[] array, or a negative index that indicates
where a name like this + '/' + anything would appear.
> +
> + if (pos < 0)
> + pos = -pos - 1;
So in order to check both cases, we do the usual index flipping. We
do not need to to treat "it exists---that's sparse dir" case and "it
is expanded and we are trying to find the first entry in that index"
case differently below, which simplifies things a bit.
> + if (pos >= (int)istate->cache_nr)
> + return NULL;
OK, such an entry whose name is this followed by '/' does not exist
so we are safe.
> + other = istate->cache[pos];
> + if (ce_namelen(other) > this_len &&
> + other->name[this_len] == '/' &&
> + !strncmp(this_name, other->name, this_len))
> + return other->name;
> + return NULL;
This looks very similar to the inverse of the earlier one but
slightly different. It might be easier to spot where it differs, if
we flipped the polarity, like so, perhaps?
other = istate->cache[pos];
if (this_len >= ce_namelen(other) ||
other->name[this_len] != '/' ||
strncmp(this_name, other->name, this_len))
return NULL;
return other->name;
Or perhaps not. I do not care very strongly, but I only spotted the
subtle difference while attempting to flip the polarity in my head,
so it might help ther readers the same way. I dunno.
Thanks, looking very good.
|
There was a status update in the "Cooking" section about the branch "ort" merge backend handles merging corrupt trees better by aborting when it should. Waiting for response(s) to review comment(s). cf. <xmqq5x3ldu4h.fsf@gitster.g> source: <pull.2096.v2.git.1781419047.gitgitgadget@gmail.com> |
We had some corrupt trees with duplicate entries in real world repositories, which triggered an assertion failure in merge-ort. Further, the corrupt tree creation in the third party tool would have been avoided had verify_cache() correctly checked for D/F conflicts. Provide fixes for both issues, including 3 preparatory changes for the merge-ort fix.
Changes since v1:
cc: Patrick Steinhardt ps@pks.im
cc: Christian Couder christian.couder@gmail.com
cc: Elijah Newren newren@gmail.com