Skip to content

Duplicate entry hardening#2096

Open
newren wants to merge 5 commits into
gitgitgadget:masterfrom
newren:duplicate-entry-hardening
Open

Duplicate entry hardening#2096
newren wants to merge 5 commits into
gitgitgadget:masterfrom
newren:duplicate-entry-hardening

Conversation

@newren

@newren newren commented Apr 20, 2026

Copy link
Copy Markdown

We had some corrupt trees with duplicate entries in real world repositories, which triggered an assertion failure in merge-ort. Further, the corrupt tree creation in the third party tool would have been avoided had verify_cache() correctly checked for D/F conflicts. Provide fixes for both issues, including 3 preparatory changes for the merge-ort fix.

Changes since v1:

  • Add some more detail to the commit message of Patch 1
  • Split some code out in Patch 5 into a helper function.

cc: Patrick Steinhardt ps@pks.im
cc: Christian Couder christian.couder@gmail.com
cc: Elijah Newren newren@gmail.com

@newren

newren commented Apr 21, 2026

Copy link
Copy Markdown
Author

/submit

@gitgitgadget

gitgitgadget Bot commented Apr 21, 2026

Copy link
Copy Markdown

Submitted as pull.2096.git.1776731171.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2096/newren/duplicate-entry-hardening-v1

To fetch this version to local tag pr-2096/newren/duplicate-entry-hardening-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2096/newren/duplicate-entry-hardening-v1

@gitgitgadget

gitgitgadget Bot commented Apr 26, 2026

Copy link
Copy Markdown

This branch is now known as en/ort-harden-against-corrupt-trees.

@gitgitgadget

gitgitgadget Bot commented Apr 26, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@5ceb988.

@gitgitgadget gitgitgadget Bot added the seen label Apr 26, 2026
@gitgitgadget

gitgitgadget Bot commented Apr 28, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@b2979e0.

@gitgitgadget

gitgitgadget Bot commented May 1, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@5a8e1e0.

@gitgitgadget

gitgitgadget Bot commented May 3, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Needs review.
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented May 4, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@62d334b.

@gitgitgadget

gitgitgadget Bot commented May 9, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@6099d30.

@gitgitgadget

gitgitgadget Bot commented May 9, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@1d317d6.

@gitgitgadget

gitgitgadget Bot commented May 11, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@a1e3c69.

@gitgitgadget

gitgitgadget Bot commented May 11, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Needs review.
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented May 11, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@b7807b6.

@gitgitgadget

gitgitgadget Bot commented May 11, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@8cb15cc.

@gitgitgadget

gitgitgadget Bot commented May 12, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@a090bdb.

@gitgitgadget

gitgitgadget Bot commented May 12, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Needs review.
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented May 13, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@64f0aa6.

@gitgitgadget

gitgitgadget Bot commented May 14, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@1300a5a.

@gitgitgadget

gitgitgadget Bot commented May 15, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@b3f3685.

@gitgitgadget

gitgitgadget Bot commented May 18, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@fc35baf.

@gitgitgadget

gitgitgadget Bot commented May 18, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Needs review.
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented May 19, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@5edc804.

@gitgitgadget

gitgitgadget Bot commented May 20, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@a510b62.

@gitgitgadget

gitgitgadget Bot commented May 20, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Needs review.
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented May 21, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@3924c8d.

@gitgitgadget

gitgitgadget Bot commented May 21, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@4944388.

@gitgitgadget

gitgitgadget Bot commented May 21, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@c7d9b18.

@gitgitgadget

gitgitgadget Bot commented May 23, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@798f6ed.

@gitgitgadget

gitgitgadget Bot commented May 23, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Needs review.
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented Jun 1, 2026

Copy link
Copy Markdown

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> We had some corrupt trees with duplicate entries in real world repositories,
> which triggered an assertion failure in merge-ort. Further, the corrupt tree
> creation in the third party tool would have been avoided had verify_cache()
> correctly checked for D/F conflicts. Provide fixes for both issues,
> including 3 preparatory changes for the merge-ort fix.
>
> Elijah Newren (5):
>   merge-ort: propagate callback errors from traverse_trees_wrapper()
>   merge-ort: drop unnecessary show_all_errors from collect_merge_info()
>   merge-ort: free diff pairs queue in clear_or_reinit_internal_opts()
>   merge-ort: abort merge when trees have duplicate entries
>   cache-tree: fix verify_cache() to catch non-adjacent D/F conflicts

This is a fix to an important corner of our system, but somehow left
in "Needs review" state for much longer than I would have liked, so
even though I am officially on vacation ;-), I took some time to
read these through (by the way it was a pleasant read, thank you).

I wonder if we create a rule like

    Those of you who have more than 30 commits in our project are
    expected to review one topic (or more) from other contributors
    for every three patches you send and ask for reviews by others.

it would help balance the patch vs review ratio, perhaps?

Comment thread cache-tree.c
@gitgitgadget

gitgitgadget Bot commented Jun 1, 2026

Copy link
Copy Markdown

Patrick Steinhardt wrote on the Git mailing list (how to reply to this email):

On Mon, Jun 01, 2026 at 09:33:10PM +0900, Junio C Hamano wrote:
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > We had some corrupt trees with duplicate entries in real world repositories,
> > which triggered an assertion failure in merge-ort. Further, the corrupt tree
> > creation in the third party tool would have been avoided had verify_cache()
> > correctly checked for D/F conflicts. Provide fixes for both issues,
> > including 3 preparatory changes for the merge-ort fix.
> >
> > Elijah Newren (5):
> >   merge-ort: propagate callback errors from traverse_trees_wrapper()
> >   merge-ort: drop unnecessary show_all_errors from collect_merge_info()
> >   merge-ort: free diff pairs queue in clear_or_reinit_internal_opts()
> >   merge-ort: abort merge when trees have duplicate entries
> >   cache-tree: fix verify_cache() to catch non-adjacent D/F conflicts
> 
> This is a fix to an important corner of our system, but somehow left
> in "Needs review" state for much longer than I would have liked, so
> even though I am officially on vacation ;-), I took some time to
> read these through (by the way it was a pleasant read, thank you).

Honestly, I always shy away from the merge-related subsystems. It has a
lot of subtleties that I don't have any experience with, so I never
really consider my input to be helpful here.

> I wonder if we create a rule like
> 
>     Those of you who have more than 30 commits in our project are
>     expected to review one topic (or more) from other contributors
>     for every three patches you send and ask for reviews by others.

Heh, that would make me condense patch series into fewer patches ;)

> it would help balance the patch vs review ratio, perhaps?

It's a good question. I typically try to aim for reviewing series on the
mailing list at least every second day, and I always encourage other
folks in my team to do the same. But recently I (well, rather we)
haven't really been able to due to the current situation at GitLab,
which forces us to put almost all of our focus towards a different
project for a while.

Overall I agree that everyone who is a core contributor should also make
reviews part of their regular worflow. At least for corporate
contributors that might also make it easier to communicate this to their
respective employers. Regardless of that, my expectation is that there
will be times where it works well, and other times where it works less
well.

Patrick

@gitgitgadget

gitgitgadget Bot commented Jun 1, 2026

Copy link
Copy Markdown

User Patrick Steinhardt <ps@pks.im> has been added to the cc: list.

@gitgitgadget

gitgitgadget Bot commented Jun 2, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@db3b688.

@gitgitgadget

gitgitgadget Bot commented Jun 2, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@1587470.

@gitgitgadget

gitgitgadget Bot commented Jun 2, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Waiting for response(s) to review comment(s).
cf. <xmqqldcy4f07.fsf@gitster.g>
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented Jun 3, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@86c54bb.

@gitgitgadget

gitgitgadget Bot commented Jun 4, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@ecefd6e.

@gitgitgadget

gitgitgadget Bot commented Jun 4, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Waiting for response(s) to review comment(s).
cf. <xmqqldcy4f07.fsf@gitster.g>
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented Jun 9, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Waiting for response(s) to review comment(s).
cf. <xmqqldcy4f07.fsf@gitster.g>
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented Jun 11, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Waiting for response(s) to review comment(s).
cf. <xmqqldcy4f07.fsf@gitster.g>
source: <pull.2096.git.1776731171.gitgitgadget@gmail.com>

@gitgitgadget

gitgitgadget Bot commented Jun 12, 2026

Copy link
Copy Markdown

This patch series is no longer integrated into seen.

@gitgitgadget gitgitgadget Bot removed the seen label Jun 12, 2026
@gitgitgadget

gitgitgadget Bot commented Jun 12, 2026

Copy link
Copy Markdown

This patch series was integrated into seen via git@3565a90.

@gitgitgadget gitgitgadget Bot added the seen label Jun 12, 2026
@gitgitgadget

gitgitgadget Bot commented Jun 12, 2026

Copy link
Copy Markdown

Christian Couder wrote on the Git mailing list (how to reply to this email):

On Tue, Jun 2, 2026 at 8:16 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Mon, Jun 01, 2026 at 09:33:10PM +0900, Junio C Hamano wrote:

> > This is a fix to an important corner of our system, but somehow left
> > in "Needs review" state for much longer than I would have liked, so
> > even though I am officially on vacation ;-), I took some time to
> > read these through (by the way it was a pleasant read, thank you).
>
> Honestly, I always shy away from the merge-related subsystems. It has a
> lot of subtleties that I don't have any experience with, so I never
> really consider my input to be helpful here.
>
> > I wonder if we create a rule like
> >
> >     Those of you who have more than 30 commits in our project are
> >     expected to review one topic (or more) from other contributors
> >     for every three patches you send and ask for reviews by others.
>
> Heh, that would make me condense patch series into fewer patches ;)
>
> > it would help balance the patch vs review ratio, perhaps?
>
> It's a good question. I typically try to aim for reviewing series on the
> mailing list at least every second day, and I always encourage other
> folks in my team to do the same. But recently I (well, rather we)
> haven't really been able to due to the current situation at GitLab,
> which forces us to put almost all of our focus towards a different
> project for a while.
>
> Overall I agree that everyone who is a core contributor should also make
> reviews part of their regular worflow. At least for corporate
> contributors that might also make it easier to communicate this to their
> respective employers. Regardless of that, my expectation is that there
> will be times where it works well, and other times where it works less
> well.

Sashiko (https://github.com/sashiko-dev/sashiko) is used these days by
Linux kernel developers and seems to work well for them.

At GitLab and probably in other companies, some of us also use AI to
review our work before sending it to the mailing list. And yeah, it
helps find issues before our patches reach the mailing list.

In the same way as we require that patches must pass CI, do we want to
require that patches "pass" an AI review before they get accepted?

The benefit would be that it would hopefully catch a lot of trivial
things like indentation, typos/grammos, etc, and a lot of things a bit
more difficult to spot like memory issues. Perhaps with some amount of
prompting/configuration (for example pointing it at our
CodingGuidelines and SubmittingPatches) it could also catch issues
like style issues, commits that do too many things, refactoring
opportunities, etc.

We would likely still require at least one human review (by someone
who is not the maintainer) to validate architectural decisions, to
make sure it goes in the same direction as other efforts, and perhaps
also to make sure that AI suggestions were properly handled by the
patch author.

If we decide to require it, then there are a lot of questions that we
will have to answer.

Do we want to have our own system somehow managed by us or would we be
happy to use existing systems already in place in some companies as
long as we can still tweak them in some ways, like the current CI
systems we use?

If we use existing systems likely at GitLab and GitHub, it might be
more difficult to get coherent results as they might use different
LLMs, but maybe it could help tighten our docs to make sure everyone
is aligned, and we could get better reviews by using multiple systems
because an LLM might find an issue that the other LLM missed.

Do we want an AI review right after a patch is posted or only if there
is no human review in the next X days?

Also what if the AI makes a long concrete suggestion to improve on the
patches? Could that be incompatible with our AI policy to apply it?
Should we try to prevent the AI from making such a suggestion in the
first place?

I haven't looked at how Sashiko is used for the kernel, but maybe
there will need to be some kinds of restrictions/authentications to
avoid potential abuse.

@gitgitgadget

gitgitgadget Bot commented Jun 12, 2026

Copy link
Copy Markdown

User Christian Couder <christian.couder@gmail.com> has been added to the cc: list.

@gitgitgadget

gitgitgadget Bot commented Jun 12, 2026

Copy link
Copy Markdown

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Christian Couder <christian.couder@gmail.com> writes:

> On Tue, Jun 2, 2026 at 8:16 AM Patrick Steinhardt <ps@pks.im> wrote:
>>
>> Overall I agree that everyone who is a core contributor should also make
>> reviews part of their regular worflow. At least for corporate
>> contributors that might also make it easier to communicate this to their
>> respective employers. Regardless of that, my expectation is that there
>> will be times where it works well, and other times where it works less
>> well.
>
> Sashiko (https://github.com/sashiko-dev/sashiko) is used these days by
> Linux kernel developers and seems to work well for them.
>
> At GitLab and probably in other companies, some of us also use AI to
> review our work before sending it to the mailing list. And yeah, it
> helps find issues before our patches reach the mailing list.
>
> In the same way as we require that patches must pass CI, do we want to
> require that patches "pass" an AI review before they get accepted?

I do not think so.  You (figuratively, not limited to Christian
Couder) are welcome to use whatever tool available to you to help
you polish your submission, and the higher quality your patches are
(e.g., fewer typos and jumps in logic flow that interferes the
thought process of human reviewers), the more helpful you are being
to the community.  The use of GitHub PR initiated CI run falls into
the same category, I think, in that we do not require you to have an
account and trigger the CI there, but you are doing a good service
if you made sure you caught breakages on macOS you do not have
access to otherwise before sending your patches to the list.

But I do not think we should require you to bring your own token
budget to be able to contribute.

> The benefit would be that it would hopefully catch a lot of trivial
> things like indentation, typos/grammos, etc, and a lot of things a bit
> more difficult to spot like memory issues. Perhaps with some amount of
> prompting/configuration (for example pointing it at our
> CodingGuidelines and SubmittingPatches) it could also catch issues
> like style issues, commits that do too many things, refactoring
> opportunities, etc.

Yes.

Similarly, you are welcome to use tools including AI tools to help
you review others' patches, or help sanity check your reviews of
others' patches before you send them out.  The reason why such an
effort is valuable to the community is the same.

But I personally consider that the use of the tools (not limited to
AI tools) is up to each developer.  What counts a lot more is the
quality of the output.  Just like PR driven CI at GitHub is offered
to everybody who wants to participate and is willing to have an
account there, it may help those aspiring developers if automated
review services are made easily available, but it is a different
story to _require_ use of such service.

newren added 5 commits June 13, 2026 20:26
traverse_trees_wrapper() saves entries from a first pass through
traverse_trees() and then replays them through the real callback
(collect_merge_info_callback).  However, the replay loop silently
discards the callback return value.  This is not a deferred error;
it is an ignored error.

Today the only originator of a negative return in this entire call
graph is traverse_trees()'s "exceeded maximum allowed tree depth"
check; everything else (collect_merge_info_callback,
traverse_trees_wrapper, the inner traverse_trees recursion) only
relays that.  So in current Git, the visible effect of dropping the
replay callback's return value is narrow but bad: a tree nested past
core.maxTreeDepth has its -1 swallowed, the subtree below the limit
is silently pruned, and the merge completes as if that were the
correct result.

A later patch in this series will teach collect_merge_info_callback()
to return -1 on an additional path -- detecting duplicate
entries in malformed trees -- which is similarly handled today by
just ignoring the problem (resulting in mostly a "last one wins" rule,
though the non-last entry can mutate various state flags).

Capture the return value, stop the loop on negative returns, and
propagate the error to the caller.  The callback returns a positive mask
value on success, so normalize non-negative returns to
0 for the caller.

Signed-off-by: Elijah Newren <newren@gmail.com>
collect_merge_info() has set info.show_all_errors = 1 since
d2bc199 (merge-ort: implement a very basic collect_merge_info(),
2020-12-13).  This setting was copied from unpack-trees.c where it
controls batching of error messages for porcelain display, but
merge-ort has no such error-batching logic and never needed it.

With show_all_errors set, traverse_trees() captures a negative callback
return but continues processing remaining entries rather than stopping
immediately.  Removing the setting restores the default behavior where
a negative return from collect_merge_info_callback() breaks out of the
traversal loop right away, allowing a future commit to exit early when
a corrupt tree is detected.

Signed-off-by: Elijah Newren <newren@gmail.com>
clear_or_reinit_internal_opts() is responsible for cleaning up the
various data structures in merge_options_internal.  It already handles
many renames-related structures (dirs_removed, dir_renames,
relevant_sources, cached_pairs, deferred, etc.) but does not free
renames->pairs[].queue.

In the normal code path, resolve_and_process_renames() frees
pairs[s].queue and reinitializes it with diff_queue_init() before
clear_or_reinit_internal_opts() runs, so the omission is harmless.
However, if collect_merge_info() encounters an error and returns early
(before resolve_and_process_renames() is ever called), any diff pairs
already queued by collect_rename_info()/add_pair() will have their
backing array leaked.

Fix this by freeing renames->pairs[].queue in the cleanup function.
In the normal path the pointer is already NULL (from the earlier
diff_queue_init() in resolve_and_process_renames()), so free(NULL) is
a safe no-op.

Signed-off-by: Elijah Newren <newren@gmail.com>
Trees with duplicate entries are malformed; fsck reports "contains
duplicate file entries" for them.  merge-ort has from the beginning
assumed that we would never hit such trees.  It was written with the
assumption that traverse_trees() calls collect_merge_info_callback() at
most once per path.  The "sanity checks" in that callback (added in
d2bc199 (merge-ort: implement a very basic collect_merge_info(),
2020-12-13)) verify properties of each individual call but not that
invariant.  The strmap_put() in setup_path_info() silently overwrites
the entry from any prior call for the same path, because it assumed
there would be no other path.  Unfortunately, supplemental data
structures for various optimizations could still be tweaked before the
extra paths were overwritten, and those data structures not matching
expected state could trip various assertions.

Change the return type of setup_path_info() from void to int to allow us
to detect this case, and abort the merge with a clear error message when
it occurs.

Signed-off-by: Elijah Newren <newren@gmail.com>
verify_cache() checks that the index does not contain both "path" and
"path/file" before writing a tree.  It does this by comparing only
adjacent entries, relying on the assumption that "path/file" would
immediately follow "path" in sorted order.  Unfortunately, this
assumption does not always hold.  For example:

    docs                     <-- submodule entry
    docs-internal/README.md  <-- intervening entry
    docs/requirements.txt    <-- D/F conflict, NOT adjacent to "docs"

When this happens, verify_cache() silently misses the D/F conflict and
write-tree produces a corrupt tree object containing duplicate entries
(one for the submodule "docs" and one for the tree "docs").

I could not find any caller in current git that both allows the index to
get into this state and then tries to write it out without doing other
checks beyond the verify_cache() call in cache_tree_update(), but
verify_cache() is documented as a safety net for preventing corrupt
trees and should actually provide that guarantee.  A downstream consumer
that relied solely on cache_tree_update()'s internal checking via
verify_cache() to prevent duplicate tree entries was bitten by the gap.

Add a test that constructs a corrupt index directly (bypassing the D/F
checks in add_index_entry) and verifies that write-tree now rejects it.

Signed-off-by: Elijah Newren <newren@gmail.com>
@newren newren force-pushed the duplicate-entry-hardening branch from a87bbaa to cf50f1a Compare June 14, 2026 03:54
@gitgitgadget

gitgitgadget Bot commented Jun 14, 2026

Copy link
Copy Markdown

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

@newren

newren commented Jun 14, 2026

Copy link
Copy Markdown
Author

/submit

@gitgitgadget

gitgitgadget Bot commented Jun 14, 2026

Copy link
Copy Markdown

Submitted as pull.2096.v2.git.1781419047.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2096/newren/duplicate-entry-hardening-v2

To fetch this version to local tag pr-2096/newren/duplicate-entry-hardening-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2096/newren/duplicate-entry-hardening-v2

Comment thread cache-tree.c
@@ -161,6 +161,54 @@ void cache_tree_invalidate_path(struct index_state *istate, const char *path)
istate->cache_changed |= CACHE_TREE_CHANGED;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/cache-tree.c b/cache-tree.c
> index 7881b42aa2..4d2669b312 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -161,6 +161,54 @@ void cache_tree_invalidate_path(struct index_state *istate, const char *path)
>  		istate->cache_changed |= CACHE_TREE_CHANGED;
>  }
>  
> +/*
> + * Check whether this_ce and the next entry in the index form a D/F
> + * conflict ("path" vs "path/file").  Returns the conflicting "path/..."
> + * name when one is found, or NULL otherwise.
> + *
> + * The cache is sorted, so "path/file" sorts after "path" and the
> + * conflict is usually visible as adjacent entries.  But other entries
> + * can sort between them -- e.g. "path-internal" sits between "path"
> + * and "path/file" because '-' (0x2D) precedes '/' (0x2F) -- so when
> + * the immediately following entry shares our prefix but starts with a
> + * character that sorts before '/', binary search for "path/" instead.
> + */
> +static const char *find_df_conflict(struct index_state *istate,
> +				    const struct cache_entry *this_ce,
> +				    const struct cache_entry *next_ce)
> +{
> +	const char *this_name = this_ce->name;
> +	const char *next_name = next_ce->name;
> +	int this_len = ce_namelen(this_ce);
> +	const struct cache_entry *other;
> +	struct strbuf probe = STRBUF_INIT;
> +	int pos;

> +	if (this_len >= ce_namelen(next_ce) ||
> +	    next_name[this_len] > '/' ||
> +	    strncmp(this_name, next_name, this_len))
> +		return NULL;

First we reject an abvious "cannot conflict" case. If the current
one is not strictly shorter than the next one, it cannot be an
overlapping with one of the leading directories in the next one and
the next one cannot be "hiding" an overlapping one in the "docs,
docs-i, docs/r" relationship described in the log message (where
next=="docs-i' hides the fact that this=="docs" conflicts with
"docs/r").  Of course, the current one cannot be such a confliciting
entry, if an earlier part of the next name does not entirely match
the current name.  In addition, if the byte after the matching
leading part in the next name is larger than '/', any entry that
comes later than the next name cannot have '/' at that byte position.

Makes sense.

> +	if (next_name[this_len] == '/')
> +		return next_name;

And if we see '/' there, we definitely have conflict right there.

So these trivial cases after us, we need to see if an entry that
comes after next has a name whose leading part is identical to this
name, plus '/'.  How would we compute it?

> +	strbuf_add(&probe, this_name, this_len);
> +	strbuf_addch(&probe, '/');
> +	pos = index_name_pos_sparse(istate, probe.buf, probe.len);
> +	strbuf_release(&probe);

We see where the first of such a name (i.e. this name followed by '/')
may appear.  Normal cache entries in the index would never have a
trailing slash in their names, so I expect that this would either
find an directory that is sparsed out and returns a non-negative
index into the .cache[] array, or a negative index that indicates
where a name like this + '/' + anything would appear.

> +
> +	if (pos < 0)
> +		pos = -pos - 1;

So in order to check both cases, we do the usual index flipping.  We
do not need to to treat "it exists---that's sparse dir" case and "it
is expanded and we are trying to find the first entry in that index"
case differently below, which simplifies things a bit.

> +	if (pos >= (int)istate->cache_nr)
> +		return NULL;

OK, such an entry whose name is this followed by '/' does not exist
so we are safe.

> +	other = istate->cache[pos];
> +	if (ce_namelen(other) > this_len &&
> +	    other->name[this_len] == '/' &&
> +	    !strncmp(this_name, other->name, this_len))
> +		return other->name;
> +	return NULL;

This looks very similar to the inverse of the earlier one but
slightly different.  It might be easier to spot where it differs, if
we flipped the polarity, like so, perhaps?

	other = istate->cache[pos];
	if (this_len >= ce_namelen(other) ||
	    other->name[this_len] != '/' ||
	    strncmp(this_name, other->name, this_len))
		return NULL;

	return other->name;

Or perhaps not.  I do not care very strongly, but I only spotted the
subtle difference while attempting to flip the polarity in my head,
so it might help ther readers the same way.  I dunno.

Thanks, looking very good.

@gitgitgadget

gitgitgadget Bot commented Jun 15, 2026

Copy link
Copy Markdown

There was a status update in the "Cooking" section about the branch en/ort-harden-against-corrupt-trees on the Git mailing list:

"ort" merge backend handles merging corrupt trees better by
aborting when it should.

Waiting for response(s) to review comment(s).
cf. <xmqq5x3ldu4h.fsf@gitster.g>
source: <pull.2096.v2.git.1781419047.gitgitgadget@gmail.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant