Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions Documentation/config/diff.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -218,6 +218,11 @@ endif::git-diff[]
Set this option to `true` to make the diff driver cache the text

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +struct diff_subprocess {
> +	struct subprocess_entry subprocess;
> +	unsigned int supported_capabilities;
> +};
> +
> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;

Can we avoid introducing new global variables like these?  Would
"struct userdiff_driver" or "struct diff_options" be a good place to
hang this hashmap, perhaps?

> +static int send_file_content(int fd, const char *buf, long size)
> +{
> +	int ret;
> +
> +	if (size > 0)
> +		ret = write_packetized_from_buf_no_flush(buf, size, fd);
> +	else
> +		ret = 0;

Shouldn't "size == -24" be flagged as an invalid input?

> +	if (ret)
> +		return ret;
> +	return packet_flush_gently(fd);
> +}

> +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk)
> +{
> +...
> +}

This gives a silent error diagnosis, which is good for a lower level
helper.

> +int diff_process_get_hunks(struct userdiff_driver *drv,
> +			   const char *path,
> +			   const char *old_buf, long old_size,
> +			   const char *new_buf, long new_size,
> +			   struct xdl_hunk **hunks_out,
> +			   size_t *nr_hunks_out)
> +{
> +	struct diff_subprocess *backend;
> +	struct child_process *process;
> +	int fd_in, fd_out;
> +	struct strbuf status = STRBUF_INIT;
> +	struct xdl_hunk *hunks = NULL;
> +	struct xdl_hunk hunk;
> +	size_t nr_hunks = 0, alloc_hunks = 0;
> +	int len;
> +	char *line;
> +
> +	if (!drv || !drv->process)
> +		return -1;

A driver that does not define process is not an error; it is
perfectly normal in the current world order where nobody has such an
external process and even fi this patch lands, external processes
are optional.  So here "return -1" does not mean an error, and
silent return is perfectly fine.

> +	backend = find_or_start_process(drv->process);
> +	if (!backend)
> +		return -1;

This is probably an error; the user specified drv->process, we
either tried to find or start the process and failed.  Isn't it an
event that deserves to be reported in an error message?

> +	if (!(backend->supported_capabilities & CAP_HUNKS))
> +		return -1;

Backend started, but the "hunks" feature is not supported.  Perhaps
in a year or two, this external process protocol may have become so
popular that it gained more capabilities, possibly making get_hunks
obsolete.  We may be looking at such an external process that uses
other capabilities but not this one.  This is not an error, so
silent return is perfectly fine.

> +	process = subprocess_get_child_process(&backend->subprocess);
> +	fd_in = process->in;
> +	fd_out = process->out;
> +
> +	/* Send request */
> +	if (packet_write_fmt_gently(fd_in, "command=hunks\n") ||
> +	    packet_write_fmt_gently(fd_in, "pathname=%s\n", path) ||
> +	    packet_flush_gently(fd_in))
> +		goto error;
> +
> +	/* Send old file content */
> +	if (send_file_content(fd_in, old_buf, old_size))
> +		goto error;
> +
> +	/* Send new file content */
> +	if (send_file_content(fd_in, new_buf, new_size))
> +		goto error;
> +
> +	/* Read hunks until flush packet */
> +	while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 &&
> +	       line) {
> +		if (parse_hunk_line(line, &hunk) < 0)
> +			goto error;
> +		ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks);
> +		hunks[nr_hunks++] = hunk;
> +	}
> +	if (len < 0)
> +		goto error;
> +
> +	/* Read status */
> +	if (subprocess_read_status(fd_out, &status))
> +		goto error;
> +
> +	if (strcmp(status.buf, "success")) {
> +		if (!strcmp(status.buf, "abort"))
> +			backend->supported_capabilities &= ~CAP_HUNKS;
> +		goto error;
> +	}
> +
> +	*hunks_out = hunks;
> +	*nr_hunks_out = nr_hunks;
> +	strbuf_release(&status);
> +	return 0;
> +
> +error:

All exceptions that lead here look like events that should be
reported to the end-user.

> +	free(hunks);
> +	strbuf_release(&status);
> +	return -1;
> +}

> +/*
> + * Query a diff process for hunks describing the changes
> + * between old_buf and new_buf.
> + *
> + * The backend is a long-running subprocess configured via
> + * diff.<driver>.process.  It receives file content via
> + * pkt-line and returns hunks with 1-based line numbers.
> + *
> + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated
> + * array (caller must free) and returns 0.
> + *
> + * On failure, returns -1.  The caller should fall back to the
> + * builtin diff algorithm.
> + */

I do not agree with this.  If it is a failure, the user should fix
the external process (or disable).  It shouldn't be hidden behind a
fallback.  As I left comments, in this round of implementation,
there are conditions that returns -1 for soemthing that is not an
error (i.e., not configured, or process not supporting the
particular capability) *and* in those cases the caller should fall
back as if nothing happened.  But some error cases, the caller
should't hide them.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Mon, May 25, 2026 at 6:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +struct diff_subprocess {
> > +     struct subprocess_entry subprocess;
> > +     unsigned int supported_capabilities;
> > +};
> > +
> > +static int subprocess_map_initialized;
> > +static struct hashmap subprocess_map;
>
> Can we avoid introducing new global variables like these?  Would
> "struct userdiff_driver" or "struct diff_options" be a good place to
> hang this hashmap, perhaps?
>

Will clean this up.

> > +static int send_file_content(int fd, const char *buf, long size)
> > +{
> > +     int ret;
> > +
> > +     if (size > 0)
> > +             ret = write_packetized_from_buf_no_flush(buf, size, fd);
> > +     else
> > +             ret = 0;
>
> Shouldn't "size == -24" be flagged as an invalid input?
>

Will fix and do a broader audit of input validation and bounds checking.

> > +     if (ret)
> > +             return ret;
> > +     return packet_flush_gently(fd);
> > +}
>
> > +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk)
> > +{
> > +...
> > +}
>
> This gives a silent error diagnosis, which is good for a lower level
> helper.
>
> > +int diff_process_get_hunks(struct userdiff_driver *drv,
> > +                        const char *path,
> > +                        const char *old_buf, long old_size,
> > +                        const char *new_buf, long new_size,
> > +                        struct xdl_hunk **hunks_out,
> > +                        size_t *nr_hunks_out)
> > +{
> > +     struct diff_subprocess *backend;
> > +     struct child_process *process;
> > +     int fd_in, fd_out;
> > +     struct strbuf status = STRBUF_INIT;
> > +     struct xdl_hunk *hunks = NULL;
> > +     struct xdl_hunk hunk;
> > +     size_t nr_hunks = 0, alloc_hunks = 0;
> > +     int len;
> > +     char *line;
> > +
> > +     if (!drv || !drv->process)
> > +             return -1;
>
> A driver that does not define process is not an error; it is
> perfectly normal in the current world order where nobody has such an
> external process and even fi this patch lands, external processes
> are optional.  So here "return -1" does not mean an error, and
> silent return is perfectly fine.
>
> > +     backend = find_or_start_process(drv->process);
> > +     if (!backend)
> > +             return -1;
>
> This is probably an error; the user specified drv->process, we
> either tried to find or start the process and failed.  Isn't it an
> event that deserves to be reported in an error message?
>
> > +     if (!(backend->supported_capabilities & CAP_HUNKS))
> > +             return -1;
>
> Backend started, but the "hunks" feature is not supported.  Perhaps
> in a year or two, this external process protocol may have become so
> popular that it gained more capabilities, possibly making get_hunks
> obsolete.  We may be looking at such an external process that uses
> other capabilities but not this one.  This is not an error, so
> silent return is perfectly fine.
>
> > +     process = subprocess_get_child_process(&backend->subprocess);
> > +     fd_in = process->in;
> > +     fd_out = process->out;
> > +
> > +     /* Send request */
> > +     if (packet_write_fmt_gently(fd_in, "command=hunks\n") ||
> > +         packet_write_fmt_gently(fd_in, "pathname=%s\n", path) ||
> > +         packet_flush_gently(fd_in))
> > +             goto error;
> > +
> > +     /* Send old file content */
> > +     if (send_file_content(fd_in, old_buf, old_size))
> > +             goto error;
> > +
> > +     /* Send new file content */
> > +     if (send_file_content(fd_in, new_buf, new_size))
> > +             goto error;
> > +
> > +     /* Read hunks until flush packet */
> > +     while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 &&
> > +            line) {
> > +             if (parse_hunk_line(line, &hunk) < 0)
> > +                     goto error;
> > +             ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks);
> > +             hunks[nr_hunks++] = hunk;
> > +     }
> > +     if (len < 0)
> > +             goto error;
> > +
> > +     /* Read status */
> > +     if (subprocess_read_status(fd_out, &status))
> > +             goto error;
> > +
> > +     if (strcmp(status.buf, "success")) {
> > +             if (!strcmp(status.buf, "abort"))
> > +                     backend->supported_capabilities &= ~CAP_HUNKS;
> > +             goto error;
> > +     }
> > +
> > +     *hunks_out = hunks;
> > +     *nr_hunks_out = nr_hunks;
> > +     strbuf_release(&status);
> > +     return 0;
> > +
> > +error:
>
> All exceptions that lead here look like events that should be
> reported to the end-user.
>

Agreed on all points. I will restructure things so errors are flagged when
appropriate (i.e., user specified a process but one was not found / couldn't
start and exceptions) and non-errors are treated as they should be.

> > +     free(hunks);
> > +     strbuf_release(&status);
> > +     return -1;
> > +}
>
> > +/*
> > + * Query a diff process for hunks describing the changes
> > + * between old_buf and new_buf.
> > + *
> > + * The backend is a long-running subprocess configured via
> > + * diff.<driver>.process.  It receives file content via
> > + * pkt-line and returns hunks with 1-based line numbers.
> > + *
> > + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated
> > + * array (caller must free) and returns 0.
> > + *
> > + * On failure, returns -1.  The caller should fall back to the
> > + * builtin diff algorithm.
> > + */
>
> I do not agree with this.  If it is a failure, the user should fix
> the external process (or disable).  It shouldn't be hidden behind a
> fallback.  As I left comments, in this round of implementation,
> there are conditions that returns -1 for soemthing that is not an
> error (i.e., not configured, or process not supporting the
> particular capability) *and* in those cases the caller should fall
> back as if nothing happened.  But some error cases, the caller
> should't hide them.

Will address in a follow-up.

Thank you for the feedback!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Zero hunks with status=success means the tool considers the
> files equivalent.  Git skips diff output for that file.

Is "zero hunk" a common word or some random string you invented?  If
the latter, which is I am assuming it to be, you should define what
it means at/before the first use.  Here in the proposed log message,
and ...

>
> Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
> ---
>  Documentation/config/diff.adoc   |   8 +
>  Documentation/gitattributes.adoc |  40 ++++
>  Makefile                         |   1 +
>  diff-process.c                   | 206 +++++++++++++++++++
>  diff-process.h                   |  28 +++
>  diff.c                           |  23 +++
>  t/.gitattributes                 |   1 +
>  t/t4080-diff-process.sh          | 338 +++++++++++++++++++++++++++++++
>  8 files changed, 645 insertions(+)
>  create mode 100644 diff-process.c
>  create mode 100644 diff-process.h
>  create mode 100755 t/t4080-diff-process.sh
>
> diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc
> index 1135a62a0a..4ab5f60df6 100644
> --- a/Documentation/config/diff.adoc
> +++ b/Documentation/config/diff.adoc
> @@ -218,6 +218,14 @@ endif::git-diff[]
>  	Set this option to `true` to make the diff driver cache the text
>  	conversion outputs.  See linkgit:gitattributes[5] for details.
>  
> +`diff.<driver>.process`::
> +	The command to run as a long-running diff process.
> +	The tool communicates via the pkt-line protocol and returns
> +	hunks that are fed into Git's diff and blame pipelines.
> +	If the tool returns zero hunks, the file is treated as
> +	unchanged for both diff output and blame attribution.
> +	See linkgit:gitattributes[5] for details.

... also here.

I do not know if you mean "the tool returns no hunks" (there is no
"hunk <old_start> <old_count> <new_start> <new_count>" line passed
from the tool over the protocol) or "the tool returns zero-hunk"
(there is a special "zero-hunk" message to signal this particular
condition sent over the protocol), and this description does not
quite help disambiguating between the two.

If the former, then avoid "zero hunks" as it sounds like a noun with
special meaning.  Yes, we can say "tool returns one hunk", "tool
returns 31 hunks", etc., so "tool returns zero hunks" may logically
be correct, but "when the tool returns no hunks with status=success"
is much less confusing, I think.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Mon, May 25, 2026 at 7:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Zero hunks with status=success means the tool considers the
> > files equivalent.  Git skips diff output for that file.
>
> Is "zero hunk" a common word or some random string you invented?  If
> the latter, which is I am assuming it to be, you should define what
> it means at/before the first use.  Here in the proposed log message,
> and ...
>
> >
> > Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
> > ---
> >  Documentation/config/diff.adoc   |   8 +
> >  Documentation/gitattributes.adoc |  40 ++++
> >  Makefile                         |   1 +
> >  diff-process.c                   | 206 +++++++++++++++++++
> >  diff-process.h                   |  28 +++
> >  diff.c                           |  23 +++
> >  t/.gitattributes                 |   1 +
> >  t/t4080-diff-process.sh          | 338 +++++++++++++++++++++++++++++++
> >  8 files changed, 645 insertions(+)
> >  create mode 100644 diff-process.c
> >  create mode 100644 diff-process.h
> >  create mode 100755 t/t4080-diff-process.sh
> >
> > diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc
> > index 1135a62a0a..4ab5f60df6 100644
> > --- a/Documentation/config/diff.adoc
> > +++ b/Documentation/config/diff.adoc
> > @@ -218,6 +218,14 @@ endif::git-diff[]
> >       Set this option to `true` to make the diff driver cache the text
> >       conversion outputs.  See linkgit:gitattributes[5] for details.
> >
> > +`diff.<driver>.process`::
> > +     The command to run as a long-running diff process.
> > +     The tool communicates via the pkt-line protocol and returns
> > +     hunks that are fed into Git's diff and blame pipelines.
> > +     If the tool returns zero hunks, the file is treated as
> > +     unchanged for both diff output and blame attribution.
> > +     See linkgit:gitattributes[5] for details.
>
> ... also here.
>
> I do not know if you mean "the tool returns no hunks" (there is no
> "hunk <old_start> <old_count> <new_start> <new_count>" line passed
> from the tool over the protocol) or "the tool returns zero-hunk"
> (there is a special "zero-hunk" message to signal this particular
> condition sent over the protocol), and this description does not
> quite help disambiguating between the two.
>
> If the former, then avoid "zero hunks" as it sounds like a noun with
> special meaning.  Yes, we can say "tool returns one hunk", "tool
> returns 31 hunks", etc., so "tool returns zero hunks" may logically
> be correct, but "when the tool returns no hunks with status=success"
> is much less confusing, I think.

Yes, "zero hunks" was my own invention and I see why it's confusing. Will
update the messaging to use "no hunks" instead and do a broader sweep of
the documentation to clarify the protocol and expected tool behavior.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Johannes Schindelin wrote on the Git mailing list (how to reply to this email):

Hi Michael,

I stumbled about this patch when it broke CI in Git for Windows, where we
do _not_ use `NO_PYTHON`, even though Python is unavailable in the
build/test CI jobs. The existing tests handle this situation gracefully,
this here patch does not:

On Sun, 7 Jun 2026, Michael Montalbo via GitGitGadget wrote:

> diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> new file mode 100755
> index 0000000000..f159cd86d8
> --- /dev/null
> +++ b/t/t4080-diff-process.sh
> @@ -0,0 +1,538 @@
> +#!/bin/sh
> +
> +test_description='diff process via long-running process'
> +
> +. ./test-lib.sh
> +
> +if test_have_prereq PYTHON
> +then
> +	PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python)

When neither `python3` nor `python` are available (which is the case in
the minimal Git for Windows SDK used in Git's CI runs), this fails under
`set -e`. Before even running the first test case. Resulting in an
unexpected TAP format error.

Now, we could "fix" this by imitating what `lib-p4` does (see
https://github.com/dscho/git/commit/bd0b5570c744f678911a67a62da63f30655f20d8
which demonstrates that it is indeed a work-around on Windows):

-- snip --
diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
index fdf6da1c341e67..bd22c247ff3856 100755
--- a/t/t4080-diff-process.sh
+++ b/t/t4080-diff-process.sh
@@ -4,9 +4,10 @@ test_description='diff process via long-running process'
 
 . ./test-lib.sh
 
-if test_have_prereq PYTHON
+if ! test_have_prereq PYTHON || ! test -x "$PYTHON_PATH"
 then
-	PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python)
+	skip_all='python interpreter not available'
+	test_done
 fi
 
 #
-- snap --

Of course, this uncovers _another_ problem with the Python script: It uses
Python3-only `f"..."` format strings, which cannot be handled by the
Python2 to which the `PYTHON_PATH` variable in `linux-TEST-vars` points.
So this requires _another follow-up (see also
https://github.com/dscho/git/commit/c12a9f4c80e5ce8db0fe370fac46fb45be2b775f):

-- snip --
diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
index bd22c247ff3856..ba14682a9086e4 100755
--- a/t/t4080-diff-process.sh
+++ b/t/t4080-diff-process.sh
@@ -39,7 +39,8 @@ setup_backend () {
 
 	def write_pkt(line):
 	    data = (line + "\n").encode()
-	    sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data)
+	    hdr = "{:04x}".format(len(data) + 4).encode()
+	    sys.stdout.buffer.write(hdr + data)
 	    sys.stdout.buffer.flush()
 
 	def write_flush():
@@ -98,7 +99,8 @@ setup_backend () {
 	    new = read_content()
 	    old_first = old.split(b"\n")[0].decode(errors="replace") if old else ""
 	    new_first = new.split(b"\n")[0].decode(errors="replace") if new else ""
-	    log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}")
+	    log("command={} pathname={} old={} new={}".format(
+	        cmd, pathname, old_first, new_first))
 
 	    if mode == "error":
 	        write_flush()
@@ -130,7 +132,7 @@ setup_backend () {
 	        else:
 	            ol = old.count(b"\n")
 	            nl = new.count(b"\n")
-	            write_pkt(f"hunk 1 {ol} 1 {nl}")
+	            write_pkt("hunk 1 {} 1 {}".format(ol, nl))
 	        write_flush()
 	        write_pkt("status=success")
 	        write_flush()
-- snap --

And this is still not enough to make it work with Python2, see
https://github.com/dscho/git/actions/runs/27091523842/job/79955895737:

-- snip --
[...]
+ git -c diff.cdiff.process=./diff-process-backend --mode=fixed-hunk diff boundary.c
  Traceback (most recent call last):
    File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 45, in <module>
      assert read_pkt() == "git-diff-client"
    File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 4, in read_pkt
      hdr = sys.stdin.buffer.read(4)
  AttributeError: 'file' object has no attribute 'buffer'
-- snap --

I have experienced similar patterns in my career, where a single decision
required multiple follow-up fixes _just_ to avoid having to revert that
decision. This kind of doubling down has never ended well.

Therefore I would like to take a step back, and ask: Is it _really_ a good
idea to use Python here? Are we certain that we want to _require_ Python
to run this test and skip it if Python isn't available (as is the case in
the Windows-related parts of Git's very own CI) even if Python has nothing
at all to do with the feature that is being tested?

I don't want to be doomed to repeat history, and we can very well learn
e.g. from prior art in this very project, where the tests for the
clean/smudge filters (which _also_ want to speak pkt-line over stdio)
needlessly incurred Perl as a requirement to run the tests. It was
Matheus's heroic work in 52917a998ef3a (t0021: implementation the
rot13-filter.pl script in C, 2022-08-14) and 4d1d843be7a15 (tests: use the
new C rot13-filter helper to avoid PERL prereq, 2022-08-14) that avoided
that unnecessary prerequisite.

Likewise, there is `test-tool pkt-line` intended for driving the pkt-line
protocol via simple shell scripts.

So the conscious project direction has been: fold pkt-line test backends
into `test-tool` and drop the scripting-language prereq. Reintroducing the
same shape in Python would walk this back.

Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl',
2025-04-29) has been another clear sign that the Git project is actively
_removing_ scripting-language dependencies from the build and test
infrastructure, not adding new ones.

The clear prior art in Git's own tests for what t4080 wants to do, as of
today, is `t/helper/test-rot13-filter.c`, which could be imitated here
instead of (re-)introducing a dependency on a scripting language other
than Unix shell in Git's test suite.

The `PYTHON` prereq exists in exactly five files today, all `git p4`
related (where Python is an intrinsic prerequisite given that `git-p4.py`
_is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`,
`t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`,
and `t/t9836-git-p4-metadata-encoding-python3.sh`.

After 7cdbff14d482 (remove merge-recursive-old, 2006-11-20), this here
patch would be the first one, after almost 20 years, to re-introduce
Python as a dependency outside `git p4`.

And it would also be the first ever to embed a Python script as a heredoc:

> +fi
> +
> +#
> +# A single parametric diff process.
> +# Usage: diff-process-backend --mode=<mode> [--log=<path>]
> +#
> +# Modes:
> +#   whole-file  - report all lines as changed (default)
> +#   fixed-hunk  - always report hunk 5 2 5 2
> +#   bad-hunk    - report out-of-bounds hunk 999 1 999 1
> +#   bad-sync    - report hunk with mismatched unchanged totals
> +#   overlap     - report two overlapping hunks
> +#   no-hunks   - return no hunks (files considered equivalent)
> +#   error       - return status=error for every request
> +#   abort       - return status=abort for every request
> +#   crash       - read one request then exit without responding
> +#
> +setup_backend () {
> +	cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF
> +	import sys, os
> +
> +	def read_pkt():
> +	    hdr = sys.stdin.buffer.read(4)
> +	    if len(hdr) < 4: return None
> +	    length = int(hdr, 16)
> +	    if length == 0: return ""
> +	    data = sys.stdin.buffer.read(length - 4)
> +	    return data.decode().rstrip("\n")
> +
> +	def write_pkt(line):
> +	    data = (line + "\n").encode()
> +	    sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data)
> +	    sys.stdout.buffer.flush()
> +
> +	def write_flush():
> +	    sys.stdout.buffer.write(b"0000")
> +	    sys.stdout.buffer.flush()
> +
> +	def read_content():
> +	    chunks = []
> +	    while True:
> +	        hdr = sys.stdin.buffer.read(4)
> +	        if len(hdr) < 4: break
> +	        length = int(hdr, 16)
> +	        if length == 0: break
> +	        chunks.append(sys.stdin.buffer.read(length - 4))
> +	    return b"".join(chunks)
> +
> +	mode = "whole-file"
> +	logfile = None
> +	for arg in sys.argv[1:]:
> +	    if arg.startswith("--mode="):
> +	        mode = arg[7:]
> +	    elif arg.startswith("--log="):
> +	        logfile = open(arg[6:], "a")
> +
> +	def log(msg):
> +	    if logfile:
> +	        logfile.write(msg + "\n")
> +	        logfile.flush()
> +
> +	# Handshake
> +	assert read_pkt() == "git-diff-client"
> +	assert read_pkt() == "version=1"
> +	read_pkt()
> +	write_pkt("git-diff-server")
> +	write_pkt("version=1")
> +	write_flush()
> +	while True:
> +	    p = read_pkt()
> +	    if p == "": break
> +	write_pkt("capability=hunks")
> +	write_flush()
> +
> +	log("ready")
> +
> +	while True:
> +	    cmd = None
> +	    pathname = None
> +	    while True:
> +	        p = read_pkt()
> +	        if p is None: sys.exit(0)
> +	        if p == "": break
> +	        if p.startswith("command="): cmd = p.split("=",1)[1]
> +	        if p.startswith("pathname="): pathname = p.split("=",1)[1]
> +	    if cmd is None: sys.exit(0)
> +	    old = read_content()
> +	    new = read_content()
> +	    old_first = old.split(b"\n")[0].decode(errors="replace") if old else ""
> +	    new_first = new.split(b"\n")[0].decode(errors="replace") if new else ""
> +	    log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}")
> +
> +	    if mode == "error":
> +	        write_flush()
> +	        write_pkt("status=error")
> +	        write_flush()
> +	        continue
> +
> +	    if mode == "abort":
> +	        write_flush()
> +	        write_pkt("status=abort")
> +	        write_flush()
> +	        continue
> +
> +	    if mode == "crash":
> +	        sys.exit(1)
> +
> +	    if cmd == "hunks":
> +	        if mode == "fixed-hunk":
> +	            write_pkt("hunk 5 2 5 2")
> +	        elif mode == "bad-hunk":
> +	            write_pkt("hunk 999 1 999 1")
> +	        elif mode == "bad-sync":
> +	            write_pkt("hunk 1 2 1 1")
> +	        elif mode == "overlap":
> +	            write_pkt("hunk 1 5 1 5")
> +	            write_pkt("hunk 3 2 3 2")
> +	        elif mode == "no-hunks":
> +	            pass
> +	        else:
> +	            ol = old.count(b"\n")
> +	            nl = new.count(b"\n")
> +	            write_pkt(f"hunk 1 {ol} 1 {nl}")
> +	        write_flush()
> +	        write_pkt("status=success")
> +	        write_flush()
> +	    else:
> +	        write_flush()
> +	        write_pkt("status=error")
> +	        write_flush()
> +	PYEOF

The existing pattern is to provide larger scripts as fixtures in
associated `t/tNNNN/` directories, not as heredoc, see e.g.
`t/t1509/prepare-chroot.sh`. Writing scripts, especially lengthy ones, in
heredoc strings makes it virtually impossible to use static code analysis
or syntax highlighting to fend off banal errors.

Given the complexity of what t4080 tries to test (error, abort, crash,
bad-sync, no-hunks, multiple files in one session, capability
negotiation), it would unfortunately be infeasible to use `test-tool
pkt-line` from a shell script implementing that `diff.*.process` protocol.

So I've spiked a demo how the `test-tool diff-process-backend` could look
like (letting Opus do the menial typing, so that I can enjoy at least part
of a sunny Sunday outside), which also passes the CI build and test:
https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6

That commit is of course not intended to be used as-is; Feel free to pick
code parts of it and integrate them into your topic branch. Or write your
own test-tool helper from scratch if that's more your jam.

Ciao,
Johannes

> +	write_script diff-process-backend <<-SHEOF
> +	exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@"
> +	SHEOF
> +}
> +
> +BACKEND="./diff-process-backend"
> +
> +test_expect_success PYTHON 'setup' '
> +	setup_backend &&
> +	echo "*.c diff=cdiff" >.gitattributes &&
> +	git add .gitattributes &&
> +
> +	# boundary.c: 10 lines, changes at 5-6 and 9-10.
> +	# Used by: hunk boundaries, error fallback, crash, bad hunks, overlap.
> +	cat >boundary.c <<-\EOF &&
> +	line1
> +	line2
> +	line3
> +	line4
> +	OLD5
> +	OLD6
> +	line7
> +	line8
> +	OLD9
> +	OLD10
> +	EOF
> +	git add boundary.c &&
> +
> +	# worddiff.c: single-line function, value changes 1 -> 999.
> +	# Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat.
> +	cat >worddiff.c <<-\EOF &&
> +	int value(void) { return 1; }
> +	EOF
> +	git add worddiff.c &&
> +
> +	# newfile.c: single-line function, value changes 42 -> 99.
> +	# Used by: new file, --exit-code, multiple drivers.
> +	cat >newfile.c <<-\EOF &&
> +	int new_func(void) { return 42; }
> +	EOF
> +	git add newfile.c &&
> +
> +	# logtest.c: single-line function for log/format-patch tests.
> +	# Needs two commits so log -1 has a diff.
> +	cat >logtest.c <<-\EOF &&
> +	int logfunc(void) { return 1; }
> +	EOF
> +	git add logtest.c &&
> +
> +	# two.c/one.c: two-file pair for error/abort/startup-failure tests.
> +	cat >one.c <<-\EOF &&
> +	int first(void) { return 1; }
> +	EOF
> +	cat >two.c <<-\EOF &&
> +	int second(void) { return 2; }
> +	EOF
> +	git add one.c two.c &&
> +
> +	git commit -m "initial" &&
> +
> +	# Second commit for logtest.c (so log -1 has something to show).
> +	cat >logtest.c <<-\EOF &&
> +	int logfunc(void) { return 2; }
> +	EOF
> +	git add logtest.c &&
> +	git commit -m "change logtest.c" &&
> +
> +	# Working tree modifications (not committed).
> +	cat >boundary.c <<-\EOF &&
> +	line1
> +	line2
> +	line3
> +	line4
> +	NEW5
> +	NEW6
> +	line7
> +	line8
> +	NEW9
> +	NEW10
> +	EOF
> +
> +	cat >worddiff.c <<-\EOF &&
> +	int value(void) { return 999; }
> +	EOF
> +
> +	cat >newfile.c <<-\EOF &&
> +	int new_func(void) { return 99; }
> +	EOF
> +
> +	cat >one.c <<-\EOF &&
> +	int first(void) { return 10; }
> +	EOF
> +
> +	cat >two.c <<-\EOF
> +	int second(void) { return 20; }
> +	EOF
> +'
> +
> +#
> +# Core behavior: the tool controls which lines are marked as changed.
> +#
> +
> +test_expect_success PYTHON 'diff process hunk boundaries affect output' '
> +	# The file has changes at lines 5-6 and 9-10, but fixed-hunk
> +	# only reports lines 5-6 as changed.  Lines 9-10 should not
> +	# appear as changed in the output.
> +	git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \
> +		diff boundary.c >actual &&
> +	test_grep "^-OLD5" actual &&
> +	test_grep "^-OLD6" actual &&
> +	test_grep "^+NEW5" actual &&
> +	test_grep "^+NEW6" actual &&
> +	test_grep ! "^-OLD9" actual &&
> +	test_grep ! "^-OLD10" actual &&
> +	test_grep ! "^+NEW9" actual &&
> +	test_grep ! "^+NEW10" actual
> +'
> +
> +test_expect_success PYTHON 'diff process works with new file' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff -- newfile.c >actual 2>stderr &&
> +	test_grep "return 99" actual &&
> +	test_grep "pathname=newfile.c" backend.log &&
> +	test_must_be_empty stderr
> +'
> +
> +test_expect_success PYTHON 'diff process works with added file (empty old side)' '
> +	cat >added.c <<-\EOF &&
> +	int added(void) { return 1; }
> +	EOF
> +	git add added.c &&
> +
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff --cached -- added.c >actual 2>stderr &&
> +	test_grep "added" actual &&
> +	test_grep "pathname=added.c" backend.log &&
> +	test_must_be_empty stderr
> +'
> +
> +test_expect_success PYTHON 'diff process skipped for binary files' '
> +	printf "\\0binary" >binary.c &&
> +	git add binary.c &&
> +	git commit -m "add binary" &&
> +	printf "\\0changed" >binary.c &&
> +
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff -- binary.c >actual &&
> +	test_grep "Binary files" actual &&
> +	test_path_is_missing backend.log
> +'
> +
> +test_expect_success PYTHON 'diff process not consulted for unmatched driver' '
> +	echo "not tracked by cdiff" >unmatched.txt &&
> +	git add unmatched.txt &&
> +	git commit -m "add unmatched.txt" &&
> +
> +	echo "modified" >unmatched.txt &&
> +
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff -- unmatched.txt >actual &&
> +	test_grep "modified" actual &&
> +	test_path_is_missing backend.log
> +'
> +
> +test_expect_success PYTHON 'multiple drivers use separate processes' '
> +	echo "*.h diff=hdiff" >>.gitattributes &&
> +	git add .gitattributes &&
> +
> +	cat >multi.h <<-\EOF &&
> +	int header(void) { return 1; }
> +	EOF
> +	git add multi.h &&
> +	git commit -m "add multi.h" &&
> +
> +	cat >multi.h <<-\EOF &&
> +	int header(void) { return 2; }
> +	EOF
> +
> +	rm -f backend-c.log backend-h.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \
> +	    -c diff.hdiff.process="$BACKEND --log=backend-h.log" \
> +		diff -- newfile.c multi.h >actual 2>stderr &&
> +	test_grep "pathname=newfile.c" backend-c.log &&
> +	test_grep "pathname=multi.h" backend-h.log &&
> +	test_must_be_empty stderr
> +'
> +
> +test_expect_success PYTHON 'diff process works alongside textconv' '
> +	write_script uppercase-filter <<-\EOF &&
> +	tr "a-z" "A-Z" <"$1"
> +	EOF
> +
> +	cat >textconv.c <<-\EOF &&
> +	hello world
> +	EOF
> +	git add textconv.c &&
> +	git commit -m "add textconv.c" &&
> +
> +	cat >textconv.c <<-\EOF &&
> +	goodbye world
> +	EOF
> +
> +	rm -f backend.log &&
> +	git -c diff.cdiff.textconv="./uppercase-filter" \
> +	    -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff -- textconv.c >actual 2>stderr &&
> +	# The diff process receives textconv-transformed (uppercase) content.
> +	test_grep "pathname=textconv.c" backend.log &&
> +	test_grep "old=HELLO WORLD" backend.log &&
> +	test_grep "new=GOODBYE WORLD" backend.log &&
> +	test_must_be_empty stderr
> +'
> +
> +#
> +# Downstream features: word diff, log, equivalent files, exit code.
> +#
> +
> +test_expect_success PYTHON 'diff process with --word-diff' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff --word-diff worddiff.c >actual 2>stderr &&
> +	test_grep "\[-1;-\]" actual &&
> +	test_grep "{+999;+}" actual &&
> +	test_grep "pathname=worddiff.c" backend.log &&
> +	test_must_be_empty stderr
> +'
> +
> +test_expect_success PYTHON 'diff process works with git log -p' '
> +	# With no-hunks mode, the tool says the files are equivalent,
> +	# so log -p should show the commit but no diff content.
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \
> +		log -1 -p -- logtest.c >actual 2>stderr &&
> +	test_grep "change logtest.c" actual &&
> +	test_grep ! "return 2" actual &&
> +	test_grep "command=hunks pathname=logtest.c" backend.log &&
> +	test_must_be_empty stderr
> +'
> +
> +test_expect_success PYTHON 'diff process no hunks suppresses diff output' '
> +	cat >nohunks.c <<-\EOF &&
> +	int zero(void) { return 0; }
> +	EOF
> +	git add nohunks.c &&
> +	git commit -m "add nohunks.c" &&
> +
> +	cat >nohunks.c <<-\EOF &&
> +	int zero(void) { return 999; }
> +	EOF
> +
> +	git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \
> +		diff nohunks.c >actual &&
> +	test_must_be_empty actual
> +'
> +
> +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' '
> +	git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \
> +		diff --exit-code nohunks.c
> +'
> +
> +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' '
> +	test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \
> +		diff --exit-code newfile.c
> +'
> +
> +#
> +# Bypass mechanisms: flags and commands that skip the diff process.
> +#
> +
> +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff --diff-algorithm=patience worddiff.c >actual &&
> +	test_grep "return 999" actual &&
> +	test_path_is_missing backend.log
> +'
> +
> +test_expect_success PYTHON 'diff process not used by --stat' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> +		diff --stat worddiff.c >actual &&
> +	test_grep "worddiff.c" actual &&
> +	test_path_is_missing backend.log
> +'
> +
> +#
> +# Error handling and fallback.
> +#
> +
> +test_expect_success PYTHON 'diff process fallback on tool error status' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \
> +		diff boundary.c >actual 2>stderr &&
> +	# Fallback produces the full builtin diff (both change regions).
> +	test_grep "^-OLD5" actual &&
> +	test_grep "^+NEW5" actual &&
> +	test_grep "^-OLD9" actual &&
> +	test_grep "^+NEW9" actual &&
> +	# Tool was contacted (it replied with error, not crash).
> +	test_grep "command=hunks pathname=boundary.c" backend.log &&
> +	test_grep "diff process.*failed" stderr
> +'
> +
> +test_expect_success PYTHON 'diff process error keeps tool available for next file' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \
> +		diff -- one.c two.c >actual 2>stderr &&
> +	# Unlike abort, error keeps the tool available: both files
> +	# are sent to the tool (and both fall back).
> +	test_grep "pathname=one.c" backend.log &&
> +	test_grep "pathname=two.c" backend.log &&
> +	test_grep "return 10" actual &&
> +	test_grep "return 20" actual
> +'
> +
> +test_expect_success PYTHON 'diff process abort disables for session' '
> +	rm -f backend.log &&
> +	git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \
> +		diff -- one.c two.c >actual &&
> +	# Both files should still produce diff output via fallback.
> +	test_grep "return 10" actual &&
> +	test_grep "return 20" actual &&
> +	# The tool aborts on the first file and git clears its
> +	# capability.  The second file never contacts the tool.
> +	test_grep "pathname=one.c" backend.log &&
> +	test_grep ! "pathname=two.c" backend.log
> +'
> +
> +test_expect_success PYTHON 'diff process fallback on tool crash' '
> +	git -c diff.cdiff.process="$BACKEND --mode=crash" \
> +		diff boundary.c >actual 2>stderr &&
> +	test_grep "^-OLD5" actual &&
> +	test_grep "^+NEW5" actual &&
> +	test_grep "^-OLD9" actual &&
> +	test_grep "^+NEW9" actual &&
> +	# Crash is a communication failure, so a warning is emitted.
> +	test_grep "diff process.*failed" stderr
> +'
> +
> +test_expect_success PYTHON 'diff process startup failure only warns once' '
> +	git -c diff.cdiff.process="/nonexistent/tool" \
> +		diff -- one.c two.c >actual 2>stderr &&
> +	# Both files produce diff output via fallback.
> +	test_grep "return 10" actual &&
> +	test_grep "return 20" actual &&
> +	# Sentinel prevents repeated warnings: only one, not one per file.
> +	test_grep "diff process.*failed" stderr >warnings &&
> +	test_line_count = 1 warnings
> +'
> +
> +test_expect_success PYTHON 'diff process fallback on bad hunks' '
> +	git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \
> +		diff boundary.c >actual 2>stderr &&
> +	test_grep "^-OLD5" actual &&
> +	test_grep "^+NEW5" actual &&
> +	test_grep "^-OLD9" actual &&
> +	test_grep "^+NEW9" actual &&
> +	# Invalid hunks are caught by xdiff validation, not the
> +	# protocol layer, so no warning is emitted.
> +	test_must_be_empty stderr
> +'
> +
> +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' '
> +	cat >synctest.c <<-\EOF &&
> +	line1
> +	line2
> +	line3
> +	EOF
> +	git add synctest.c &&
> +	git commit -m "add synctest.c" &&
> +
> +	cat >synctest.c <<-\EOF &&
> +	line1
> +	changed
> +	line3
> +	EOF
> +
> +	# bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new
> +	# line as changed, leaving 1 unchanged old vs 2 unchanged new.
> +	# The synchronization invariant fails and git falls back.
> +	git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \
> +		diff synctest.c >actual 2>stderr &&
> +	test_grep "changed" actual
> +'
> +
> +test_expect_success PYTHON 'diff process fallback on overlapping hunks' '
> +	# boundary.c has 10 lines, so both hunks are in bounds
> +	# but they overlap at lines 3-5, triggering the ordering check.
> +	git -c diff.cdiff.process="$BACKEND --mode=overlap" \
> +		diff boundary.c >actual 2>stderr &&
> +	test_grep "NEW5" actual
> +'
> +
> +test_done
> diff --git a/userdiff.h b/userdiff.h
> index 51c26e0d41..a98eabe377 100644
> --- a/userdiff.h
> +++ b/userdiff.h
> @@ -3,6 +3,7 @@
>  
>  #include "notes-cache.h"
>  
> +struct diff_subprocess;
>  struct index_state;
>  struct repository;
>  
> @@ -33,6 +34,8 @@ struct userdiff_driver {
>  	int textconv_want_cache;
>  	const char *process;
>  	char *process_owned;
> +	struct diff_subprocess *diff_subprocess;
> +	unsigned diff_process_failed : 1;
>  };
>  enum userdiff_driver_type {
>  	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
> -- 
> gitgitgadget
> 
> 
> 

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Sun, Jun 7, 2026 at 7:36 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Michael,
>
> I stumbled about this patch when it broke CI in Git for Windows, where we
> do _not_ use `NO_PYTHON`, even though Python is unavailable in the
> build/test CI jobs. The existing tests handle this situation gracefully,
> this here patch does not:
>
> On Sun, 7 Jun 2026, Michael Montalbo via GitGitGadget wrote:
>
> > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> > new file mode 100755
> > index 0000000000..f159cd86d8
> > --- /dev/null
> > +++ b/t/t4080-diff-process.sh
> > @@ -0,0 +1,538 @@
> > +#!/bin/sh
> > +
> > +test_description='diff process via long-running process'
> > +
> > +. ./test-lib.sh
> > +
> > +if test_have_prereq PYTHON
> > +then
> > +     PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python)
>
> When neither `python3` nor `python` are available (which is the case in
> the minimal Git for Windows SDK used in Git's CI runs), this fails under
> `set -e`. Before even running the first test case. Resulting in an
> unexpected TAP format error.
>
> Now, we could "fix" this by imitating what `lib-p4` does (see
> https://github.com/dscho/git/commit/bd0b5570c744f678911a67a62da63f30655f20d8
> which demonstrates that it is indeed a work-around on Windows):
>
> -- snip --
> diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> index fdf6da1c341e67..bd22c247ff3856 100755
> --- a/t/t4080-diff-process.sh
> +++ b/t/t4080-diff-process.sh
> @@ -4,9 +4,10 @@ test_description='diff process via long-running process'
>
>  . ./test-lib.sh
>
> -if test_have_prereq PYTHON
> +if ! test_have_prereq PYTHON || ! test -x "$PYTHON_PATH"
>  then
> -       PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python)
> +       skip_all='python interpreter not available'
> +       test_done
>  fi
>
>  #
> -- snap --
>
> Of course, this uncovers _another_ problem with the Python script: It uses
> Python3-only `f"..."` format strings, which cannot be handled by the
> Python2 to which the `PYTHON_PATH` variable in `linux-TEST-vars` points.
> So this requires _another follow-up (see also
> https://github.com/dscho/git/commit/c12a9f4c80e5ce8db0fe370fac46fb45be2b775f):
>
> -- snip --
> diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> index bd22c247ff3856..ba14682a9086e4 100755
> --- a/t/t4080-diff-process.sh
> +++ b/t/t4080-diff-process.sh
> @@ -39,7 +39,8 @@ setup_backend () {
>
>         def write_pkt(line):
>             data = (line + "\n").encode()
> -           sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data)
> +           hdr = "{:04x}".format(len(data) + 4).encode()
> +           sys.stdout.buffer.write(hdr + data)
>             sys.stdout.buffer.flush()
>
>         def write_flush():
> @@ -98,7 +99,8 @@ setup_backend () {
>             new = read_content()
>             old_first = old.split(b"\n")[0].decode(errors="replace") if old else ""
>             new_first = new.split(b"\n")[0].decode(errors="replace") if new else ""
> -           log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}")
> +           log("command={} pathname={} old={} new={}".format(
> +               cmd, pathname, old_first, new_first))
>
>             if mode == "error":
>                 write_flush()
> @@ -130,7 +132,7 @@ setup_backend () {
>                 else:
>                     ol = old.count(b"\n")
>                     nl = new.count(b"\n")
> -                   write_pkt(f"hunk 1 {ol} 1 {nl}")
> +                   write_pkt("hunk 1 {} 1 {}".format(ol, nl))
>                 write_flush()
>                 write_pkt("status=success")
>                 write_flush()
> -- snap --
>
> And this is still not enough to make it work with Python2, see
> https://github.com/dscho/git/actions/runs/27091523842/job/79955895737:
>
> -- snip --
> [...]
> + git -c diff.cdiff.process=./diff-process-backend --mode=fixed-hunk diff boundary.c
>   Traceback (most recent call last):
>     File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 45, in <module>
>       assert read_pkt() == "git-diff-client"
>     File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 4, in read_pkt
>       hdr = sys.stdin.buffer.read(4)
>   AttributeError: 'file' object has no attribute 'buffer'
> -- snap --
>
> I have experienced similar patterns in my career, where a single decision
> required multiple follow-up fixes _just_ to avoid having to revert that
> decision. This kind of doubling down has never ended well.
>
> Therefore I would like to take a step back, and ask: Is it _really_ a good
> idea to use Python here? Are we certain that we want to _require_ Python
> to run this test and skip it if Python isn't available (as is the case in
> the Windows-related parts of Git's very own CI) even if Python has nothing
> at all to do with the feature that is being tested?
>
> I don't want to be doomed to repeat history, and we can very well learn
> e.g. from prior art in this very project, where the tests for the
> clean/smudge filters (which _also_ want to speak pkt-line over stdio)
> needlessly incurred Perl as a requirement to run the tests. It was
> Matheus's heroic work in 52917a998ef3a (t0021: implementation the
> rot13-filter.pl script in C, 2022-08-14) and 4d1d843be7a15 (tests: use the
> new C rot13-filter helper to avoid PERL prereq, 2022-08-14) that avoided
> that unnecessary prerequisite.
>
> Likewise, there is `test-tool pkt-line` intended for driving the pkt-line
> protocol via simple shell scripts.
>
> So the conscious project direction has been: fold pkt-line test backends
> into `test-tool` and drop the scripting-language prereq. Reintroducing the
> same shape in Python would walk this back.
>
> Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl',
> 2025-04-29) has been another clear sign that the Git project is actively
> _removing_ scripting-language dependencies from the build and test
> infrastructure, not adding new ones.
>
> The clear prior art in Git's own tests for what t4080 wants to do, as of
> today, is `t/helper/test-rot13-filter.c`, which could be imitated here
> instead of (re-)introducing a dependency on a scripting language other
> than Unix shell in Git's test suite.
>
> The `PYTHON` prereq exists in exactly five files today, all `git p4`
> related (where Python is an intrinsic prerequisite given that `git-p4.py`
> _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`,
> `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`,
> and `t/t9836-git-p4-metadata-encoding-python3.sh`.
>
> After 7cdbff14d482 (remove merge-recursive-old, 2006-11-20), this here
> patch would be the first one, after almost 20 years, to re-introduce
> Python as a dependency outside `git p4`.
>
> And it would also be the first ever to embed a Python script as a heredoc:
>
> > +fi
> > +
> > +#
> > +# A single parametric diff process.
> > +# Usage: diff-process-backend --mode=<mode> [--log=<path>]
> > +#
> > +# Modes:
> > +#   whole-file  - report all lines as changed (default)
> > +#   fixed-hunk  - always report hunk 5 2 5 2
> > +#   bad-hunk    - report out-of-bounds hunk 999 1 999 1
> > +#   bad-sync    - report hunk with mismatched unchanged totals
> > +#   overlap     - report two overlapping hunks
> > +#   no-hunks   - return no hunks (files considered equivalent)
> > +#   error       - return status=error for every request
> > +#   abort       - return status=abort for every request
> > +#   crash       - read one request then exit without responding
> > +#
> > +setup_backend () {
> > +     cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF
> > +     import sys, os
> > +
> > +     def read_pkt():
> > +         hdr = sys.stdin.buffer.read(4)
> > +         if len(hdr) < 4: return None
> > +         length = int(hdr, 16)
> > +         if length == 0: return ""
> > +         data = sys.stdin.buffer.read(length - 4)
> > +         return data.decode().rstrip("\n")
> > +
> > +     def write_pkt(line):
> > +         data = (line + "\n").encode()
> > +         sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data)
> > +         sys.stdout.buffer.flush()
> > +
> > +     def write_flush():
> > +         sys.stdout.buffer.write(b"0000")
> > +         sys.stdout.buffer.flush()
> > +
> > +     def read_content():
> > +         chunks = []
> > +         while True:
> > +             hdr = sys.stdin.buffer.read(4)
> > +             if len(hdr) < 4: break
> > +             length = int(hdr, 16)
> > +             if length == 0: break
> > +             chunks.append(sys.stdin.buffer.read(length - 4))
> > +         return b"".join(chunks)
> > +
> > +     mode = "whole-file"
> > +     logfile = None
> > +     for arg in sys.argv[1:]:
> > +         if arg.startswith("--mode="):
> > +             mode = arg[7:]
> > +         elif arg.startswith("--log="):
> > +             logfile = open(arg[6:], "a")
> > +
> > +     def log(msg):
> > +         if logfile:
> > +             logfile.write(msg + "\n")
> > +             logfile.flush()
> > +
> > +     # Handshake
> > +     assert read_pkt() == "git-diff-client"
> > +     assert read_pkt() == "version=1"
> > +     read_pkt()
> > +     write_pkt("git-diff-server")
> > +     write_pkt("version=1")
> > +     write_flush()
> > +     while True:
> > +         p = read_pkt()
> > +         if p == "": break
> > +     write_pkt("capability=hunks")
> > +     write_flush()
> > +
> > +     log("ready")
> > +
> > +     while True:
> > +         cmd = None
> > +         pathname = None
> > +         while True:
> > +             p = read_pkt()
> > +             if p is None: sys.exit(0)
> > +             if p == "": break
> > +             if p.startswith("command="): cmd = p.split("=",1)[1]
> > +             if p.startswith("pathname="): pathname = p.split("=",1)[1]
> > +         if cmd is None: sys.exit(0)
> > +         old = read_content()
> > +         new = read_content()
> > +         old_first = old.split(b"\n")[0].decode(errors="replace") if old else ""
> > +         new_first = new.split(b"\n")[0].decode(errors="replace") if new else ""
> > +         log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}")
> > +
> > +         if mode == "error":
> > +             write_flush()
> > +             write_pkt("status=error")
> > +             write_flush()
> > +             continue
> > +
> > +         if mode == "abort":
> > +             write_flush()
> > +             write_pkt("status=abort")
> > +             write_flush()
> > +             continue
> > +
> > +         if mode == "crash":
> > +             sys.exit(1)
> > +
> > +         if cmd == "hunks":
> > +             if mode == "fixed-hunk":
> > +                 write_pkt("hunk 5 2 5 2")
> > +             elif mode == "bad-hunk":
> > +                 write_pkt("hunk 999 1 999 1")
> > +             elif mode == "bad-sync":
> > +                 write_pkt("hunk 1 2 1 1")
> > +             elif mode == "overlap":
> > +                 write_pkt("hunk 1 5 1 5")
> > +                 write_pkt("hunk 3 2 3 2")
> > +             elif mode == "no-hunks":
> > +                 pass
> > +             else:
> > +                 ol = old.count(b"\n")
> > +                 nl = new.count(b"\n")
> > +                 write_pkt(f"hunk 1 {ol} 1 {nl}")
> > +             write_flush()
> > +             write_pkt("status=success")
> > +             write_flush()
> > +         else:
> > +             write_flush()
> > +             write_pkt("status=error")
> > +             write_flush()
> > +     PYEOF
>
> The existing pattern is to provide larger scripts as fixtures in
> associated `t/tNNNN/` directories, not as heredoc, see e.g.
> `t/t1509/prepare-chroot.sh`. Writing scripts, especially lengthy ones, in
> heredoc strings makes it virtually impossible to use static code analysis
> or syntax highlighting to fend off banal errors.
>
> Given the complexity of what t4080 tries to test (error, abort, crash,
> bad-sync, no-hunks, multiple files in one session, capability
> negotiation), it would unfortunately be infeasible to use `test-tool
> pkt-line` from a shell script implementing that `diff.*.process` protocol.
>
> So I've spiked a demo how the `test-tool diff-process-backend` could look
> like (letting Opus do the menial typing, so that I can enjoy at least part
> of a sunny Sunday outside), which also passes the CI build and test:
> https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6
>
> That commit is of course not intended to be used as-is; Feel free to pick
> code parts of it and integrate them into your topic branch. Or write your
> own test-tool helper from scratch if that's more your jam.
>

Johannes, thank you for the great feedback. The historical context is
really helpful and
the concerns you raise make a lot of sense. I will take a look at your
spike and also work
on removing Python from the test.

> Ciao,
> Johannes
>
> > +     write_script diff-process-backend <<-SHEOF
> > +     exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@"
> > +     SHEOF
> > +}
> > +
> > +BACKEND="./diff-process-backend"
> > +
> > +test_expect_success PYTHON 'setup' '
> > +     setup_backend &&
> > +     echo "*.c diff=cdiff" >.gitattributes &&
> > +     git add .gitattributes &&
> > +
> > +     # boundary.c: 10 lines, changes at 5-6 and 9-10.
> > +     # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap.
> > +     cat >boundary.c <<-\EOF &&
> > +     line1
> > +     line2
> > +     line3
> > +     line4
> > +     OLD5
> > +     OLD6
> > +     line7
> > +     line8
> > +     OLD9
> > +     OLD10
> > +     EOF
> > +     git add boundary.c &&
> > +
> > +     # worddiff.c: single-line function, value changes 1 -> 999.
> > +     # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat.
> > +     cat >worddiff.c <<-\EOF &&
> > +     int value(void) { return 1; }
> > +     EOF
> > +     git add worddiff.c &&
> > +
> > +     # newfile.c: single-line function, value changes 42 -> 99.
> > +     # Used by: new file, --exit-code, multiple drivers.
> > +     cat >newfile.c <<-\EOF &&
> > +     int new_func(void) { return 42; }
> > +     EOF
> > +     git add newfile.c &&
> > +
> > +     # logtest.c: single-line function for log/format-patch tests.
> > +     # Needs two commits so log -1 has a diff.
> > +     cat >logtest.c <<-\EOF &&
> > +     int logfunc(void) { return 1; }
> > +     EOF
> > +     git add logtest.c &&
> > +
> > +     # two.c/one.c: two-file pair for error/abort/startup-failure tests.
> > +     cat >one.c <<-\EOF &&
> > +     int first(void) { return 1; }
> > +     EOF
> > +     cat >two.c <<-\EOF &&
> > +     int second(void) { return 2; }
> > +     EOF
> > +     git add one.c two.c &&
> > +
> > +     git commit -m "initial" &&
> > +
> > +     # Second commit for logtest.c (so log -1 has something to show).
> > +     cat >logtest.c <<-\EOF &&
> > +     int logfunc(void) { return 2; }
> > +     EOF
> > +     git add logtest.c &&
> > +     git commit -m "change logtest.c" &&
> > +
> > +     # Working tree modifications (not committed).
> > +     cat >boundary.c <<-\EOF &&
> > +     line1
> > +     line2
> > +     line3
> > +     line4
> > +     NEW5
> > +     NEW6
> > +     line7
> > +     line8
> > +     NEW9
> > +     NEW10
> > +     EOF
> > +
> > +     cat >worddiff.c <<-\EOF &&
> > +     int value(void) { return 999; }
> > +     EOF
> > +
> > +     cat >newfile.c <<-\EOF &&
> > +     int new_func(void) { return 99; }
> > +     EOF
> > +
> > +     cat >one.c <<-\EOF &&
> > +     int first(void) { return 10; }
> > +     EOF
> > +
> > +     cat >two.c <<-\EOF
> > +     int second(void) { return 20; }
> > +     EOF
> > +'
> > +
> > +#
> > +# Core behavior: the tool controls which lines are marked as changed.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process hunk boundaries affect output' '
> > +     # The file has changes at lines 5-6 and 9-10, but fixed-hunk
> > +     # only reports lines 5-6 as changed.  Lines 9-10 should not
> > +     # appear as changed in the output.
> > +     git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \
> > +             diff boundary.c >actual &&
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^-OLD6" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^+NEW6" actual &&
> > +     test_grep ! "^-OLD9" actual &&
> > +     test_grep ! "^-OLD10" actual &&
> > +     test_grep ! "^+NEW9" actual &&
> > +     test_grep ! "^+NEW10" actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works with new file' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- newfile.c >actual 2>stderr &&
> > +     test_grep "return 99" actual &&
> > +     test_grep "pathname=newfile.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works with added file (empty old side)' '
> > +     cat >added.c <<-\EOF &&
> > +     int added(void) { return 1; }
> > +     EOF
> > +     git add added.c &&
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --cached -- added.c >actual 2>stderr &&
> > +     test_grep "added" actual &&
> > +     test_grep "pathname=added.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process skipped for binary files' '
> > +     printf "\\0binary" >binary.c &&
> > +     git add binary.c &&
> > +     git commit -m "add binary" &&
> > +     printf "\\0changed" >binary.c &&
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- binary.c >actual &&
> > +     test_grep "Binary files" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'diff process not consulted for unmatched driver' '
> > +     echo "not tracked by cdiff" >unmatched.txt &&
> > +     git add unmatched.txt &&
> > +     git commit -m "add unmatched.txt" &&
> > +
> > +     echo "modified" >unmatched.txt &&
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- unmatched.txt >actual &&
> > +     test_grep "modified" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'multiple drivers use separate processes' '
> > +     echo "*.h diff=hdiff" >>.gitattributes &&
> > +     git add .gitattributes &&
> > +
> > +     cat >multi.h <<-\EOF &&
> > +     int header(void) { return 1; }
> > +     EOF
> > +     git add multi.h &&
> > +     git commit -m "add multi.h" &&
> > +
> > +     cat >multi.h <<-\EOF &&
> > +     int header(void) { return 2; }
> > +     EOF
> > +
> > +     rm -f backend-c.log backend-h.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \
> > +         -c diff.hdiff.process="$BACKEND --log=backend-h.log" \
> > +             diff -- newfile.c multi.h >actual 2>stderr &&
> > +     test_grep "pathname=newfile.c" backend-c.log &&
> > +     test_grep "pathname=multi.h" backend-h.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works alongside textconv' '
> > +     write_script uppercase-filter <<-\EOF &&
> > +     tr "a-z" "A-Z" <"$1"
> > +     EOF
> > +
> > +     cat >textconv.c <<-\EOF &&
> > +     hello world
> > +     EOF
> > +     git add textconv.c &&
> > +     git commit -m "add textconv.c" &&
> > +
> > +     cat >textconv.c <<-\EOF &&
> > +     goodbye world
> > +     EOF
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.textconv="./uppercase-filter" \
> > +         -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- textconv.c >actual 2>stderr &&
> > +     # The diff process receives textconv-transformed (uppercase) content.
> > +     test_grep "pathname=textconv.c" backend.log &&
> > +     test_grep "old=HELLO WORLD" backend.log &&
> > +     test_grep "new=GOODBYE WORLD" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +#
> > +# Downstream features: word diff, log, equivalent files, exit code.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process with --word-diff' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --word-diff worddiff.c >actual 2>stderr &&
> > +     test_grep "\[-1;-\]" actual &&
> > +     test_grep "{+999;+}" actual &&
> > +     test_grep "pathname=worddiff.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works with git log -p' '
> > +     # With no-hunks mode, the tool says the files are equivalent,
> > +     # so log -p should show the commit but no diff content.
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \
> > +             log -1 -p -- logtest.c >actual 2>stderr &&
> > +     test_grep "change logtest.c" actual &&
> > +     test_grep ! "return 2" actual &&
> > +     test_grep "command=hunks pathname=logtest.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process no hunks suppresses diff output' '
> > +     cat >nohunks.c <<-\EOF &&
> > +     int zero(void) { return 0; }
> > +     EOF
> > +     git add nohunks.c &&
> > +     git commit -m "add nohunks.c" &&
> > +
> > +     cat >nohunks.c <<-\EOF &&
> > +     int zero(void) { return 999; }
> > +     EOF
> > +
> > +     git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \
> > +             diff nohunks.c >actual &&
> > +     test_must_be_empty actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' '
> > +     git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \
> > +             diff --exit-code nohunks.c
> > +'
> > +
> > +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' '
> > +     test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \
> > +             diff --exit-code newfile.c
> > +'
> > +
> > +#
> > +# Bypass mechanisms: flags and commands that skip the diff process.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --diff-algorithm=patience worddiff.c >actual &&
> > +     test_grep "return 999" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'diff process not used by --stat' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --stat worddiff.c >actual &&
> > +     test_grep "worddiff.c" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +#
> > +# Error handling and fallback.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process fallback on tool error status' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     # Fallback produces the full builtin diff (both change regions).
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^-OLD9" actual &&
> > +     test_grep "^+NEW9" actual &&
> > +     # Tool was contacted (it replied with error, not crash).
> > +     test_grep "command=hunks pathname=boundary.c" backend.log &&
> > +     test_grep "diff process.*failed" stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process error keeps tool available for next file' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \
> > +             diff -- one.c two.c >actual 2>stderr &&
> > +     # Unlike abort, error keeps the tool available: both files
> > +     # are sent to the tool (and both fall back).
> > +     test_grep "pathname=one.c" backend.log &&
> > +     test_grep "pathname=two.c" backend.log &&
> > +     test_grep "return 10" actual &&
> > +     test_grep "return 20" actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process abort disables for session' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \
> > +             diff -- one.c two.c >actual &&
> > +     # Both files should still produce diff output via fallback.
> > +     test_grep "return 10" actual &&
> > +     test_grep "return 20" actual &&
> > +     # The tool aborts on the first file and git clears its
> > +     # capability.  The second file never contacts the tool.
> > +     test_grep "pathname=one.c" backend.log &&
> > +     test_grep ! "pathname=two.c" backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on tool crash' '
> > +     git -c diff.cdiff.process="$BACKEND --mode=crash" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^-OLD9" actual &&
> > +     test_grep "^+NEW9" actual &&
> > +     # Crash is a communication failure, so a warning is emitted.
> > +     test_grep "diff process.*failed" stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process startup failure only warns once' '
> > +     git -c diff.cdiff.process="/nonexistent/tool" \
> > +             diff -- one.c two.c >actual 2>stderr &&
> > +     # Both files produce diff output via fallback.
> > +     test_grep "return 10" actual &&
> > +     test_grep "return 20" actual &&
> > +     # Sentinel prevents repeated warnings: only one, not one per file.
> > +     test_grep "diff process.*failed" stderr >warnings &&
> > +     test_line_count = 1 warnings
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on bad hunks' '
> > +     git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^-OLD9" actual &&
> > +     test_grep "^+NEW9" actual &&
> > +     # Invalid hunks are caught by xdiff validation, not the
> > +     # protocol layer, so no warning is emitted.
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' '
> > +     cat >synctest.c <<-\EOF &&
> > +     line1
> > +     line2
> > +     line3
> > +     EOF
> > +     git add synctest.c &&
> > +     git commit -m "add synctest.c" &&
> > +
> > +     cat >synctest.c <<-\EOF &&
> > +     line1
> > +     changed
> > +     line3
> > +     EOF
> > +
> > +     # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new
> > +     # line as changed, leaving 1 unchanged old vs 2 unchanged new.
> > +     # The synchronization invariant fails and git falls back.
> > +     git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \
> > +             diff synctest.c >actual 2>stderr &&
> > +     test_grep "changed" actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on overlapping hunks' '
> > +     # boundary.c has 10 lines, so both hunks are in bounds
> > +     # but they overlap at lines 3-5, triggering the ordering check.
> > +     git -c diff.cdiff.process="$BACKEND --mode=overlap" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     test_grep "NEW5" actual
> > +'
> > +
> > +test_done
> > diff --git a/userdiff.h b/userdiff.h
> > index 51c26e0d41..a98eabe377 100644
> > --- a/userdiff.h
> > +++ b/userdiff.h
> > @@ -3,6 +3,7 @@
> >
> >  #include "notes-cache.h"
> >
> > +struct diff_subprocess;
> >  struct index_state;
> >  struct repository;
> >
> > @@ -33,6 +34,8 @@ struct userdiff_driver {
> >       int textconv_want_cache;
> >       const char *process;
> >       char *process_owned;
> > +     struct diff_subprocess *diff_subprocess;
> > +     unsigned diff_process_failed : 1;
> >  };
> >  enum userdiff_driver_type {
> >       USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
> > --
> > gitgitgadget
> >
> >
> >

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael Montalbo wrote on the Git mailing list (how to reply to this email):

On Sun, Jun 7, 2026 at 7:36 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Michael,
>
> I stumbled about this patch when it broke CI in Git for Windows, where we
> do _not_ use `NO_PYTHON`, even though Python is unavailable in the
> build/test CI jobs. The existing tests handle this situation gracefully,
> this here patch does not:
>
> On Sun, 7 Jun 2026, Michael Montalbo via GitGitGadget wrote:
>
> > diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> > new file mode 100755
> > index 0000000000..f159cd86d8
> > --- /dev/null
> > +++ b/t/t4080-diff-process.sh
> > @@ -0,0 +1,538 @@
> > +#!/bin/sh
> > +
> > +test_description='diff process via long-running process'
> > +
> > +. ./test-lib.sh
> > +
> > +if test_have_prereq PYTHON
> > +then
> > +     PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python)
>
> When neither `python3` nor `python` are available (which is the case in
> the minimal Git for Windows SDK used in Git's CI runs), this fails under
> `set -e`. Before even running the first test case. Resulting in an
> unexpected TAP format error.
>
> Now, we could "fix" this by imitating what `lib-p4` does (see
> https://github.com/dscho/git/commit/bd0b5570c744f678911a67a62da63f30655f20d8
> which demonstrates that it is indeed a work-around on Windows):
>
> -- snip --
> diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> index fdf6da1c341e67..bd22c247ff3856 100755
> --- a/t/t4080-diff-process.sh
> +++ b/t/t4080-diff-process.sh
> @@ -4,9 +4,10 @@ test_description='diff process via long-running process'
>
>  . ./test-lib.sh
>
> -if test_have_prereq PYTHON
> +if ! test_have_prereq PYTHON || ! test -x "$PYTHON_PATH"
>  then
> -       PYTHON_PATH=$(command -v python3) || PYTHON_PATH=$(command -v python)
> +       skip_all='python interpreter not available'
> +       test_done
>  fi
>
>  #
> -- snap --
>
> Of course, this uncovers _another_ problem with the Python script: It uses
> Python3-only `f"..."` format strings, which cannot be handled by the
> Python2 to which the `PYTHON_PATH` variable in `linux-TEST-vars` points.
> So this requires _another follow-up (see also
> https://github.com/dscho/git/commit/c12a9f4c80e5ce8db0fe370fac46fb45be2b775f):
>
> -- snip --
> diff --git a/t/t4080-diff-process.sh b/t/t4080-diff-process.sh
> index bd22c247ff3856..ba14682a9086e4 100755
> --- a/t/t4080-diff-process.sh
> +++ b/t/t4080-diff-process.sh
> @@ -39,7 +39,8 @@ setup_backend () {
>
>         def write_pkt(line):
>             data = (line + "\n").encode()
> -           sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data)
> +           hdr = "{:04x}".format(len(data) + 4).encode()
> +           sys.stdout.buffer.write(hdr + data)
>             sys.stdout.buffer.flush()
>
>         def write_flush():
> @@ -98,7 +99,8 @@ setup_backend () {
>             new = read_content()
>             old_first = old.split(b"\n")[0].decode(errors="replace") if old else ""
>             new_first = new.split(b"\n")[0].decode(errors="replace") if new else ""
> -           log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}")
> +           log("command={} pathname={} old={} new={}".format(
> +               cmd, pathname, old_first, new_first))
>
>             if mode == "error":
>                 write_flush()
> @@ -130,7 +132,7 @@ setup_backend () {
>                 else:
>                     ol = old.count(b"\n")
>                     nl = new.count(b"\n")
> -                   write_pkt(f"hunk 1 {ol} 1 {nl}")
> +                   write_pkt("hunk 1 {} 1 {}".format(ol, nl))
>                 write_flush()
>                 write_pkt("status=success")
>                 write_flush()
> -- snap --
>
> And this is still not enough to make it work with Python2, see
> https://github.com/dscho/git/actions/runs/27091523842/job/79955895737:
>
> -- snip --
> [...]
> + git -c diff.cdiff.process=./diff-process-backend --mode=fixed-hunk diff boundary.c
>   Traceback (most recent call last):
>     File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 45, in <module>
>       assert read_pkt() == "git-diff-client"
>     File "/__w/git/git/t/trash directory.t4080-diff-process/diff-process-backend.py", line 4, in read_pkt
>       hdr = sys.stdin.buffer.read(4)
>   AttributeError: 'file' object has no attribute 'buffer'
> -- snap --
>
> I have experienced similar patterns in my career, where a single decision
> required multiple follow-up fixes _just_ to avoid having to revert that
> decision. This kind of doubling down has never ended well.
>
> Therefore I would like to take a step back, and ask: Is it _really_ a good
> idea to use Python here? Are we certain that we want to _require_ Python
> to run this test and skip it if Python isn't available (as is the case in
> the Windows-related parts of Git's very own CI) even if Python has nothing
> at all to do with the feature that is being tested?
>
> I don't want to be doomed to repeat history, and we can very well learn
> e.g. from prior art in this very project, where the tests for the
> clean/smudge filters (which _also_ want to speak pkt-line over stdio)
> needlessly incurred Perl as a requirement to run the tests. It was
> Matheus's heroic work in 52917a998ef3a (t0021: implementation the
> rot13-filter.pl script in C, 2022-08-14) and 4d1d843be7a15 (tests: use the
> new C rot13-filter helper to avoid PERL prereq, 2022-08-14) that avoided
> that unnecessary prerequisite.
>
> Likewise, there is `test-tool pkt-line` intended for driving the pkt-line
> protocol via simple shell scripts.
>
> So the conscious project direction has been: fold pkt-line test backends
> into `test-tool` and drop the scripting-language prereq. Reintroducing the
> same shape in Python would walk this back.
>
> Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl',
> 2025-04-29) has been another clear sign that the Git project is actively
> _removing_ scripting-language dependencies from the build and test
> infrastructure, not adding new ones.
>

Now I wonder if the extension / addition of more Perl test infra with my other
series:

https://lore.kernel.org/git/pull.2135.git.1780559158.gitgitgadget@gmail.com/T/

also goes against the project direction w.r.t. removing scripting languages.
Maybe I should also re-evaluate the usage of Perl there. I am leveraging
existing shell parsing logic written in Perl, but maybe the preference for
Perl-based lint rules is a mistake and should be avoided.

> The clear prior art in Git's own tests for what t4080 wants to do, as of
> today, is `t/helper/test-rot13-filter.c`, which could be imitated here
> instead of (re-)introducing a dependency on a scripting language other
> than Unix shell in Git's test suite.
>
> The `PYTHON` prereq exists in exactly five files today, all `git p4`
> related (where Python is an intrinsic prerequisite given that `git-p4.py`
> _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`,
> `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`,
> and `t/t9836-git-p4-metadata-encoding-python3.sh`.
>
> After 7cdbff14d482 (remove merge-recursive-old, 2006-11-20), this here
> patch would be the first one, after almost 20 years, to re-introduce
> Python as a dependency outside `git p4`.
>
> And it would also be the first ever to embed a Python script as a heredoc:
>
> > +fi
> > +
> > +#
> > +# A single parametric diff process.
> > +# Usage: diff-process-backend --mode=<mode> [--log=<path>]
> > +#
> > +# Modes:
> > +#   whole-file  - report all lines as changed (default)
> > +#   fixed-hunk  - always report hunk 5 2 5 2
> > +#   bad-hunk    - report out-of-bounds hunk 999 1 999 1
> > +#   bad-sync    - report hunk with mismatched unchanged totals
> > +#   overlap     - report two overlapping hunks
> > +#   no-hunks   - return no hunks (files considered equivalent)
> > +#   error       - return status=error for every request
> > +#   abort       - return status=abort for every request
> > +#   crash       - read one request then exit without responding
> > +#
> > +setup_backend () {
> > +     cat >"$TRASH_DIRECTORY/diff-process-backend.py" <<-\PYEOF
> > +     import sys, os
> > +
> > +     def read_pkt():
> > +         hdr = sys.stdin.buffer.read(4)
> > +         if len(hdr) < 4: return None
> > +         length = int(hdr, 16)
> > +         if length == 0: return ""
> > +         data = sys.stdin.buffer.read(length - 4)
> > +         return data.decode().rstrip("\n")
> > +
> > +     def write_pkt(line):
> > +         data = (line + "\n").encode()
> > +         sys.stdout.buffer.write(f"{len(data)+4:04x}".encode() + data)
> > +         sys.stdout.buffer.flush()
> > +
> > +     def write_flush():
> > +         sys.stdout.buffer.write(b"0000")
> > +         sys.stdout.buffer.flush()
> > +
> > +     def read_content():
> > +         chunks = []
> > +         while True:
> > +             hdr = sys.stdin.buffer.read(4)
> > +             if len(hdr) < 4: break
> > +             length = int(hdr, 16)
> > +             if length == 0: break
> > +             chunks.append(sys.stdin.buffer.read(length - 4))
> > +         return b"".join(chunks)
> > +
> > +     mode = "whole-file"
> > +     logfile = None
> > +     for arg in sys.argv[1:]:
> > +         if arg.startswith("--mode="):
> > +             mode = arg[7:]
> > +         elif arg.startswith("--log="):
> > +             logfile = open(arg[6:], "a")
> > +
> > +     def log(msg):
> > +         if logfile:
> > +             logfile.write(msg + "\n")
> > +             logfile.flush()
> > +
> > +     # Handshake
> > +     assert read_pkt() == "git-diff-client"
> > +     assert read_pkt() == "version=1"
> > +     read_pkt()
> > +     write_pkt("git-diff-server")
> > +     write_pkt("version=1")
> > +     write_flush()
> > +     while True:
> > +         p = read_pkt()
> > +         if p == "": break
> > +     write_pkt("capability=hunks")
> > +     write_flush()
> > +
> > +     log("ready")
> > +
> > +     while True:
> > +         cmd = None
> > +         pathname = None
> > +         while True:
> > +             p = read_pkt()
> > +             if p is None: sys.exit(0)
> > +             if p == "": break
> > +             if p.startswith("command="): cmd = p.split("=",1)[1]
> > +             if p.startswith("pathname="): pathname = p.split("=",1)[1]
> > +         if cmd is None: sys.exit(0)
> > +         old = read_content()
> > +         new = read_content()
> > +         old_first = old.split(b"\n")[0].decode(errors="replace") if old else ""
> > +         new_first = new.split(b"\n")[0].decode(errors="replace") if new else ""
> > +         log(f"command={cmd} pathname={pathname} old={old_first} new={new_first}")
> > +
> > +         if mode == "error":
> > +             write_flush()
> > +             write_pkt("status=error")
> > +             write_flush()
> > +             continue
> > +
> > +         if mode == "abort":
> > +             write_flush()
> > +             write_pkt("status=abort")
> > +             write_flush()
> > +             continue
> > +
> > +         if mode == "crash":
> > +             sys.exit(1)
> > +
> > +         if cmd == "hunks":
> > +             if mode == "fixed-hunk":
> > +                 write_pkt("hunk 5 2 5 2")
> > +             elif mode == "bad-hunk":
> > +                 write_pkt("hunk 999 1 999 1")
> > +             elif mode == "bad-sync":
> > +                 write_pkt("hunk 1 2 1 1")
> > +             elif mode == "overlap":
> > +                 write_pkt("hunk 1 5 1 5")
> > +                 write_pkt("hunk 3 2 3 2")
> > +             elif mode == "no-hunks":
> > +                 pass
> > +             else:
> > +                 ol = old.count(b"\n")
> > +                 nl = new.count(b"\n")
> > +                 write_pkt(f"hunk 1 {ol} 1 {nl}")
> > +             write_flush()
> > +             write_pkt("status=success")
> > +             write_flush()
> > +         else:
> > +             write_flush()
> > +             write_pkt("status=error")
> > +             write_flush()
> > +     PYEOF
>
> The existing pattern is to provide larger scripts as fixtures in
> associated `t/tNNNN/` directories, not as heredoc, see e.g.
> `t/t1509/prepare-chroot.sh`. Writing scripts, especially lengthy ones, in
> heredoc strings makes it virtually impossible to use static code analysis
> or syntax highlighting to fend off banal errors.
>
> Given the complexity of what t4080 tries to test (error, abort, crash,
> bad-sync, no-hunks, multiple files in one session, capability
> negotiation), it would unfortunately be infeasible to use `test-tool
> pkt-line` from a shell script implementing that `diff.*.process` protocol.
>
> So I've spiked a demo how the `test-tool diff-process-backend` could look
> like (letting Opus do the menial typing, so that I can enjoy at least part
> of a sunny Sunday outside), which also passes the CI build and test:
> https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6
>
> That commit is of course not intended to be used as-is; Feel free to pick
> code parts of it and integrate them into your topic branch. Or write your
> own test-tool helper from scratch if that's more your jam.
>
> Ciao,
> Johannes
>
> > +     write_script diff-process-backend <<-SHEOF
> > +     exec "$PYTHON_PATH" "$TRASH_DIRECTORY/diff-process-backend.py" "\$@"
> > +     SHEOF
> > +}
> > +
> > +BACKEND="./diff-process-backend"
> > +
> > +test_expect_success PYTHON 'setup' '
> > +     setup_backend &&
> > +     echo "*.c diff=cdiff" >.gitattributes &&
> > +     git add .gitattributes &&
> > +
> > +     # boundary.c: 10 lines, changes at 5-6 and 9-10.
> > +     # Used by: hunk boundaries, error fallback, crash, bad hunks, overlap.
> > +     cat >boundary.c <<-\EOF &&
> > +     line1
> > +     line2
> > +     line3
> > +     line4
> > +     OLD5
> > +     OLD6
> > +     line7
> > +     line8
> > +     OLD9
> > +     OLD10
> > +     EOF
> > +     git add boundary.c &&
> > +
> > +     # worddiff.c: single-line function, value changes 1 -> 999.
> > +     # Used by: word-diff, --diff-algorithm, --no-ext-diff, --stat.
> > +     cat >worddiff.c <<-\EOF &&
> > +     int value(void) { return 1; }
> > +     EOF
> > +     git add worddiff.c &&
> > +
> > +     # newfile.c: single-line function, value changes 42 -> 99.
> > +     # Used by: new file, --exit-code, multiple drivers.
> > +     cat >newfile.c <<-\EOF &&
> > +     int new_func(void) { return 42; }
> > +     EOF
> > +     git add newfile.c &&
> > +
> > +     # logtest.c: single-line function for log/format-patch tests.
> > +     # Needs two commits so log -1 has a diff.
> > +     cat >logtest.c <<-\EOF &&
> > +     int logfunc(void) { return 1; }
> > +     EOF
> > +     git add logtest.c &&
> > +
> > +     # two.c/one.c: two-file pair for error/abort/startup-failure tests.
> > +     cat >one.c <<-\EOF &&
> > +     int first(void) { return 1; }
> > +     EOF
> > +     cat >two.c <<-\EOF &&
> > +     int second(void) { return 2; }
> > +     EOF
> > +     git add one.c two.c &&
> > +
> > +     git commit -m "initial" &&
> > +
> > +     # Second commit for logtest.c (so log -1 has something to show).
> > +     cat >logtest.c <<-\EOF &&
> > +     int logfunc(void) { return 2; }
> > +     EOF
> > +     git add logtest.c &&
> > +     git commit -m "change logtest.c" &&
> > +
> > +     # Working tree modifications (not committed).
> > +     cat >boundary.c <<-\EOF &&
> > +     line1
> > +     line2
> > +     line3
> > +     line4
> > +     NEW5
> > +     NEW6
> > +     line7
> > +     line8
> > +     NEW9
> > +     NEW10
> > +     EOF
> > +
> > +     cat >worddiff.c <<-\EOF &&
> > +     int value(void) { return 999; }
> > +     EOF
> > +
> > +     cat >newfile.c <<-\EOF &&
> > +     int new_func(void) { return 99; }
> > +     EOF
> > +
> > +     cat >one.c <<-\EOF &&
> > +     int first(void) { return 10; }
> > +     EOF
> > +
> > +     cat >two.c <<-\EOF
> > +     int second(void) { return 20; }
> > +     EOF
> > +'
> > +
> > +#
> > +# Core behavior: the tool controls which lines are marked as changed.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process hunk boundaries affect output' '
> > +     # The file has changes at lines 5-6 and 9-10, but fixed-hunk
> > +     # only reports lines 5-6 as changed.  Lines 9-10 should not
> > +     # appear as changed in the output.
> > +     git -c diff.cdiff.process="$BACKEND --mode=fixed-hunk" \
> > +             diff boundary.c >actual &&
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^-OLD6" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^+NEW6" actual &&
> > +     test_grep ! "^-OLD9" actual &&
> > +     test_grep ! "^-OLD10" actual &&
> > +     test_grep ! "^+NEW9" actual &&
> > +     test_grep ! "^+NEW10" actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works with new file' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- newfile.c >actual 2>stderr &&
> > +     test_grep "return 99" actual &&
> > +     test_grep "pathname=newfile.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works with added file (empty old side)' '
> > +     cat >added.c <<-\EOF &&
> > +     int added(void) { return 1; }
> > +     EOF
> > +     git add added.c &&
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --cached -- added.c >actual 2>stderr &&
> > +     test_grep "added" actual &&
> > +     test_grep "pathname=added.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process skipped for binary files' '
> > +     printf "\\0binary" >binary.c &&
> > +     git add binary.c &&
> > +     git commit -m "add binary" &&
> > +     printf "\\0changed" >binary.c &&
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- binary.c >actual &&
> > +     test_grep "Binary files" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'diff process not consulted for unmatched driver' '
> > +     echo "not tracked by cdiff" >unmatched.txt &&
> > +     git add unmatched.txt &&
> > +     git commit -m "add unmatched.txt" &&
> > +
> > +     echo "modified" >unmatched.txt &&
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- unmatched.txt >actual &&
> > +     test_grep "modified" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'multiple drivers use separate processes' '
> > +     echo "*.h diff=hdiff" >>.gitattributes &&
> > +     git add .gitattributes &&
> > +
> > +     cat >multi.h <<-\EOF &&
> > +     int header(void) { return 1; }
> > +     EOF
> > +     git add multi.h &&
> > +     git commit -m "add multi.h" &&
> > +
> > +     cat >multi.h <<-\EOF &&
> > +     int header(void) { return 2; }
> > +     EOF
> > +
> > +     rm -f backend-c.log backend-h.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend-c.log" \
> > +         -c diff.hdiff.process="$BACKEND --log=backend-h.log" \
> > +             diff -- newfile.c multi.h >actual 2>stderr &&
> > +     test_grep "pathname=newfile.c" backend-c.log &&
> > +     test_grep "pathname=multi.h" backend-h.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works alongside textconv' '
> > +     write_script uppercase-filter <<-\EOF &&
> > +     tr "a-z" "A-Z" <"$1"
> > +     EOF
> > +
> > +     cat >textconv.c <<-\EOF &&
> > +     hello world
> > +     EOF
> > +     git add textconv.c &&
> > +     git commit -m "add textconv.c" &&
> > +
> > +     cat >textconv.c <<-\EOF &&
> > +     goodbye world
> > +     EOF
> > +
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.textconv="./uppercase-filter" \
> > +         -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff -- textconv.c >actual 2>stderr &&
> > +     # The diff process receives textconv-transformed (uppercase) content.
> > +     test_grep "pathname=textconv.c" backend.log &&
> > +     test_grep "old=HELLO WORLD" backend.log &&
> > +     test_grep "new=GOODBYE WORLD" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +#
> > +# Downstream features: word diff, log, equivalent files, exit code.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process with --word-diff' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --word-diff worddiff.c >actual 2>stderr &&
> > +     test_grep "\[-1;-\]" actual &&
> > +     test_grep "{+999;+}" actual &&
> > +     test_grep "pathname=worddiff.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process works with git log -p' '
> > +     # With no-hunks mode, the tool says the files are equivalent,
> > +     # so log -p should show the commit but no diff content.
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=no-hunks --log=backend.log" \
> > +             log -1 -p -- logtest.c >actual 2>stderr &&
> > +     test_grep "change logtest.c" actual &&
> > +     test_grep ! "return 2" actual &&
> > +     test_grep "command=hunks pathname=logtest.c" backend.log &&
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process no hunks suppresses diff output' '
> > +     cat >nohunks.c <<-\EOF &&
> > +     int zero(void) { return 0; }
> > +     EOF
> > +     git add nohunks.c &&
> > +     git commit -m "add nohunks.c" &&
> > +
> > +     cat >nohunks.c <<-\EOF &&
> > +     int zero(void) { return 999; }
> > +     EOF
> > +
> > +     git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \
> > +             diff nohunks.c >actual &&
> > +     test_must_be_empty actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process no hunks with --exit-code returns success' '
> > +     git -c diff.cdiff.process="$BACKEND --mode=no-hunks" \
> > +             diff --exit-code nohunks.c
> > +'
> > +
> > +test_expect_success PYTHON 'diff process with --exit-code and hunks returns failure' '
> > +     test_expect_code 1 git -c diff.cdiff.process="$BACKEND" \
> > +             diff --exit-code newfile.c
> > +'
> > +
> > +#
> > +# Bypass mechanisms: flags and commands that skip the diff process.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process bypassed by --diff-algorithm' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --diff-algorithm=patience worddiff.c >actual &&
> > +     test_grep "return 999" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'diff process not used by --stat' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --log=backend.log" \
> > +             diff --stat worddiff.c >actual &&
> > +     test_grep "worddiff.c" actual &&
> > +     test_path_is_missing backend.log
> > +'
> > +
> > +#
> > +# Error handling and fallback.
> > +#
> > +
> > +test_expect_success PYTHON 'diff process fallback on tool error status' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     # Fallback produces the full builtin diff (both change regions).
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^-OLD9" actual &&
> > +     test_grep "^+NEW9" actual &&
> > +     # Tool was contacted (it replied with error, not crash).
> > +     test_grep "command=hunks pathname=boundary.c" backend.log &&
> > +     test_grep "diff process.*failed" stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process error keeps tool available for next file' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=error --log=backend.log" \
> > +             diff -- one.c two.c >actual 2>stderr &&
> > +     # Unlike abort, error keeps the tool available: both files
> > +     # are sent to the tool (and both fall back).
> > +     test_grep "pathname=one.c" backend.log &&
> > +     test_grep "pathname=two.c" backend.log &&
> > +     test_grep "return 10" actual &&
> > +     test_grep "return 20" actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process abort disables for session' '
> > +     rm -f backend.log &&
> > +     git -c diff.cdiff.process="$BACKEND --mode=abort --log=backend.log" \
> > +             diff -- one.c two.c >actual &&
> > +     # Both files should still produce diff output via fallback.
> > +     test_grep "return 10" actual &&
> > +     test_grep "return 20" actual &&
> > +     # The tool aborts on the first file and git clears its
> > +     # capability.  The second file never contacts the tool.
> > +     test_grep "pathname=one.c" backend.log &&
> > +     test_grep ! "pathname=two.c" backend.log
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on tool crash' '
> > +     git -c diff.cdiff.process="$BACKEND --mode=crash" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^-OLD9" actual &&
> > +     test_grep "^+NEW9" actual &&
> > +     # Crash is a communication failure, so a warning is emitted.
> > +     test_grep "diff process.*failed" stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process startup failure only warns once' '
> > +     git -c diff.cdiff.process="/nonexistent/tool" \
> > +             diff -- one.c two.c >actual 2>stderr &&
> > +     # Both files produce diff output via fallback.
> > +     test_grep "return 10" actual &&
> > +     test_grep "return 20" actual &&
> > +     # Sentinel prevents repeated warnings: only one, not one per file.
> > +     test_grep "diff process.*failed" stderr >warnings &&
> > +     test_line_count = 1 warnings
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on bad hunks' '
> > +     git -c diff.cdiff.process="$BACKEND --mode=bad-hunk" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     test_grep "^-OLD5" actual &&
> > +     test_grep "^+NEW5" actual &&
> > +     test_grep "^-OLD9" actual &&
> > +     test_grep "^+NEW9" actual &&
> > +     # Invalid hunks are caught by xdiff validation, not the
> > +     # protocol layer, so no warning is emitted.
> > +     test_must_be_empty stderr
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on mismatched unchanged totals' '
> > +     cat >synctest.c <<-\EOF &&
> > +     line1
> > +     line2
> > +     line3
> > +     EOF
> > +     git add synctest.c &&
> > +     git commit -m "add synctest.c" &&
> > +
> > +     cat >synctest.c <<-\EOF &&
> > +     line1
> > +     changed
> > +     line3
> > +     EOF
> > +
> > +     # bad-sync reports hunk 1 2 1 1: marks 2 old lines and 1 new
> > +     # line as changed, leaving 1 unchanged old vs 2 unchanged new.
> > +     # The synchronization invariant fails and git falls back.
> > +     git -c diff.cdiff.process="$BACKEND --mode=bad-sync" \
> > +             diff synctest.c >actual 2>stderr &&
> > +     test_grep "changed" actual
> > +'
> > +
> > +test_expect_success PYTHON 'diff process fallback on overlapping hunks' '
> > +     # boundary.c has 10 lines, so both hunks are in bounds
> > +     # but they overlap at lines 3-5, triggering the ordering check.
> > +     git -c diff.cdiff.process="$BACKEND --mode=overlap" \
> > +             diff boundary.c >actual 2>stderr &&
> > +     test_grep "NEW5" actual
> > +'
> > +
> > +test_done
> > diff --git a/userdiff.h b/userdiff.h
> > index 51c26e0d41..a98eabe377 100644
> > --- a/userdiff.h
> > +++ b/userdiff.h
> > @@ -3,6 +3,7 @@
> >
> >  #include "notes-cache.h"
> >
> > +struct diff_subprocess;
> >  struct index_state;
> >  struct repository;
> >
> > @@ -33,6 +34,8 @@ struct userdiff_driver {
> >       int textconv_want_cache;
> >       const char *process;
> >       char *process_owned;
> > +     struct diff_subprocess *diff_subprocess;
> > +     unsigned diff_process_failed : 1;
> >  };
> >  enum userdiff_driver_type {
> >       USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
> > --
> > gitgitgadget
> >
> >
> >

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> So the conscious project direction has been: fold pkt-line test backends
> into `test-tool` and drop the scripting-language prereq. Reintroducing the
> same shape in Python would walk this back.
> ...
> The `PYTHON` prereq exists in exactly five files today, all `git p4`
> related (where Python is an intrinsic prerequisite given that `git-p4.py`
> _is_ written in Python): `t/lib-git-p4.sh`, `t/t9802-git-p4-filetype.sh`,
> `t/t9810-git-p4-rcs.sh`, `t/t9835-git-p4-metadata-encoding-python2.sh`,
> and `t/t9836-git-p4-metadata-encoding-python3.sh`.
> ...
> That commit is of course not intended to be used as-is; Feel free to pick
> code parts of it and integrate them into your topic branch. Or write your
> own test-tool helper from scratch if that's more your jam.

Showing better direction to new folks with such a clear thinking is
very much appreciated.  Even though it is natural and perfectly OK
for tests that interacts with parts of Git that are written in these
languages (e.g., we are OK for gitweb tests to require Perl), we
should consciously keep ourselves clean and not adding unnecessary
dependencies.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Michael Montalbo <mmontalbo@gmail.com> writes:

> On Sun, Jun 7, 2026 at 7:36 AM Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>>
>> Hi Michael,
>>
>> I stumbled about this patch when it broke CI in Git for Windows, where we
>> do _not_ use `NO_PYTHON`, even though Python is unavailable in the
>> build/test CI jobs. The existing tests handle this situation gracefully,
>> this here patch does not:
>> ...
>> Given the complexity of what t4080 tries to test (error, abort, crash,
>> bad-sync, no-hunks, multiple files in one session, capability
>> negotiation), it would unfortunately be infeasible to use `test-tool
>> pkt-line` from a shell script implementing that `diff.*.process` protocol.
>>
>> So I've spiked a demo how the `test-tool diff-process-backend` could look
>> like (letting Opus do the menial typing, so that I can enjoy at least part
>> of a sunny Sunday outside), which also passes the CI build and test:
>> https://github.com/dscho/git/commit/b6e3c93381b00929476c3a00155f7cf7334a22e6
>>
>> That commit is of course not intended to be used as-is; Feel free to pick
>> code parts of it and integrate them into your topic branch. Or write your
>> own test-tool helper from scratch if that's more your jam.
>>
>
> Johannes, thank you for the great feedback. The historical context is
> really helpful and
> the concerns you raise make a lot of sense. I will take a look at your
> spike and also work
> on removing Python from the test.

Another request.

Please do not force readers to scroll through a ~800 line message
just to read only 5 lines of response from you.  Keep relevant parts
of the message you are responding to in your message to help readers
understand the context in which your response was made, but trim
everything else that is not relevant from your quote.

Thanks.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

Michael Montalbo <mmontalbo@gmail.com> writes:

>> So the conscious project direction has been: fold pkt-line test backends
>> into `test-tool` and drop the scripting-language prereq. Reintroducing the
>> same shape in Python would walk this back.
>>
>> Patrick's careful effort in 27bd8ee311719 (Merge branch 'ps/fewer-perl',
>> 2025-04-29) has been another clear sign that the Git project is actively
>> _removing_ scripting-language dependencies from the build and test
>> infrastructure, not adding new ones.
>
> Now I wonder if the extension / addition of more Perl test infra with my other
> series:
>
> https://lore.kernel.org/git/pull.2135.git.1780559158.gitgitgadget@gmail.com/T/
>
> also goes against the project direction w.r.t. removing scripting languages.
> Maybe I should also re-evaluate the usage of Perl there. I am leveraging
> existing shell parsing logic written in Perl, but maybe the preference for
> Perl-based lint rules is a mistake and should be avoided.

That sounds prudent (even though it is outside the scope of _this_
topic, of course).

Thanks.

conversion outputs. See linkgit:gitattributes[5] for details.

`diff.<driver>.process`::
The command to run as a long-running diff process that
provides hunks to Git's diff pipeline.
See linkgit:gitattributes[5] for details.

`diff.indentHeuristic`::
Set this option to `false` to disable the default heuristics
that shift diff hunk boundaries to make patches easier to read.
Expand Down
3 changes: 3 additions & 0 deletions Documentation/diff-algorithm-option.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,6 @@
For instance, if you configured the `diff.algorithm` variable to a
non-default value and want to use the default one, then you
have to use `--diff-algorithm=default` option.
+
If you explicitly choose a diff algorithm, it also bypasses
`diff.<driver>.process` (see linkgit:gitattributes[5]).
4 changes: 3 additions & 1 deletion Documentation/diff-options.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -825,7 +825,9 @@ endif::git-format-patch[]
to use this option with linkgit:git-log[1] and friends.

`--no-ext-diff`::
Disallow external diff drivers.
Disallow external diff helpers, including
`diff.<driver>.command` and `diff.<driver>.process`
(see linkgit:gitattributes[5]).

`--textconv`::
`--no-textconv`::
Expand Down
139 changes: 139 additions & 0 deletions Documentation/gitattributes.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -821,6 +821,145 @@ NOTE: If `diff.<name>.command` is defined for path with the
(see above), and adding `diff.<name>.algorithm` has no effect, as the
algorithm is not passed to the external diff driver.

Using an external diff process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If `diff.<name>.process` is defined, Git sends the old and new file
content to an external tool and receives back a list of changed
regions (pairs of line ranges in the old and new file). Git uses
these instead of its builtin diff algorithm, but still controls
all output formatting, so features like word diff, function context,
color, and blame work normally. This is achieved by using the
long-running process protocol (described in
Documentation/technical/long-running-process-protocol.adoc).
Unlike `diff.<name>.command`, which replaces Git's output entirely,
the diff process feeds results back into the standard pipeline.

First, in `.gitattributes`, assign the `diff` attribute for paths.

------------------------
*.c diff=cdiff
------------------------

Then, define a "diff.<name>.process" configuration to specify
the diff process command.

----------------------------------------------------------------
[diff "cdiff"]
process = /path/to/diff-process-tool
----------------------------------------------------------------

When Git encounters the first file that needs to be diffed, it starts
the process and performs the handshake. In the handshake, the welcome
message sent by Git is "git-diff-client", only version 1 is supported,
and the supported capability is "hunks" (the changed regions
described below).

For each file, Git sends a list of "key=value" pairs terminated with
a flush packet, followed by the old and new file content as packetized
data, each terminated with a flush packet. The pathname is relative
to the repository root. When `diff.<name>.textconv` is also set,
the tool receives the textconv-transformed content rather than the
raw blob. Git does not send binary files to the diff process.

-----------------------
packet: git> command=hunks
packet: git> pathname=path/file.c
packet: git> 0000
packet: git> OLD_CONTENT
packet: git> 0000
packet: git> NEW_CONTENT
packet: git> 0000
-----------------------

The tool is expected to respond with zero or more hunk lines,
a flush packet, and a status packet terminated with a flush packet.
Each hunk line has the form:

`hunk <old_start> <old_count> <new_start> <new_count>`

where `<old_start>` and `<old_count>` identify a range of lines in
the old file, and `<new_start>` and `<new_count>` identify the
replacement range in the new file. Start values are 1-based and
counts are non-negative. Ranges must not extend beyond the end of
the file. For example, `hunk 3 2 3 4` means that 2 lines starting
at line 3 in the old file were replaced by 4 lines starting at
line 3 in the new file. An `<old_count>` of 0 means no lines were
removed (pure insertion); a `<new_count>` of 0 means no lines were
added (pure deletion).

Lines are delimited by newlines. A file `"foo\nbar\n"` and a
file `"foo\nbar"` both have 2 lines.

Hunks must be listed in order and must not overlap. Any line
not covered by a hunk is treated as unchanged, so the total
number of unchanged lines must be the same on both sides.
For example, if the old file has 10 lines and the hunks cover
4 of them (`old_count` values summing to 4), then 6 old lines
are unchanged. The new file must also have exactly 6 lines
not covered by hunks, so the `new_count` values must sum to
`new_file_lines - 6`.

-----------------------
packet: git< hunk 1 3 1 5
packet: git< hunk 10 2 12 2
packet: git< 0000
packet: git< status=success
packet: git< 0000
-----------------------

If the tool responds with hunks and "success", Git marks those lines
as changed and feeds them into the standard diff pipeline. Patch
output features (word diff, function context, color) work normally.
Note that `--stat` and other summary formats use their own diff path
and are not affected by the diff process.

If no hunk lines precede the flush, followed by "success", Git
treats the files as having no changes: `git diff` produces no output
and `git blame` skips the commit, attributing lines to earlier commits.

-----------------------
packet: git< 0000
packet: git< status=success
packet: git< 0000
-----------------------

If the tool returns invalid hunks (out of bounds, overlapping), Git
silently falls back to the builtin diff algorithm.

In case the tool cannot or does not want to process the content,
it is expected to respond with an "error" status. Git warns and
falls back to the builtin diff algorithm for this file. The tool
remains available for subsequent files.

-----------------------
packet: git< 0000
packet: git< status=error
packet: git< 0000
-----------------------

In case the tool cannot or does not want to process the content as
well as any future content for the lifetime of the Git process, it
is expected to respond with an "abort" status. Git silently falls
back to the builtin diff algorithm for this file and does not send
further requests to the tool.

-----------------------
packet: git< 0000
packet: git< status=abort
packet: git< 0000
-----------------------

If the tool dies during the communication or does not adhere to the
protocol then Git will stop the process and fall back to the builtin
diff algorithm. Git warns once and does not restart the process for
subsequent files.

Tools should ignore unknown keys in the per-file request to remain
forward-compatible. Future versions of Git may send additional
`command=` values; tools that receive an unrecognized command should
respond with `status=error` rather than terminating.

Defining a custom hunk-header
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Expand Down
2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -815,6 +815,7 @@ TEST_BUILTINS_OBJS += test-csprng.o
TEST_BUILTINS_OBJS += test-date.o
TEST_BUILTINS_OBJS += test-delete-gpgsig.o
TEST_BUILTINS_OBJS += test-delta.o
TEST_BUILTINS_OBJS += test-diff-process-backend.o
TEST_BUILTINS_OBJS += test-dir-iterator.o
TEST_BUILTINS_OBJS += test-drop-caches.o
TEST_BUILTINS_OBJS += test-dump-cache-tree.o
Expand Down Expand Up @@ -1142,6 +1143,7 @@ LIB_OBJS += diff-delta.o
LIB_OBJS += diff-merges.o
LIB_OBJS += diff-lib.o
LIB_OBJS += diff-no-index.o
LIB_OBJS += diff-process.o
LIB_OBJS += diff.o
LIB_OBJS += diffcore-break.o
LIB_OBJS += diffcore-delta.o
Expand Down
40 changes: 31 additions & 9 deletions blame.c
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@
#include "tag.h"
#include "trace2.h"
#include "blame.h"
#include "diff-process.h"
#include "xdiff-interface.h"
#include "alloc.h"
#include "commit-slab.h"
#include "bloom.h"
Expand Down Expand Up @@ -314,17 +316,25 @@ static struct commit *fake_working_tree_commit(struct repository *r,



static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b,
xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts)
static int diff_hunks_xpp(mmfile_t *file_a, mmfile_t *file_b,
xdl_emit_hunk_consume_func_t hunk_func,
void *cb_data, xpparam_t *xpp)
{
xpparam_t xpp = {0};
xdemitconf_t xecfg = {0};
xdemitcb_t ecb = {NULL};

xpp.flags = xdl_opts;
xecfg.hunk_func = hunk_func;
ecb.priv = cb_data;
return xdi_diff(file_a, file_b, &xpp, &xecfg, &ecb);
return xdi_diff(file_a, file_b, xpp, &xecfg, &ecb);
}

static int diff_hunks(mmfile_t *file_a, mmfile_t *file_b,
xdl_emit_hunk_consume_func_t hunk_func, void *cb_data, int xdl_opts)
{
xpparam_t xpp = {0};

xpp.flags = xdl_opts;
return diff_hunks_xpp(file_a, file_b, hunk_func, cb_data, &xpp);
}

static const char *get_next_line(const char *start, const char *end)
Expand Down Expand Up @@ -1943,6 +1953,7 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb,
struct blame_origin *parent, int ignore_diffs)
{
mmfile_t file_p, file_o;
xpparam_t xpp = {0};
struct blame_chunk_cb_data d;
struct blame_entry *newdest = NULL;

Expand All @@ -1961,10 +1972,21 @@ static void pass_blame_to_parent(struct blame_scoreboard *sb,
&sb->num_read_blob, ignore_diffs);
sb->num_get_patch++;

if (diff_hunks(&file_p, &file_o, blame_chunk_cb, &d, sb->xdl_opts))
die("unable to generate diff (%s -> %s)",
oid_to_hex(&parent->commit->object.oid),
oid_to_hex(&target->commit->object.oid));
xpp.flags = sb->xdl_opts;
/*
* If the diff process considers the files equivalent,
* skip the diff so blame looks past this commit.
*/
if (diff_process_fill_hunks(&sb->revs->diffopt, target->path,
&file_p, &file_o, &xpp)
!= DIFF_PROCESS_EQUIVALENT) {
if (diff_hunks_xpp(&file_p, &file_o, blame_chunk_cb,
&d, &xpp))
die("unable to generate diff (%s -> %s)",
oid_to_hex(&parent->commit->object.oid),
oid_to_hex(&target->commit->object.oid));
}
free(xpp.external_hunks);
/* The rest are the same as the parent */
blame_chunk(&d.dstq, &d.srcq, INT_MAX, d.offset, INT_MAX, 0,
parent, target, 0);
Expand Down
7 changes: 7 additions & 0 deletions builtin/log.c
Original file line number Diff line number Diff line change
Expand Up @@ -2213,6 +2213,13 @@ int cmd_format_patch(int argc,
if (argc > 1)
die(_("unrecognized argument: %s"), argv[1]);

/*
* Disable diff.<driver>.process so that patches generated by
* format-patch are always based on the builtin diff algorithm
* and can be applied reliably.
*/
rev.diffopt.flags.no_diff_process = 1;

if (rev.diffopt.output_format & DIFF_FORMAT_NAME)
die(_("--name-only does not make sense"));
if (rev.diffopt.output_format & DIFF_FORMAT_NAME_STATUS)
Expand Down
Loading
Loading