
docs: deduplicate built HTML images into a shared image/ tree #4157

Open
grandixximo wants to merge 1 commit into LinuxCNC:master from grandixximo:docs-image-dedup

@grandixximo

Contributor

The HTML doc build copies every referenced image into each language and topic directory, so the built tree carries the same bytes many times over (238 MB of images, only 30 MB unique).

This adds docs/src/tools/dedup-images.py and wires it into the htmldocs build: it collapses byte-identical images (SHA-256) into a shared root image/ tree, keeps only translated overrides under <lang>/image/, and rewrites every src and click-to-enlarge href. It is dry-run by default, idempotent, and self-verifying (after --apply it re-resolves every reference and fails if any is broken), and it preserves HTML mtimes so a second make htmldocs does no work.
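
For readers who want the core idea in code, here is a minimal sketch of grouping byte-identical images by SHA-256. It is not the code in dedup-images.py; the function names and the `IMAGE_SUFFIXES` set are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; not taken from the PR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_images(root: Path) -> dict[str, list[Path]]:
    """Map each digest to the sorted image paths sharing it (groups of 2+ only)."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each group returned here is a candidate for collapsing into one shared copy. Choosing where that copy lives and rewriting the references are separate steps that this sketch does not cover.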

On the full translated tree the build output goes from 325 MB to 119 MB (images 238 MB to 30 MB), with all references verified.

@BsAtHome I went with a post-build pass rather than doing this in image_resolver.rb. The resolver is PDF-only today, and the bulk of the duplication is plain untranslated English images that the _en/_<lang> logic never touches, so a single-source-of-truth approach would mean a bigger refactor of the image stamps to defer to the resolver. The size and build-time results are the same either way. Happy to take the DRY refactor route instead if you would prefer it.

The HTML doc build copies every referenced image into each language and
topic directory, so the built tree carries the same bytes many times over
(238 MB of images, only 30 MB unique).

Add docs/src/tools/dedup-images.py: it collapses byte-identical images
(SHA-256) into a shared root image/ tree, keeps only translated overrides
under <lang>/image/, and rewrites every src and click-to-enlarge href to
match. Dry-run by default, idempotent, and self-verifying: after --apply it
re-resolves every reference and fails if any is broken.
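
A rough sketch of what a self-verification pass of this kind could look like, assuming references are relative `src` and `href` attributes in the built HTML. The class and function names are hypothetical, and the actual checks in the tool may differ.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit

# Assumed set of image extensions; not taken from the PR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


class ImageRefCollector(HTMLParser):
    """Collect <img src> and <a href> attribute values from one HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.refs: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = {"img": "src", "a": "href"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.refs.append(value)


def broken_image_refs(html_file: Path) -> list[str]:
    """Return image references in html_file that do not resolve to a file on disk."""
    collector = ImageRefCollector()
    collector.feed(html_file.read_text(encoding="utf-8", errors="replace"))
    broken = []
    for ref in collector.refs:
        parts = urlsplit(ref)
        if parts.scheme or parts.netloc:
            continue  # external URL or data: URI, outside the built tree
        target = unquote(parts.path)
        if Path(target).suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ordinary page link, not an image
        # Relative references are resolved against the HTML file's directory.
        if not (html_file.parent / target).is_file():
            broken.append(ref)
    return broken
```

A caller would run this over every HTML file after the rewrite and exit non-zero if any list is non-empty.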

Wire it into the htmldocs build as a final .dedup-images-stamp step. The
tool preserves the mtime of every HTML file it rewrites, so a second
`make htmldocs` does no work.
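
The mtime point matters because make decides what to rebuild by comparing timestamps. A minimal sketch of rewriting a file while keeping its original timestamps, assuming a plain string replacement (the function name is hypothetical and this is not the tool's actual code):

```python
import os
from pathlib import Path


def rewrite_preserving_mtime(path: Path, old: str, new: str) -> bool:
    """Replace old with new in path, restoring the original atime and mtime.

    Returns True if the file content changed, False if it was left untouched.
    """
    before = path.stat()
    text = path.read_text(encoding="utf-8")
    updated = text.replace(old, new)
    if updated == text:
        return False  # nothing to rewrite, so the file is never touched
    path.write_text(updated, encoding="utf-8")
    # Put the timestamps back so make does not see the file as newer.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return True
```

Skipping the write when nothing changes also keeps a repeat run from touching files at all.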

On the full translated tree this takes the build output from 325 MB to
119 MB (images 238 MB to 30 MB) with all references verified.
@BsAtHome

Contributor

> I went with a post-build pass rather than doing this in image_resolver.rb.

If it works as intended without too much extra work, why not? I trust you have weighed the options and gone with the better one :-)

I'll have a look later.

@grandixximo

Contributor Author

Thanks. FWIW I did spend a fair bit of time weighing the build-integrated alternative before settling on the post-build pass.

This PR adds just one file and it is readable, but I agree the reason it has to exist is ugly: the build creates the duplicates and then this cleans them up.

The real alternative is not only refactoring the Ruby resolver but also shifting the build from zip to tar, so the resolver can place symlinks and have them survive. That preserves the dedup all the way through (artifact, deb, fetch), but it is a larger blast radius and would need coordination with @hdiethelm.
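
To illustrate the tar half of that idea only: a small sketch showing that Python's `tarfile` stores a relative symlink as a symlink rather than as a copy of its target. The directory layout and file names are made up, and this says nothing about how the existing zip step behaves.

```python
import os
import tarfile
from pathlib import Path


def symlink_survives_tar(workdir: Path) -> bool:
    """Build a tiny tree with one symlinked image, tar it, and inspect the member."""
    tree = workdir / "tree"
    (tree / "image").mkdir(parents=True)
    (tree / "de").mkdir()
    (tree / "image" / "logo.png").write_bytes(b"fake image bytes")
    # Hypothetical layout: the per-language path points at the shared copy.
    os.symlink("../image/logo.png", tree / "de" / "logo.png")

    archive = workdir / "docs.tar"
    with tarfile.open(archive, "w") as tar:
        tar.add(tree, arcname="tree")  # dereference defaults to False

    with tarfile.open(archive) as tar:
        member = tar.getmember("tree/de/logo.png")
        return member.issym() and member.linkname == "../image/logo.png"
```

Whether every consumer of the archive (artifact, deb, fetch) keeps the links intact would still need checking end to end.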

If we agree on that shape instead, the result is a more elegant structure, and I think genuinely better. Happy to go that way if you and @hdiethelm are on board.
