docs: deduplicate built HTML images into a shared image/ tree #4157
grandixximo wants to merge 1 commit into
The HTML doc build copies every referenced image into each language and topic directory, so the built tree carries the same bytes many times over (238 MB of images, only 30 MB unique). Add docs/src/tools/dedup-images.py: it collapses byte-identical images (SHA-256) into a shared root image/ tree, keeps only translated overrides under <lang>/image/, and rewrites every src and click-to-enlarge href to match. Dry-run by default, idempotent, and self-verifying: after --apply it re-resolves every reference and fails if any is broken. Wire it into the htmldocs build as a final .dedup-images-stamp step. The tool preserves the mtime of every HTML file it rewrites, so a second `make htmldocs` does no work. On the full translated tree this takes the build output from 325 MB to 119 MB (images 238 MB to 30 MB) with all references verified.
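The grouping step described above can be sketched roughly as follows. This is a minimal illustration of SHA-256 based duplicate detection, not the code in `dedup-images.py`: the function names, the `IMAGE_SUFFIXES` set, and the directory layout are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; the real tool may use a different list.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_images(root: Path) -> dict[str, list[Path]]:
    """Map each SHA-256 digest to the image files sharing it (groups of 2 or more)."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            groups[sha256_of(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def wasted_bytes(groups: dict[str, list[Path]]) -> int:
    """Bytes that would be saved by keeping a single copy per duplicate group."""
    return sum(paths[0].stat().st_size * (len(paths) - 1) for paths in groups.values())
```

Calling `find_duplicate_images(Path("html"))` on a built tree (the `html` path is hypothetical) reports which files could collapse into one shared copy, and `wasted_bytes` gives a dry-run style estimate of the saving without touching anything on disk.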
If it works as intended without too much extra work... why not. I trust you have weighed the options and gone for the better one :-) I'll have a look later.
Thanks. FWIW I did spend a fair bit of time weighing the build-integrated alternative before settling on the post-build pass. This PR adds just one file and it is readable, but I agree the reason it has to exist is ugly: the build creates the duplicates and then this cleans them up. The real alternative is not only refactoring the Ruby resolver but also shifting the build from zip to tar, so the resolver can place symlinks and have them survive. That preserves the dedup all the way through (artifact, deb, fetch), but it is a larger blast radius and would need coordination with @hdiethelm. If we agree on that shape instead, the result is a more elegant structure, and I think genuinely better. Happy to go that way if you and @hdiethelm are on board.
The HTML doc build copies every referenced image into each language and topic directory, so the built tree carries the same bytes many times over (238 MB of images, only 30 MB unique).
This adds `docs/src/tools/dedup-images.py` and wires it into the `htmldocs` build: it collapses byte-identical images (SHA-256) into a shared root `image/` tree, keeps only translated overrides under `<lang>/image/`, and rewrites every `src` and click-to-enlarge `href`. It is dry-run by default, idempotent, and self-verifying (after `--apply` it re-resolves every reference and fails if any is broken), and it preserves HTML mtimes so a second `make htmldocs` does no work.
On the full translated tree the build output goes from 325 MB to 119 MB (images 238 MB to 30 MB), with all references verified.
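For readers wondering how a rewrite can leave `make` convinced that nothing changed, here is a small sketch of an mtime-preserving, idempotent reference rewrite. It is an assumption-laden illustration rather than the tool's actual implementation: `rewrite_preserving_mtime` is a hypothetical helper, and it uses a plain quoted-string replacement where the real tool presumably parses `src` and `href` attributes more carefully.

```python
import os
from pathlib import Path


def rewrite_preserving_mtime(html_path: Path, old_ref: str, new_ref: str) -> bool:
    """Replace a quoted image reference in an HTML file and restore its timestamps.

    Returns True if the file was modified, False if there was nothing to rewrite.
    """
    stat_before = html_path.stat()
    text = html_path.read_text(encoding="utf-8")

    # Match the reference together with its surrounding quotes so that an
    # already rewritten value such as "../image/a.png" is not touched again.
    updated = text.replace(f'"{old_ref}"', f'"{new_ref}"')
    if updated == text:
        return False  # idempotent: a second run changes nothing

    html_path.write_text(updated, encoding="utf-8")

    # Put the original access and modification times back so that a
    # timestamp-based build system does not see the file as newer.
    os.utime(html_path, ns=(stat_before.st_atime_ns, stat_before.st_mtime_ns))
    return True
```

Because the timestamps are restored after the write, any make rule that depends on the HTML file still sees it as older than its stamp, which is one plausible way to get the "second run does no work" behaviour described above.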
@BsAtHome I went with a post-build pass rather than doing this in `image_resolver.rb`. The resolver is PDF-only today, and the bulk of the duplication is plain untranslated English images that the `_en`/`_<lang>` logic never touches, so a single-source-of-truth approach would mean a bigger refactor of the image stamps to defer to the resolver. The size and build-time results are the same either way. Happy to take the DRY refactor route instead if you would prefer it.