Skip to content

feat: sync post attachments (images, PDFs, PowerPoints) to S3#181

Open
bbornino wants to merge 2 commits into
playfulprogramming:mainfrom
bbornino:86-post-attachments
Open

feat: sync post attachments (images, PDFs, PowerPoints) to S3#181
bbornino wants to merge 2 commits into
playfulprogramming:mainfrom
bbornino:86-post-attachments

Conversation

@bbornino

@bbornino bbornino commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #86.

When a post is synced, any non-markdown files in that post's GitHub folder — images, PDFs, PowerPoints, and files nested in subfolders — are now uploaded to S3 and tracked in a new post_attachments table, with cleanup when files are removed from the repo or the post itself is deleted.

This is unrelated to the existing post_images table/task, which only covers generated banner and social-share images.

Schema

  • New post_attachments table (packages/db/src/schema/post-attachments.ts): postSlug (FK → posts.slug, cascade), attachmentName, attachmentKey, nullable width/height. Composite PK on (postSlug, attachmentName), following the postTags/postAuthors pattern (no synthetic id column).

sync-post processor

  • After the existing markdown-upload phase, a new phase recursively lists everything in the post's base folder (excluding index*.md locale files), including nested subfolders.
  • Each attachment is fetched via getContentsRawStream. Images (by extension) are resized and converted to JPEG, mirroring the approach in sync-author's profile image pipeline. Non-images are uploaded as-is with a small extension→MIME-type map.
  • S3 key naming for images: since the resize step converts the actual bytes to JPEG, the S3 key extension is normalized to .jpeg (e.g. banner.pngposts/{post}/attachments/banner.jpeg) rather than keeping the original extension, so nothing infers content-type from a mismatched extension. attachmentName in the database still stores the original filename exactly as it appeared in the repo. Non-image keys keep their original extension.
  • Diffing: previous post_attachments rows are compared against the freshly discovered set by attachmentName. Missing attachments are deleted from S3. For attachments still present, an MD5 hash of the final (post-resize, if applicable) content is compared against the existing S3 object's ETag (via the already-present but previously-unused s3.matchesEtag helper) — unchanged content is skipped, changed content is deleted and re-uploaded under the same key. The full new row set is written to post_attachments in the same transaction as the rest of the post's data.
  • In the existing 404 branch (post fully removed from the repo), the post's post_attachments rows are now fetched and deleted from S3 before the posts row is deleted, letting the cascading FK handle the database-side cleanup.

Tests

Added coverage in sync-post/processor.test.ts for: new attachment upload (including image resize and subfolder recursion), the diffing behavior (skip-unchanged / delete-and-reupload-changed / remove-deleted), and S3 cleanup in the post-deletion (404) cascade. Also extended test-utils/setup.ts with postAttachments and s3.exists/s3.matchesEtag mocks used by these tests.

Follow-ups (not addressed in this PR)

  • Shared image-resize helper: this PR adds its own local resize/convert-to-JPEG helper in sync-post/processor.ts rather than extracting the logic already duplicated in sync-author/processor.ts into a shared utility. Extracting that felt like a bigger refactor than this issue calls for — worth doing as its own follow-up.
  • Pre-existing gap, not introduced by this PR: the 404 branch (full post removal) still doesn't clean up the post's markdown content.md files from S3 for any locale — only the new attachment cleanup was added here. That gap predates this change and is worth its own follow-up.
  • Unhandled stream errors in image resizing: resizeAttachmentImage pipes Readable.fromWeb(stream) into the sharp pipeline without explicit error handling on the source stream, so a mid-transfer GitHub fetch failure could surface as an unhandled exception rather than a clean job failure. This same pattern already exists, unfixed, in sync-author's processProfileImg, which this code intentionally mirrors — worth fixing in both places together as its own follow-up rather than patching just the new copy here.

Test plan

  • pnpm test:unit passes (lint, knip, publint, sherif, vitest) — confirmed locally
  • pnpm prettier check passes — confirmed locally
  • Manual smoke test: sync a post with a real attachment folder locally against Minio and confirm objects land under posts/{post}/attachments/...

Summary by CodeRabbit

  • New Features

    • Post sync now includes attachment support, keeping uploaded files aligned with the content source.
    • Image attachments are automatically optimized for upload, including size limits and JPEG conversion when needed.
  • Bug Fixes

    • Removed attachments are now cleaned up automatically when posts change or are deleted.
    • Updated attachment handling to avoid re-uploading unchanged files while still refreshing modified ones.

Adds a post_attachments table and syncs non-markdown files (images, PDFs,
PowerPoints, nested subfolders) from a post's GitHub folder to S3, with
image resize/convert, content diffing, and cleanup on removal or post
deletion.
@coderabbitai

coderabbitai Bot commented Jul 5, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@bbornino, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 25 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdc47d3a-9e05-47fa-b401-445e7770d68f

📥 Commits

Reviewing files that changed from the base of the PR and between 2b1d7c7 and a96aed6.

📒 Files selected for processing (2)
  • apps/worker/src/tasks/sync-post/processor.test.ts
  • apps/worker/src/tasks/sync-post/processor.ts
📝 Walkthrough

Walkthrough

Adds a post_attachments database table and schema with a migration and cascading foreign key to posts. Extends the sync-post worker processor to discover, resize/upload, diff, and remove post attachments in S3, persisting attachment rows in the database, with corresponding test and mock updates.

Changes

Post attachments feature

Layer / File(s) Summary
Schema, migration, and export
packages/db/src/schema/post-attachments.ts, packages/db/src/schema/index.ts, packages/db/drizzle/20260705145304_curious_zuras/*
Adds postAttachments Drizzle table (composite primary key on postSlug/attachmentName, cascading FK to posts), matching SQL migration, generated snapshot, and schema re-export.
Attachment discovery & image processing helpers
apps/worker/src/tasks/sync-post/processor.ts
Adds recursive GitHub content traversal, image-extension detection, MIME-type mapping, sharp-based resizing to JPEG, and MD5-based ETag/hash generation for change detection.
Attachment removal on post deletion
apps/worker/src/tasks/sync-post/processor.ts, apps/worker/src/tasks/sync-post/processor.test.ts
Extends the 404 post-removal path to query existing attachment rows and remove each S3 object before deleting the post; adds test assertions for s3.remove calls.
Upload/diff/persist phase
apps/worker/src/tasks/sync-post/processor.ts, apps/worker/src/tasks/sync-post/processor.test.ts, apps/worker/test-utils/setup.ts
Adds a sync phase that discovers attachments, uploads/resizes new or changed ones (skipping unchanged via ETag match), removes stale S3 objects, and persists attachmentRows in the DB transaction; updates test mocks (s3.exists, s3.matchesEtag, mocked postAttachments table) and adds upload/diffing test cases.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Processor
  participant GitHub
  participant S3
  participant DB

  Processor->>GitHub: collectAttachmentEntries(post folder)
  Processor->>DB: select existing postAttachments rows
  Processor->>S3: matchesEtag / exists per attachment
  alt attachment changed or missing
    Processor->>S3: remove old object
    Processor->>S3: upload resized/converted attachment
  else unchanged
    Processor-->>Processor: skip upload
  end
  Processor->>S3: remove attachments no longer in repo
  Processor->>DB: delete old postAttachments rows, insert attachmentRows
Loading

Possibly related PRs

  • playfulprogramming/hoof#169: Modifies the same post-deletion flow in apps/worker/src/tasks/sync-post/processor.ts, overlapping with the attachment removal logic added here.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The changes implement attachment uploads, image processing, tracking in post_attachments, and S3 cleanup on removal or post deletion.
Out of Scope Changes check ✅ Passed The added schema, migration, processor logic, tests, and snapshot all support the attachment-sync objective and do not appear unrelated.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: syncing post attachments to S3.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
apps/worker/src/tasks/sync-post/processor.ts (1)

172-184: 🎯 Functional Correctness | 🔵 Trivial

Sequential S3 removal without failure isolation.

If s3.remove throws for one attachment mid-loop, the exception propagates, potentially leaving the post row undeleted while some S3 objects are already removed (partial state). This is likely self-healing on job retry (S3 deletes are typically idempotent), but consider Promise.allSettled or a try/catch-and-continue to avoid one bad key blocking the rest of the cleanup and the eventual post deletion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/worker/src/tasks/sync-post/processor.ts` around lines 172 - 184, The
attachment cleanup in sync-post processing is doing sequential S3 deletes with
no failure isolation, so one bad key can stop the rest of the cleanup and block
the eventual post deletion. Update the removal loop in processor.ts around the
removedAttachmentRows handling to make each s3.remove call fault-tolerant,
either by using Promise.allSettled or by wrapping each delete in a try/catch and
continuing. Keep the existing bucket lookup and logging behavior, but ensure the
cleanup continues for all attachmentKey values even if one delete fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/worker/src/tasks/sync-post/processor.test.ts`:
- Around line 1090-1097: The “unchanged attachment was NOT re-uploaded”
assertion in processor.test.ts is vacuous because the matcher uses
expect.anything() for arguments where s3.upload passes literal undefined, so it
can never match a real call. Update the check around the s3.upload expectation
to match the actual call signature used by the sync-post processor, especially
the third argument in s3.upload, and verify the unchanged.txt key against the
correct arguments so the assertion truly fails if upload is invoked. Use the
s3.upload call site in processor.ts and the existing test around unchanged
attachment handling to align the matcher with the real invocation shape.

In `@apps/worker/src/tasks/sync-post/processor.ts`:
- Around line 74-88: The resizeAttachmentImage helper is upscaling small inputs
because sharp().resize uses fit: "inside" without disabling enlargement. Update
resizeAttachmentImage in processor.ts to treat ATTACHMENT_IMAGE_SIZE_MAX as a
true cap by preventing enlargement when the source image is smaller, then adjust
the processor.test.ts expectations for the 1×1 fixture since width and height
should remain at the original small size instead of 2048×2048.
- Around line 74-88: The attachment resizing flow in resizeAttachmentImage
currently pipes Readable.fromWeb(stream) into the sharp pipeline without
explicit error handling, which can let source stream failures crash the worker.
Update this helper to use stream.pipeline or stream/promises.pipeline with the
existing sharp pipeline so errors are propagated as job failures, or add an
explicit error handler on the source stream; keep the fix localized to
resizeAttachmentImage in processor.ts.
- Around line 368-380: The changed-attachment flow in processor.ts deletes the
old S3 object before uploading the replacement, but previous.attachmentKey
usually matches attachmentKey because it is deterministically derived from
relativePath and isImage. Update the sync logic to skip the s3.remove call when
the key is unchanged and let s3.upload in the existing upload path overwrite the
object directly; keep the delete only for cases where previous.attachmentKey
differs from the new key.
- Around line 327-390: The attachment processing in processor.ts is using
Promise.all over attachmentEntries, which can buffer and upload every file at
once and spike memory/rate usage. Update the attachmentRows flow to use bounded
concurrency or batching instead of unbounded Promise.all, while keeping the
per-attachment logic inside the map callback that fetches via
github.getContentsRawStream, handles resizeAttachmentImage, and uploads through
s3.upload. Use a limiter or chunked processing so only a small number of
attachments are buffered in memory simultaneously.
- Around line 344-358: The image upload path in processor.ts normalizes every
image to the same .jpeg key, which can make different attachments with the same
basename overwrite each other in S3. Update the attachment handling in the image
branch of the sync-post processor so the storage key preserves uniqueness,
either by keeping the original relativePath/extension in keyRelativePath or by
detecting and rejecting duplicate basenames before the Promise.all upload. Make
sure the fix is applied where resizeAttachmentImage and withExtension are used
so each attachment row maps to a distinct object key.

---

Nitpick comments:
In `@apps/worker/src/tasks/sync-post/processor.ts`:
- Around line 172-184: The attachment cleanup in sync-post processing is doing
sequential S3 deletes with no failure isolation, so one bad key can stop the
rest of the cleanup and block the eventual post deletion. Update the removal
loop in processor.ts around the removedAttachmentRows handling to make each
s3.remove call fault-tolerant, either by using Promise.allSettled or by wrapping
each delete in a try/catch and continuing. Keep the existing bucket lookup and
logging behavior, but ensure the cleanup continues for all attachmentKey values
even if one delete fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 799b3a89-6552-494c-a4c8-5f9f2a762ce2

📥 Commits

Reviewing files that changed from the base of the PR and between 086e182 and 2b1d7c7.

📒 Files selected for processing (7)
  • apps/worker/src/tasks/sync-post/processor.test.ts
  • apps/worker/src/tasks/sync-post/processor.ts
  • apps/worker/test-utils/setup.ts
  • packages/db/drizzle/20260705145304_curious_zuras/migration.sql
  • packages/db/drizzle/20260705145304_curious_zuras/snapshot.json
  • packages/db/src/schema/index.ts
  • packages/db/src/schema/post-attachments.ts

Comment thread apps/worker/src/tasks/sync-post/processor.test.ts
Comment thread apps/worker/src/tasks/sync-post/processor.ts
Comment thread apps/worker/src/tasks/sync-post/processor.ts Outdated
Comment thread apps/worker/src/tasks/sync-post/processor.ts
Comment thread apps/worker/src/tasks/sync-post/processor.ts
…fulprogramming#181)

Makes 404-branch S3 cleanup resilient to individual failures, caps
resize to avoid upscaling small images, batches attachment uploads to
a concurrency limit, avoids same-basename key collisions between
converted images and their originals, skips a needless delete when an
attachment's key doesn't change, and fixes a vacuous test assertion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upload "post attachments" in sync-post task

1 participant