feat: normalize genre variants to a canonical form #962
Merged
Collapse genre spelling variants (e.g. "Hip Hop"/"hip-hop"/"hiphop", "r&b"/"rnb") to a single canonical name across the read paths.

- Add NormalizeGenre() (api/genre_normalize.go): trims and collapses whitespace, title-cases, and maps known special cases (R&B, EDM, DJ, Hip Hop, Drum & Bass, ...) that should not be plain title-cased.
- /v1/genres/popular: normalize names, merge and sum counts for variants that collapse to the same canonical name, then re-sort and apply min_count to the merged totals.
- Normalize the `genre` filter param on the trending, underground, latest, and users/genre/top endpoints so case/whitespace variants match the canonical stored value.
- Add TestGenreNormalize covering trim, casing, hip-hop/hiphop -> Hip Hop, r&b -> R&B, and already-canonical pass-through.

The Go service does not write the tracks.genre column (it is populated upstream by the discovery provider), so normalization is applied on read. Fully collapsing variants at rest requires normalizing on write upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
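For readers who want a concrete picture of the steps this commit message lists (trim, collapse whitespace, special-case lookup, title-case fallback), here is a minimal sketch. It is not the code in api/genre_normalize.go: the `specialCases` table, the `titleCase` helper, and the lookup-key scheme are assumptions made for illustration, and only the function name `NormalizeGenre` comes from the PR.

```go
package main

import (
	"strings"
	"unicode"
)

// specialCases maps a lowercased key with spaces and hyphens removed to a
// display form. Illustrative entries only, taken from the examples in the
// commit message; the real table may differ.
var specialCases = map[string]string{
	"hiphop":    "Hip Hop",
	"r&b":       "R&B",
	"rnb":       "R&B",
	"edm":       "EDM",
	"dj":        "DJ",
	"drum&bass": "Drum & Bass",
}

// titleCase uppercases the first letter of every run of letters/digits and
// lowercases the rest, so "hip-hop/rap" becomes "Hip-Hop/Rap".
func titleCase(s string) string {
	var b strings.Builder
	atWordStart := true
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if atWordStart {
				b.WriteRune(unicode.ToUpper(r))
			} else {
				b.WriteRune(unicode.ToLower(r))
			}
			atWordStart = false
			continue
		}
		// Separators (space, '-', '/', '&', ...) are copied through and
		// start a new word.
		b.WriteRune(r)
		atWordStart = true
	}
	return b.String()
}

// NormalizeGenre trims and collapses whitespace, then returns the special-case
// form if one is known, else a title-cased version of the input.
func NormalizeGenre(raw string) string {
	collapsed := strings.Join(strings.Fields(raw), " ")
	if collapsed == "" {
		return ""
	}
	key := strings.NewReplacer(" ", "", "-", "").Replace(strings.ToLower(collapsed))
	if canonical, ok := specialCases[key]; ok {
		return canonical
	}
	return titleCase(collapsed)
}
```

With this sketch, `NormalizeGenre("  hip   hop ")`, `NormalizeGenre("hip-hop")` and `NormalizeGenre("HIPHOP")` all produce `Hip Hop`. Note that generic title-casing also lowercases interior capitals, which matters for the filter regression described in the follow-up commit.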
Normalizing the `genre` query param broke filtering: the tracks.genre column is written upstream (discovery provider) and is NOT normalized at rest, so canonicalizing the param no longer matches the raw stored value. CI surfaced this via TestGetLatestWithGenre: a query for "LatestTestGenreA" was title-cased to "Latesttestgenrea" and matched zero rows.

Revert NormalizeGenre on the trending, underground, latest, and users/genre/top filter params. Keep normalization only on the /v1/genres/popular response, which is a display-layer transform that does not depend on matching stored values.

Fully collapsing variants for filtering requires normalizing genre on write upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
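The failure mode is easy to reproduce in isolation. The sketch below is an assumption-laden stand-in, not the service's query code: `countMatches` mimics an exact-match `WHERE genre = $1` filter over an in-memory slice, and `titleFirst` is a deliberately crude substitute for the reverted normalization (ASCII input only).

```go
package main

import "strings"

// titleFirst is a hypothetical stand-in for the reverted param normalization:
// trim, uppercase the first byte, lowercase the rest. ASCII input only.
func titleFirst(s string) string {
	s = strings.TrimSpace(s)
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + strings.ToLower(s[1:])
}

// countMatches mimics an exact-match filter (WHERE genre = $1) against raw
// stored values that were never normalized at rest.
func countMatches(stored []string, param string) int {
	n := 0
	for _, genre := range stored {
		if genre == param {
			n++
		}
	}
	return n
}
```

Given stored values `LatestTestGenreA`, `LatestTestGenreA`, `hip-hop`, filtering on the raw param `LatestTestGenreA` finds two rows, while filtering on `titleFirst("LatestTestGenreA")` (that is, `Latesttestgenrea`) finds none. Normalizing only one side of an equality comparison cannot work unless the other side is already canonical.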
NormalizeGenre now emits the protocol's canonical genre spellings as defined by go-openaudio's GenreAllowlist (the upstream ETL genre write path), so the API's normalized output matches what the indexer treats as canonical:

- "Hip Hop" -> "Hip-Hop/Rap"
- "R&B" -> "R&B/Soul"
- Drum & Bass, Lo-Fi unchanged (already allowlist forms)

Drop the speculative K-Pop/J-Pop entries that are not in the allowlist; they now fall through to generic title-casing. EDM/DJ are kept purely as acronym casing fixes (not allowlist genres).

Test assertions updated to the new canonical forms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
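One way to keep a normalizer aligned with an allowlist is to derive the lookup table from the allowlist itself and add only the shorthand aliases by hand. The sketch below assumes a local `allowlist` slice containing just the four spellings named in this commit message; it does not import or reproduce go-openaudio's actual GenreAllowlist, whose type and contents are not shown here. `squash`, `aliases`, and `CanonicalGenre` are hypothetical names.

```go
package main

import "strings"

// allowlist is a hypothetical local stand-in holding only the four canonical
// spellings mentioned in the commit message.
var allowlist = []string{"Hip-Hop/Rap", "R&B/Soul", "Drum & Bass", "Lo-Fi"}

// aliases maps shorthand spellings (already squashed) to allowlist entries.
var aliases = map[string]string{
	"hiphop": "Hip-Hop/Rap",
	"r&b":    "R&B/Soul",
	"rnb":    "R&B/Soul",
}

// squash lowercases s and removes whitespace and hyphens so spelling variants
// share one lookup key.
func squash(s string) string {
	joined := strings.Join(strings.Fields(strings.ToLower(s)), "")
	return strings.ReplaceAll(joined, "-", "")
}

// index is built once: every allowlist entry is reachable by its own squashed
// key, plus the hand-written aliases.
var index = buildIndex()

func buildIndex() map[string]string {
	idx := make(map[string]string, len(allowlist)+len(aliases))
	for _, entry := range allowlist {
		idx[squash(entry)] = entry
	}
	for key, entry := range aliases {
		idx[key] = entry
	}
	return idx
}

// CanonicalGenre returns the allowlist spelling for raw, or false when raw is
// not a known variant and the caller should fall back to generic title-casing.
func CanonicalGenre(raw string) (string, bool) {
	canonical, ok := index[squash(raw)]
	return canonical, ok
}
```

Under these assumptions `CanonicalGenre("hip hop")` yields `Hip-Hop/Rap`, `CanonicalGenre("lo fi")` yields `Lo-Fi`, and `CanonicalGenre("K-Pop")` reports no match, which corresponds to the fall-through behaviour described above.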
dylanjeffers added a commit that referenced this pull request Jun 18, 2026
## What

Bumps both go-openaudio pins to the merge commit of **go-openaudio #367** (`5aa118b67bc19474a56f81a6665f2f47854d72e4`), which normalizes genre at the **ETL write path** (`entity_manager` track create/update).

```
github.com/OpenAudio/go-openaudio         v1.3.1-...-7758ae709d18 -> v1.4.1-0.20260618012656-5aa118b67bc1
github.com/OpenAudio/go-openaudio/pkg/etl v1.3.1-...-7758ae709d18 -> v1.4.1-0.20260618012656-5aa118b67bc1
```

Both pins are kept in lockstep (the root module and the `/pkg/etl` submodule), as before.

## Why

This repo runs the production indexer by vendoring go-openaudio's ETL (`indexer/indexer.go` → `etl.Indexer.Run()`); it does not implement track indexing itself. #367 makes the indexer write **canonical genre values at rest**, so genre variants (`hip-hop`/`hiphop` → `Hip-Hop/Rap`, etc.) now collapse on write upstream. That is the durable fix behind the read-side, display-only stopgap merged in #962, and unlike the API read path, write-side normalization also fixes genre **filtering**.

## Verification

- `go get` both modules @ the SHA, then `go mod tidy`: only `go.mod`/`go.sum` changed.
- `go build ./...`: passes
- `go vet ./...`: passes

## Heads-up for reviewers

With the indexer now normalizing genre, the ETL parity check against the legacy Python discovery-provider may begin flagging `genre` column diffs (Python doesn't normalize). That's expected and is handled on the go-openaudio side; flag to whoever owns the parity job if it surfaces.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
## What

Collapses genre spelling variants (e.g. `Hip Hop`/`hip-hop`/`hiphop`, `r&b`/`rnb`) to a single canonical form, building on the `/v1/genres/popular` endpoint added in #961.

## Design decisions

Normalize on read (display only), not on write, and not on filter params. The Go bridge service does not write the `tracks.genre` column: tracks (including genre) are populated upstream by the Python discovery provider during chain indexing. So there is no write path in this repo to hook.

`NormalizeGenre()` (`api/genre_normalize.go`):

- trims and collapses whitespace
- title-cases (e.g. `hip-hop/rap` → `Hip-Hop/Rap`)
- maps special cases: `R&B`, `EDM`, `DJ`, `Hip Hop`, `Drum & Bass`, `Lo-Fi`, `K-Pop`, `J-Pop`
- already-canonical values (`Electronic`, `R&B`, `Hip Hop`) pass through unchanged

`/v1/genres/popular`: normalize each name, then merge and sum counts for variants that collapse together, re-sort by total desc, and apply `min_count` to the merged total. Caveat (in code): the SQL still `GROUP BY`s and paginates on the raw genre, so merging only catches variants within the same page; full at-rest aggregation requires normalizing on write upstream.

Genre filter params were intentionally NOT normalized. An earlier revision normalized the `genre` param on the trending / underground / latest / users-genre-top endpoints, but that is unsound: because the stored `genre` column is not normalized at rest, canonicalizing the param stops it matching the raw stored value. CI caught this (`TestGetLatestWithGenre`: a query for `LatestTestGenreA` was title-cased to `Latesttestgenrea` and matched zero rows). Reverted in 01ba2ba; filtering is left untouched. Collapsing variants for filtering requires normalizing genre on write upstream.

## Tests

`TestGenreNormalize` covers trim, internal-whitespace collapse, casing, `hip-hop`/`hiphop` → `Hip Hop`, `r&b`/`rnb` → `R&B`, special cases, and already-canonical pass-through. `go build` and `go vet ./api/` are clean. DB-backed tests run in CI (Docker is broken in the local dev environment).

🤖 Generated with Claude Code
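The merge step described for `/v1/genres/popular` (normalize, sum counts per canonical name, apply `min_count` to the merged total, sort by total descending) can be sketched as follows. This is an illustration under stated assumptions, not the handler's code: `GenreCount`, `MergeGenreCounts`, and the tie-break by name are hypothetical, and `foldGenre` is a tiny stand-in for `NormalizeGenre()` that only knows the Hip Hop variants.

```go
package main

import (
	"sort"
	"strings"
)

// GenreCount is a hypothetical row shape: a genre name and its track count.
type GenreCount struct {
	Name  string
	Count int
}

// foldGenre is a minimal stand-in for NormalizeGenre(): it collapses
// whitespace and maps the Hip Hop spelling variants to one name.
func foldGenre(s string) string {
	collapsed := strings.Join(strings.Fields(s), " ")
	key := strings.ReplaceAll(strings.ReplaceAll(strings.ToLower(collapsed), " ", ""), "-", "")
	if key == "hiphop" {
		return "Hip Hop"
	}
	return collapsed
}

// MergeGenreCounts sums counts for rows whose names normalize to the same
// value, drops merged totals below minCount, and sorts by total descending
// (ties broken by name so the order is deterministic).
func MergeGenreCounts(rows []GenreCount, normalize func(string) string, minCount int) []GenreCount {
	totals := make(map[string]int, len(rows))
	for _, row := range rows {
		totals[normalize(row.Name)] += row.Count
	}
	merged := make([]GenreCount, 0, len(totals))
	for name, total := range totals {
		// min_count is checked against the merged total, not the raw rows.
		if total >= minCount {
			merged = append(merged, GenreCount{Name: name, Count: total})
		}
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Count != merged[j].Count {
			return merged[i].Count > merged[j].Count
		}
		return merged[i].Name < merged[j].Name
	})
	return merged
}
```

For rows `Hip Hop: 5`, `hip-hop: 3`, `hiphop: 1`, `Electronic: 7`, `Jazz: 1` with a minimum of 2, the sketch returns `Hip Hop: 9` followed by `Electronic: 7`. The `hiphop` row survives only because the threshold is applied after merging. As the caveat in the description says, a merge like this can only combine rows that arrive on the same page of raw results.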