feat: normalize genre variants to a canonical form #962

Merged Jun 18, 2026
dylanjeffers merged 3 commits into main from feat/genre-normalization

@dylanjeffers dylanjeffers commented Jun 18, 2026

What

Collapses genre spelling variants — e.g. Hip Hop / hip-hop / hiphop, r&b / rnb — to a single canonical form, building on the /v1/genres/popular endpoint added in #961.

Design decisions

Normalize on read (display only), not on write, and not on filter params. The Go bridge service does not write the tracks.genre column — tracks (including genre) are populated upstream by the Python discovery provider during chain indexing. So there is no write path in this repo to hook.

NormalizeGenre() (api/genre_normalize.go):

  • trims surrounding whitespace, collapses internal whitespace runs
  • title-cases by default, preserving internal separators (hip-hop/rap → Hip-Hop/Rap)
  • maps known special cases via a collapsed lookup key (lowercase, alphanumeric-only) so every punctuation/spacing variant routes through one entry: R&B, EDM, DJ, Hip Hop, Drum & Bass, Lo-Fi, K-Pop, J-Pop
  • already-canonical values (Electronic, R&B, Hip Hop) pass through unchanged
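
As a rough illustration of the steps listed above, a minimal sketch might look like the following. It is not the code in api/genre_normalize.go: the names normalizeGenreSketch, collapseKey, and titleCase are hypothetical, and the lookup entries are assumptions based on the canonical spellings named in this description.

```go
package main

import (
	"strings"
	"unicode"
)

// specialCases maps a collapsed key (lowercase, letters and digits only) to a
// canonical spelling. The entries are illustrative assumptions, not the PR's table.
var specialCases = map[string]string{
	"rb":       "R&B",
	"rnb":      "R&B",
	"edm":      "EDM",
	"dj":       "DJ",
	"hiphop":   "Hip Hop",
	"drumbass": "Drum & Bass",
	"lofi":     "Lo-Fi",
	"kpop":     "K-Pop",
	"jpop":     "J-Pop",
}

// collapseKey lowercases s and drops everything except letters and digits,
// so "Hip Hop", "hip-hop", and "hiphop" all produce the key "hiphop".
func collapseKey(s string) string {
	var b strings.Builder
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(unicode.ToLower(r))
		}
	}
	return b.String()
}

// titleCase upper-cases the first letter of each word and lower-cases the
// rest. Separators such as ' ', '-', and '/' are copied through unchanged.
func titleCase(s string) string {
	var b strings.Builder
	startOfWord := true
	for _, r := range s {
		switch {
		case unicode.IsLetter(r) && startOfWord:
			b.WriteRune(unicode.ToUpper(r))
			startOfWord = false
		case unicode.IsLetter(r):
			b.WriteRune(unicode.ToLower(r))
		case unicode.IsDigit(r):
			b.WriteRune(r)
			startOfWord = false
		default:
			b.WriteRune(r)
			startOfWord = true
		}
	}
	return b.String()
}

// normalizeGenreSketch trims and collapses whitespace, then prefers a
// special-case entry and falls back to title-casing.
func normalizeGenreSketch(raw string) string {
	s := strings.Join(strings.Fields(raw), " ")
	if s == "" {
		return ""
	}
	if canonical, ok := specialCases[collapseKey(s)]; ok {
		return canonical
	}
	return titleCase(s)
}
```

In this sketch, "  hip-hop " and "hiphop" both resolve to "Hip Hop" through the collapsed key, while "hip-hop/rap" has no lookup entry and falls through to title-casing as "Hip-Hop/Rap".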

/v1/genres/popular — normalize each name, then merge and sum counts for variants that collapse together, re-sort by total desc, and apply min_count to the merged total. Caveat (in code): the SQL still GROUP BYs and paginates on the raw genre, so merging only catches variants within the same page; full at-rest aggregation requires normalizing on write upstream.
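
The merge step described above could be sketched roughly as follows. The GenreCount type, the mergeGenreCounts name, the inclusive (>=) min_count comparison, and the alphabetical tie-break are all assumptions made for this example; the normalizer is passed in as a parameter so the sketch stays independent of the real NormalizeGenre.

```go
package main

import "sort"

// GenreCount is a hypothetical row shape: one raw genre and its track count.
type GenreCount struct {
	Name  string
	Count int
}

// mergeGenreCounts normalizes each name, sums the counts of rows that collapse
// to the same canonical name, drops totals below minCount, and sorts by total
// descending. Ties are broken by name so the output order is deterministic.
func mergeGenreCounts(rows []GenreCount, normalize func(string) string, minCount int) []GenreCount {
	totals := make(map[string]int)
	for _, row := range rows {
		totals[normalize(row.Name)] += row.Count
	}

	merged := make([]GenreCount, 0, len(totals))
	for name, total := range totals {
		// min_count is checked against the merged total, not the raw rows.
		if total >= minCount {
			merged = append(merged, GenreCount{Name: name, Count: total})
		}
	}

	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Count != merged[j].Count {
			return merged[i].Count > merged[j].Count
		}
		return merged[i].Name < merged[j].Name
	})
	return merged
}
```

Because the input rows come from a single SQL page, a function like this can only merge variants that happen to land on the same page, which is the caveat noted above.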

Genre filter params were intentionally NOT normalized. An earlier revision normalized the genre param on the trending / underground / latest / users-genre-top endpoints, but that is unsound: because the stored genre column is not normalized at rest, canonicalizing the param stops it matching the raw stored value. CI caught this (TestGetLatestWithGenre: a query for LatestTestGenreA was title-cased to Latesttestgenrea and matched zero rows). Reverted in 01ba2ba — filtering is left untouched. Collapsing variants for filtering requires normalizing genre on write upstream.
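
The failure mode can be reproduced in miniature. The sketch below is an assumption-laden stand-in: countMatches imitates an exact-match filter over raw stored values, and naiveTitleCase imitates only the casing step for a single ASCII word; neither is code from this repo.

```go
package main

import "strings"

// countMatches imitates an exact-match genre filter over raw stored values.
func countMatches(stored []string, param string) int {
	n := 0
	for _, genre := range stored {
		if genre == param {
			n++
		}
	}
	return n
}

// naiveTitleCase upper-cases the first byte and lower-cases the rest.
// It is only meant for single ASCII words in this illustration.
func naiveTitleCase(s string) string {
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + strings.ToLower(s[1:])
}
```

With stored values "LatestTestGenreA" and "LatestTestGenreB", countMatches finds one row for the raw param "LatestTestGenreA" but none for naiveTitleCase("LatestTestGenreA"), which is "Latesttestgenrea". Normalizing only one side of the comparison is what breaks the match.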

Tests

TestGenreNormalize covers trim, internal-whitespace collapse, casing, hip-hop/hiphop → Hip Hop, r&b/rnb → R&B, special cases, and already-canonical pass-through.

go test ./api/ -run TestGenreNormal   # PASS (24 subtests)

go build and go vet ./api/ are clean. DB-backed tests run in CI (Docker is broken in the local dev environment).

🤖 Generated with Claude Code

dylanjeffers and others added 3 commits June 17, 2026 17:40
Collapse genre spelling variants (e.g. "Hip Hop"/"hip-hop"/"hiphop",
"r&b"/"rnb") to a single canonical name across the read paths.

- Add NormalizeGenre() (api/genre_normalize.go): trims + collapses
  whitespace, title-cases, and maps known special cases (R&B, EDM, DJ,
  Hip Hop, Drum & Bass, ...) that should not be plain title-cased.
- /v1/genres/popular: normalize names and merge + sum counts for
  variants that collapse to the same canonical name, then re-sort and
  apply min_count to the merged totals.
- Normalize the `genre` filter param on the trending, underground,
  latest, and users/genre/top endpoints so case/whitespace variants
  match the canonical stored value.
- Add TestGenreNormalize covering trim, casing, hip-hop/hiphop -> Hip
  Hop, r&b -> R&B, and already-canonical pass-through.

The Go service does not write the tracks.genre column (it is populated
upstream by the discovery provider), so normalization is applied on
read. Fully collapsing variants at rest requires normalizing on write
upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Normalizing the `genre` query param broke filtering: the tracks.genre
column is written upstream (discovery provider) and is NOT normalized at
rest, so canonicalizing the param no longer matches the raw stored
value. CI surfaced this via TestGetLatestWithGenre — a query for
"LatestTestGenreA" was title-cased to "Latesttestgenrea" and matched
zero rows.

Revert NormalizeGenre on the trending, underground, latest, and
users/genre/top filter params. Keep normalization only on the
/v1/genres/popular response, which is a display-layer transform that
does not depend on matching stored values. Fully collapsing variants
for filtering requires normalizing genre on write upstream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NormalizeGenre now emits the protocol's canonical genre spellings as
defined by go-openaudio's GenreAllowlist (the upstream ETL genre write
path), so the API's normalized output matches what the indexer treats
as canonical:

- "Hip Hop"      -> "Hip-Hop/Rap"
- "R&B"          -> "R&B/Soul"
- Drum & Bass, Lo-Fi unchanged (already allowlist forms)

Drop the speculative K-Pop/J-Pop entries that are not in the allowlist;
they now fall through to generic title-casing. EDM/DJ are kept purely as
acronym casing fixes (not allowlist genres). Test assertions updated to
the new canonical forms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@dylanjeffers dylanjeffers merged commit 0b5ce15 into main Jun 18, 2026
5 checks passed
@dylanjeffers dylanjeffers deleted the feat/genre-normalization branch June 18, 2026 01:26
dylanjeffers added a commit that referenced this pull request Jun 18, 2026
## What

Bumps both go-openaudio pins to the merge commit of **go-openaudio
#367** (`5aa118b67bc19474a56f81a6665f2f47854d72e4`), which normalizes
genre at the **ETL write path** (`entity_manager` track create/update).

```
github.com/OpenAudio/go-openaudio          v1.3.1-...-7758ae709d18 -> v1.4.1-0.20260618012656-5aa118b67bc1
github.com/OpenAudio/go-openaudio/pkg/etl  v1.3.1-...-7758ae709d18 -> v1.4.1-0.20260618012656-5aa118b67bc1
```

Both pins are kept in lockstep (the root module and the `/pkg/etl`
submodule), as before.

## Why

This repo runs the production indexer by vendoring go-openaudio's ETL
(`indexer/indexer.go` → `etl.Indexer.Run()`); it does not implement
track indexing itself. #367 makes the indexer write **canonical genre
values at rest**, so genre variants (`hip-hop`/`hiphop` → `Hip-Hop/Rap`,
etc.) now collapse on write upstream. That is the durable fix behind the
read-side, display-only stopgap merged in #962 — and unlike the API read
path, write-side normalization also fixes genre **filtering**.

## Verification

- `go get` both modules @ the SHA, then `go mod tidy` — only
`go.mod`/`go.sum` changed.
- `go build ./...` — passes
- `go vet ./...` — passes

## Heads-up for reviewers

With the indexer now normalizing genre, the ETL parity check against the
legacy Python discovery-provider may begin flagging `genre` column diffs
(Python doesn't normalize). That's expected and is handled on the
go-openaudio side — flag to whoever owns the parity job if it surfaces.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>