Skip to content

autotagging: introduce Source class for matching#6681

Open
snejus wants to merge 2 commits into
masterfrom
introduce-source
Open

autotagging: introduce Source class for matching#6681
snejus wants to merge 2 commits into
masterfrom
introduce-source

Conversation

@snejus
Copy link
Copy Markdown
Member

@snejus snejus commented May 29, 2026

beets.autotag Source-Centric Refactor

Fixes: #6684

Core change

This PR restructures autotagging around a new beets.autotag.source.Source object.

Before this change, album and singleton matching pulled "current metadata" from several places:

  • importer tasks
  • matching code
  • distance calculation
  • UI display code

After this change, those flows all receive the same input shape: a single Source value that carries:

  • the original items
  • the current display/search fields (artist, name)
  • extracted metadata in data
  • current external ID information (id, id_consensus)

That makes Source the new "current thing being matched" abstraction for both albums and tracks.


Architecture

1. New single input object: Source

Source centralizes metadata extraction that used to be scattered across the autotag pipeline.

Before
items
├─ importer computes current artist/album
├─ match code recomputes common tags
├─ distance recomputes common tags
└─ UI receives separate current values

After
Source
├─ source.artist
├─ source.name
├─ source.data
├─ source.items
├─ source.id / source.id_consensus
└─ source.va_likely

This is the biggest architectural shift in the PR.

2. tag_album() and tag_item() now operate on Source

The matching entry points now take a Source instead of a loose mix of:

  • raw items
  • item
  • current artist/title values
  • repeated ID lookups

That gives album and track matching the same calling convention and reduces special-case plumbing.

# Before
tag_album(items, ...)
tag_item(item, ...)

# After
tag_album(source, ...)
tag_item(source, ...)

3. distance() is now narrower and more explicit

distance() no longer derives "current metadata" internally from a list of items.

Instead, it receives:

  • original: the already-extracted metadata container
  • unmatched_count: the number of unmatched local tracks
Before
distance(items, album_info, item_info_pairs)
  -> recompute common tags from items
  -> infer unmatched track count

After
distance(source.data, album_info, item_info_pairs, unmatched_count)
  -> compare against already-prepared metadata

This is a good separation of responsibilities:

  • Source prepares current metadata
  • distance() only compares metadata

4. Importer tasks now cache the current source

ImportTask and SingletonImportTask now expose a cached source property:

  • album tasks use Source.from_items(...)
  • singleton tasks use Source.from_item(...)

That means importer state no longer needs separate cur_artist and cur_album fields, and downstream code can read from task.source instead.

5. UI/display code now depends on the same source model

Import session and display functions now use task.source.artist and task.source.name when showing candidate changes or prompting for manual search/ID entry.

This removes another copy of "current metadata" logic from the UI layer.

6. AttrDict moves to beets.util

AttrDict was previously defined inside beets.autotag.hooks. It now lives in beets.util.

High-level effect:

  • metadata-container behavior is treated as shared infrastructure
  • autotag-specific modules depend on a common utility instead of owning the abstraction

7. beets.autotag public API is flatter

Related cleanup continues by exporting more autotag types directly from beets.autotag, including Source.

This reduces import-path churn for callers and makes the package boundary easier to understand.


High-level impact

What this improves

1. One place defines "the current metadata we are matching from"

The main benefit is conceptual clarity.

Reviewers can now think about autotagging as:

local files -> Source -> candidate lookup -> distance -> proposal/UI

instead of following several separate paths that each rediscover the same metadata.

2. Album and singleton flows now look alike

Albums and single tracks now use the same structure and similar call patterns.

That should make future autotag changes easier because new behavior can usually be added once at the Source boundary rather than twice in parallel code paths.

3. Less coupling between importer, matcher, distance logic, and UI

Previously, each layer needed to know how to reconstruct "current artist/album/common tags".

Now:

  • importer builds Source
  • matching uses Source
  • UI displays from Source
  • distance() receives prepared metadata

Each layer does less guessing about the others.

4. Simpler data flow for debugging and future refactors

When a match looks wrong, there is now a clearer place to inspect the original input: Source.

That should help with debugging, logging, and future changes to metadata extraction rules.


Data flow at a glance

ImportTask / SingletonImportTask
        ↓
     `task.source`
        ↓
`tag_album(source)` / `tag_item(source)`
        ↓
candidate search + item/track assignment
        ↓
`distance(source.data, info, pairs, unmatched_count)`
        ↓
`Proposal`
        ↓
UI display and user choice

Notable follow-on cleanup included here

  • get_most_common_tags() now returns only the extracted metadata dictionary, not a (likelies, consensus) pair
  • call sites that depended on consensus now compute only the specific signals they need, such as id_consensus or va_likely
  • tests were updated to reflect the new Source-based flow
  • .git-blame-ignore-revs was updated for mechanical refactor commits

Reviewer guide

If reviewing this PR, the easiest way to read it is:

  1. Start with beets/autotag/source.py
  2. Then look at tag_album() and tag_item() in beets/autotag/match.py
  3. Then review the distance() signature change
  4. Finally scan importer/UI call sites to see how they now pass task.source

That path shows the architectural intent without getting lost in the mechanical updates.

Copilot AI review requested due to automatic review settings May 29, 2026 18:39
@snejus snejus requested review from a team, asardaes, henry-oberholtzer and semohr as code owners May 29, 2026 18:39
@github-actions
Copy link
Copy Markdown

Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 77.14286% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.64%. Comparing base (c4ed45a) to head (de7492d).

Files with missing lines Patch % Lines
beets/autotag/distance.py 30.43% 10 Missing and 6 partials ⚠️
beets/ui/commands/import_/session.py 43.75% 8 Missing and 1 partial ⚠️
beets/util/__init__.py 93.54% 1 Missing and 1 partial ⚠️
beetsplug/bench.py 33.33% 2 Missing ⚠️
beetsplug/ihate.py 0.00% 2 Missing ⚠️
beets/autotag/match.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6681      +/-   ##
==========================================
+ Coverage   72.60%   72.64%   +0.03%     
==========================================
  Files         162      163       +1     
  Lines       20810    20819       +9     
  Branches     3294     3290       -4     
==========================================
+ Hits        15109    15123      +14     
+ Misses       4977     4973       -4     
+ Partials      724      723       -1     
Files with missing lines Coverage Δ
beets/autotag/__init__.py 69.23% <100.00%> (+2.56%) ⬆️
beets/autotag/hooks.py 100.00% <100.00%> (+0.92%) ⬆️
beets/autotag/source.py 100.00% <100.00%> (ø)
beets/importer/tasks.py 91.28% <100.00%> (+0.14%) ⬆️
beets/ui/commands/import_/display.py 81.59% <100.00%> (ø)
beetsplug/mbpseudo.py 84.61% <100.00%> (ø)
beets/autotag/match.py 88.09% <92.85%> (-0.17%) ⬇️
beets/util/__init__.py 79.80% <93.54%> (+0.57%) ⬆️
beetsplug/bench.py 27.90% <33.33%> (-0.67%) ⬇️
beetsplug/ihate.py 59.37% <0.00%> (+3.81%) ⬆️
... and 2 more
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

big change — grug nervous but also see good idea. PR introduce Source value object so importer, matcher, distance, and UI all share one current-metadata shape instead of each layer re-cooking common tags from raw items. also move AttrDict, Match/AlbumMatch/TrackMatch, and many exports around so beets.autotag becomes flatter public surface.

Changes:

  • new beets/autotag/source.py with Source.from_items / Source.from_item; tag_album and tag_item now take Source and return only Proposal
  • distance() now take original: AttrDict + unmatched_count (no longer recompute common tags); get_most_common_tags() returns only likelies
  • AttrDict moved to beets.util; Match/AlbumMatch/TrackMatch moved to beets/autotag/match.py; cur_artist/cur_album removed from ImportTask; many import paths flattened to beets.autotag

Reviewed changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
beets/autotag/source.py new Source NamedTuple, central current-metadata object
beets/autotag/match.py tag_album/tag_item take Source; Match/AlbumMatch/TrackMatch moved here
beets/autotag/distance.py distance() takes original AttrDict + unmatched_count
beets/autotag/hooks.py AttrDict and Match classes removed; uses beets.util.AttrDict
beets/autotag/init.py flatter re-exports incl. Source, Distance, Match, etc.
beets/util/init.py AttrDict lives here; get_most_common_tags returns one dict
beets/importer/tasks.py task.source cached_property; removes cur_artist/cur_album; new lookup_candidates signatures
beets/ui/commands/import_/session.py choose_candidate/manual flows use task.source
beets/ui/commands/import_/display.py show_change/show_item_change take artist/name strings directly
beetsplug/mbpseudo.py use Source.from_items(...).data + unmatched_count for distance()
beetsplug/bench.py only updates tag_album import (call site still passes items — broken)
many beetsplug/* + tests mechanical import path updates to from beets.autotag import ...
.git-blame-ignore-revs adds refactor commit hashes

other notable findings: beetsplug/ihate.py still references removed task.cur_artist/cur_album; beetsplug/titlecase.py TYPE_CHECKING imports Info from beets.autotag where it is not exported; several docstrings (tag_album, get_most_common_tags, choose_candidate, Source.from_item, importer/stages.py) still describe pre-refactor return values or parameters.

Comments suppressed due to low confidence (1)

beets/ui/commands/import_/session.py:371

  • docstring still describe old parameters cur_artist, cur_album, itemcount, item and singleton. new signature only take candidates, rec, source, choices. update docstring so future grug not chase ghost args.
def choose_candidate(candidates, rec, source: Source, choices=[]):
    """Given a sorted list of candidates, ask the user for a selection
    of which candidate to use. Applies to both full albums and
    singletons  (tracks). Candidates are either AlbumMatch or TrackMatch
    objects depending on `singleton`. for albums, `cur_artist`,
    `cur_album`, and `itemcount` must be provided. For singletons,
    `item` must be provided.

Comment thread beetsplug/bench.py Outdated
Comment thread beets/importer/tasks.py
Comment thread beets/autotag/source.py Outdated
Comment thread beets/util/__init__.py
Comment thread beets/autotag/match.py
Comment thread beetsplug/titlecase.py
@snejus snejus force-pushed the introduce-source branch 3 times, most recently from 7322755 to 95e0a81 Compare May 29, 2026 19:35
@semohr
Copy link
Copy Markdown
Contributor

semohr commented May 29, 2026

Am having a bit of a hard time understanding the changes here since you also moved functions around quite a bit. For reviewing purpose, would you mind splitting the refactor into another PR?

@snejus
Copy link
Copy Markdown
Member Author

snejus commented May 30, 2026

Am having a bit of a hard time understanding the changes here since you also moved functions around quite a bit. For reviewing purpose, would you mind splitting the refactor into another PR?

Of course - I will split the first two commits into a separate PR.

@snejus
Copy link
Copy Markdown
Member Author

snejus commented May 30, 2026

See #6687

@snejus snejus changed the base branch from master to tidy-up-autotag May 30, 2026 12:20
@snejus snejus force-pushed the introduce-source branch from 95e0a81 to 34ae2e4 Compare May 30, 2026 12:23
@snejus snejus force-pushed the tidy-up-autotag branch from a2122e2 to 749f065 Compare May 30, 2026 12:30
@snejus snejus force-pushed the introduce-source branch from 34ae2e4 to a97eab5 Compare May 30, 2026 12:30
@snejus snejus requested a review from Copilot May 30, 2026 12:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Comment thread beets/autotag/distance.py Outdated
Comment thread beets/ui/commands/import_/session.py
Comment thread beets/ui/commands/import_/display.py Outdated
Comment thread beets/ui/commands/import_/session.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

test/ui/commands/test_import.py:94

  • grug spy bug here. show_change() signature now (source: Source, match: AlbumMatch) per beets/ui/commands/import_/display.py, but this test still pass old three-arg shape (artist_str, album_str, match). Test will fail at call time. Need a Source whose artist/name match the expected "another artist with ..." / "another album" strings used in the assertions below, e.g. build one via Source.from_items over items carrying those values, or construct Source(...) directly. grug skip fix because right Source shape is judgment call that affect expected snapshot text.
        show_change(
            f"another artist with {long_name}",
            "another album",
            AlbumMatch(change_dist, info, dict(item_info_pairs)),
        )

@snejus snejus force-pushed the introduce-source branch 2 times, most recently from da82f8f to 067663c Compare June 2, 2026 00:21
@snejus snejus force-pushed the tidy-up-autotag branch from f1d1187 to fd41e39 Compare June 2, 2026 00:29
@snejus snejus force-pushed the introduce-source branch 2 times, most recently from ab82b5c to b435443 Compare June 2, 2026 07:46
@snejus snejus force-pushed the tidy-up-autotag branch 3 times, most recently from ecf83ab to 3e67a14 Compare June 2, 2026 08:01
@snejus snejus force-pushed the introduce-source branch from b435443 to 054e0d6 Compare June 2, 2026 08:02
@snejus snejus force-pushed the tidy-up-autotag branch from 3e67a14 to 8e27f56 Compare June 2, 2026 08:16
@snejus snejus force-pushed the introduce-source branch from 054e0d6 to 864a80e Compare June 2, 2026 08:16
@snejus snejus force-pushed the tidy-up-autotag branch from 8e27f56 to 6f5e39e Compare June 2, 2026 14:47
@snejus snejus force-pushed the introduce-source branch from 864a80e to d60f180 Compare June 2, 2026 14:48
@snejus snejus force-pushed the tidy-up-autotag branch from 6f5e39e to b582f4c Compare June 2, 2026 15:28
@snejus snejus force-pushed the introduce-source branch from d60f180 to 605a387 Compare June 2, 2026 15:29
@snejus snejus force-pushed the tidy-up-autotag branch 5 times, most recently from 238a028 to fb9de7a Compare June 3, 2026 14:42
@snejus snejus force-pushed the introduce-source branch from 605a387 to 6db01e0 Compare June 3, 2026 14:42
Comment thread beets/autotag/distance.py

def distance(
items: Sequence[Item],
original: AttrDict[Any],
Copy link
Copy Markdown
Contributor

@semohr semohr May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume this should take a Source object now instead of an AttrDict[Any]?

The function arguments should only be these three in my opinion:

  • Candidate (AlbumInfo)
  • Source
  • Assignment (item_info_pairs)

See here for more information and my understanding of it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See part of the docstring:

    `original` is an AttrDict with the original album metadata.

It previously needed items in order to calculate likelies:

    likelies, _ = get_most_common_tags(items)

Now, this calculation is done by Source and the data is assigned to an attribute Source.data. It is then fed into distance, since it doesn't need any other data.

I made it an AttrDict so that keys can be accessed as attribute, to make these calls consistent:

    dist.add_string("album", original.album, album_info.album)

Copy link
Copy Markdown
Contributor

@semohr semohr Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we make Source.data a cached property and give Source as parameter here than. Seems like a slightly more clear abstraction for my mental model.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit weird that we need to precompute the get_most_common_tags in the from_item methods anyways.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we make Source.data a cached property and give Source as parameter here than. Seems like a slightly more clear abstraction for my mental model.

Two issues with this:

  1. data is different for each Source type so we compute it either in from_item or from_items
  2. data is needed in two places: for distance and in ImportTask.chosen_info so it does indeed need to be pre-computed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of that shouldn't be an issue anymore once the ImportTask and SentienlImportTask combine. This is one of the main goals here, right?

Comment thread beets/autotag/distance.py Outdated
Comment thread beets/autotag/source.py
)

@classmethod
def from_items(cls, items: Sequence[Item]) -> Source:
Copy link
Copy Markdown
Contributor

@semohr semohr Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the items creation into here? Maybe with a from_path alternate constructor?

I can see this being a bit difficulties with integrating this into current task factory but might simplify some things further and would align with my understanding of source.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I kind of see the direction here. Are you saying we could get Source within ImportTaskFactory.singleton and ImportTaskFactory.album? And then replace ImportTask.items with ImportTask.source in the construction?

Copy link
Copy Markdown
Contributor

@semohr semohr Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is roughtly my idea. If we dont work with items but work with the source it should allow us to abstract most operations regardless of if we look at an album or singleton.

Base automatically changed from tidy-up-autotag to master June 3, 2026 15:15
@snejus snejus force-pushed the introduce-source branch from 6db01e0 to c99d420 Compare June 3, 2026 15:16
snejus added 2 commits June 3, 2026 19:19
Introduce Source class to encapsulate metadata extraction logic,
replacing scattered `get_most_common_tags` calls throughout codebase.

- Add Source class in importer.tasks with artist, name, and metadata
- Simplify `distance()` to accept Source.data instead of items list
- Update display functions to use Source objects
- Cache Source creation with @cached_property on ImportTask

This centralizes metadata handling and reduces coupling between
autotag and library modules.
@snejus snejus force-pushed the introduce-source branch from c99d420 to de7492d Compare June 3, 2026 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Introduce Source Abstraction for Autotagger Context

3 participants