Add synonym expansion to component lexical search #2425

Open
Mbeaulne wants to merge 1 commit into 06-18-normalize_component_search_tokens_for_better_matching from 06-18-add_synonym_groups

@Mbeaulne Mbeaulne commented Jun 18, 2026

Collaborator

Description

Adds a synonym expansion system to the component lexical search so that queries using common aliases resolve to the intended components. For example, searching gcs now surfaces storage-related components, fit surfaces training components, infer surfaces prediction components, and df surfaces dataframe/table components.

A new componentSearchSynonyms.ts module defines synonym groups (e.g. gcs ↔ storage ↔ bucket, train ↔ fit, predict ↔ infer, df ↔ dataframe ↔ table) and exposes expandSynonymTokens, which fans out any recognized token into all members of its group.
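
As a rough illustration of the module described above, a minimal sketch follows. Only the four example groups and the name expandSynonymTokens come from this description; the lookup table (SYNONYM_LOOKUP), the exact signature, and the ordering/deduplication behaviour are assumptions made for the example and may differ from the actual diff.

```typescript
// Hypothetical sketch, not the code in this PR. The groups mirror the
// examples in the description; everything else is assumed.
const SYNONYM_GROUPS: readonly (readonly string[])[] = [
  ["gcs", "storage", "bucket"],
  ["train", "fit"],
  ["predict", "infer"],
  ["df", "dataframe", "table"],
];

// Assumed helper: map every term to its full group, built once at load time.
const SYNONYM_LOOKUP = new Map<string, readonly string[]>();
for (const group of SYNONYM_GROUPS) {
  for (const term of group) {
    SYNONYM_LOOKUP.set(term, group);
  }
}

// Fan each recognized token out to all members of its group and pass
// unrecognized tokens through unchanged. Duplicates are dropped and
// first-seen order is kept.
function expandSynonymTokens(tokens: readonly string[]): string[] {
  const expanded = new Set<string>();
  for (const token of tokens) {
    const group = SYNONYM_LOOKUP.get(token);
    if (group) {
      for (const member of group) expanded.add(member);
    } else {
      expanded.add(token);
    }
  }
  return Array.from(expanded);
}

console.log(expandSynonymTokens(["gcs", "upload"]));
// → ["gcs", "storage", "bucket", "upload"]
```

Because every member maps to the same group, the expansion is symmetric: fit expands to the same set as train.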

The search pipeline was also refactored to separate base tokenization (baseSearchTokens) from the full normalized text used for document indexing. Synonym expansion is applied to query tokens before scoring, and the phrase-match bonus now uses the original (pre-expansion) token sequence so multi-word phrase matching remains accurate.
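
The query path described above could look roughly like the sketch below. The names baseSearchTokens and expandSynonymTokens come from this description, and the +10 phrase bonus and +2 description weight are figures quoted in the review discussion of this PR. The tokenizer rule, the name weight, the ComponentDoc shape, and the scoreQuery function are assumptions for illustration only.

```typescript
// Hypothetical sketch of query-side expansion with a synonym-free phrase bonus.
interface ComponentDoc {
  name: string;
  description: string;
}

// Assumed miniature synonym table (train ↔ fit only).
const SYNONYMS = new Map<string, string[]>([
  ["train", ["train", "fit"]],
  ["fit", ["train", "fit"]],
]);

// Base tokenization: lowercase, split on anything that is not a letter or digit.
function baseSearchTokens(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

function expandSynonymTokens(tokens: string[]): string[] {
  const out = new Set<string>();
  for (const t of tokens) {
    for (const s of SYNONYMS.get(t) ?? [t]) out.add(s);
  }
  return Array.from(out);
}

const NAME_WEIGHT = 5; // assumed value
const DESCRIPTION_WEIGHT = 2; // matches the +2 quoted in review
const PHRASE_BONUS = 10; // matches the +10 quoted in review

function scoreQuery(query: string, doc: ComponentDoc): number {
  const phraseTokens = baseSearchTokens(query); // original sequence, no synonyms
  const queryTokens = expandSynonymTokens(phraseTokens); // used for per-token scoring
  const nameTokens = baseSearchTokens(doc.name);
  const nameSet = new Set(nameTokens);
  const descSet = new Set(baseSearchTokens(doc.description));

  let score = 0;
  for (const token of queryTokens) {
    if (nameSet.has(token)) score += NAME_WEIGHT;
    if (descSet.has(token)) score += DESCRIPTION_WEIGHT;
  }

  // The phrase bonus checks the pre-expansion tokens, so a synonym can
  // never manufacture a contiguous name match.
  const paddedName = " " + nameTokens.join(" ") + " ";
  const paddedPhrase = " " + phraseTokens.join(" ") + " ";
  if (phraseTokens.length > 1 && paddedName.includes(paddedPhrase)) {
    score += PHRASE_BONUS;
  }
  return score;
}

const split: ComponentDoc = {
  name: "train_test_split",
  description: "Split a dataset into train and test sets",
};
console.log(scoreQuery("train test split", split)); // → 31 (21 token score + 10 bonus)
console.log(scoreQuery("fit test split", split)); // → 21 (synonym matches, no bonus)
```

In this sketch the second query still matches train through the synonym, but it earns no phrase bonus because the literal sequence "fit test split" does not appear in the component name.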

Related Issue and Pull requests

Type of Change

  • Bug fix
  • New feature
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

  1. Open the component search panel.
  2. Search for gcs and confirm storage/bucket components appear at the top.
  3. Search for fit and confirm model training components appear.
  4. Search for infer and confirm prediction components appear.
  5. Search for df and confirm dataframe/table components appear.
  6. Verify that multi-word phrase matching (e.g. train test split) still correctly ranks exact name matches above partial matches.

Additional Comments

The synonym groups are intentionally domain-neutral and kept in a single flat list in componentSearchSynonyms.ts so that additional aliases are easy to add later. This is not an exhaustive list.

@Mbeaulne Mbeaulne changed the title from "add synonym groups" to "Add synonym expansion to component lexical search" Jun 18, 2026
github-actions Bot commented Jun 18, 2026


🎩 Preview

A preview build has been created at: 06-18-add_synonym_groups/f5a29c0

Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts
Comment thread src/services/componentSearchSynonyms.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from 2655160 to dce82a1 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-add_synonym_groups branch from dce82a1 to f5a29c0 Compare June 18, 2026 20:28
@camielvs

Collaborator

🤖 Code review — Add synonym expansion to component lexical search

The architecture here is the strong part: synonyms expand on the query side only, and the comment explaining why (expanding the index too would intersect a query token set against a ballooned index token set and surface components matching neither literal text nor intent) is exactly right. The single-word-key guard and the baseSearchTokens/phraseTokens refactor are clean, and routing the contiguous-name +10 bonus through phraseTokens (synonym-free) is the correct call.

Findings:

  • Multi-word synonym entries are inert dead data. singleWordTerms filters group members to /^[a-z0-9]+$/, and the expansion target is also singleWordTerms. So "cloud storage", "object storage", "data frame", "comma separated", "language model", "chat model" are neither keys nor targets — they never participate in matching at all. A reader scanning SYNONYM_GROUPS will reasonably assume gcs → "cloud storage" works; it doesn't. Either drop them or add a comment that they're documentation-only placeholders for a future phrase-aware pass.

  • Synonym expansion stacks scores within a group. scoreEntry adds field weight per matched token, and an expanded query injects every group member as a token. A component whose description contains several members of one group accumulates weight for each: e.g. description "train and fit the model", query train → [train, fit, training, trainer] → +2 (train) +2 (fit) on the description field. Layered on top of the stem double-count from #2424 (Normalize component search tokens for better matching), scoring increasingly rewards surface-variant density over relevance. Worth collapsing to one contribution per synonym group per field (hopefully the territory of #2426, Improve component search scoring relevance).

  • A few synonyms are broad enough to inject noise. table ↔ df/dataframe, score ↔ predict/infer, vector ↔ embed. table, score, and vector are common standalone terms with meanings outside this group (a UI table, an eval score, a math vector). Query-side-only expansion caps the blast radius, but someone searching table now also pulls dataframe/df components. Probably acceptable — just confirm it matches user intent.

  • Nit: phraseTokens still interleaves stem variants (training testing → "training train testing test"), so the contiguous-name +10 bonus won't fire for inflected multi-word queries. The common non-inflected case (train test split → train_test_split) still works fine; only inflected multi-word phrases miss the bonus.
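
To make the score-stacking finding concrete, here is a minimal sketch contrasting per-token scoring with one contribution per synonym group per field. It is not the scoreEntry implementation from this PR. The group table (GROUP_OF), both function names, and the treatment of ungrouped tokens are assumptions; the +2 description weight and the train example are taken from the finding above.

```typescript
// Hypothetical sketch: stacked versus collapsed scoring on a single field.
const DESCRIPTION_WEIGHT = 2;

// Assumed lookup from token to a synonym-group id.
const GROUP_OF = new Map<string, string>([
  ["train", "train"],
  ["fit", "train"],
  ["training", "train"],
  ["trainer", "train"],
]);

// Current behaviour as described in the finding: every matched token adds weight.
function stackedScore(queryTokens: string[], fieldTokens: Set<string>): number {
  let score = 0;
  for (const token of queryTokens) {
    if (fieldTokens.has(token)) score += DESCRIPTION_WEIGHT;
  }
  return score;
}

// Suggested behaviour: each synonym group contributes at most once per field.
// A token with no group is treated as its own single-member group.
function collapsedScore(queryTokens: string[], fieldTokens: Set<string>): number {
  const credited = new Set<string>();
  let score = 0;
  for (const token of queryTokens) {
    if (!fieldTokens.has(token)) continue;
    const groupId = GROUP_OF.get(token) ?? token;
    if (credited.has(groupId)) continue;
    credited.add(groupId);
    score += DESCRIPTION_WEIGHT;
  }
  return score;
}

const description = new Set(["train", "and", "fit", "the", "model"]);
const expandedQuery = ["train", "fit", "training", "trainer"];

console.log(stackedScore(expandedQuery, description)); // → 4
console.log(collapsedScore(expandedQuery, description)); // → 2
```

Collapsing keeps the recall benefit of expansion (either train or fit in the description is enough to match) while removing the reward for repeating surface variants.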

Synonym list itself is sensible domain coverage for an ML component library.
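
For the first finding, a short sketch of why multi-word entries end up inert. The /^[a-z0-9]+$/ filter and the name singleWordTerms are quoted from the finding; the example group membership and the inertTerms helper are assumptions for illustration.

```typescript
// Hypothetical sketch: the single-word filter silently drops multi-word entries.
const SINGLE_WORD = /^[a-z0-9]+$/;

// Assumed group layout; the real SYNONYM_GROUPS may organise these differently.
const storageGroup = ["gcs", "storage", "bucket", "cloud storage", "object storage"];

// Terms that can act as expansion keys and targets.
function singleWordTerms(group: readonly string[]): string[] {
  return group.filter((term) => SINGLE_WORD.test(term));
}

// Assumed helper: terms that never take part in matching.
function inertTerms(group: readonly string[]): string[] {
  return group.filter((term) => !SINGLE_WORD.test(term));
}

console.log(singleWordTerms(storageGroup)); // → ["gcs", "storage", "bucket"]
console.log(inertTerms(storageGroup)); // → ["cloud storage", "object storage"]
```

A check like inertTerms could back a unit test or a lint rule that flags entries the matcher will never use, if the multi-word terms are kept as placeholders.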
