
MoE code fixes #453

Merged
rrutmann merged 4 commits into moe from moe_fixes on Jun 18, 2026

@gbesposito

Collaborator

What does this PR do?

This PR fixes several issues in the MoE (Mixture of Experts) code.

General Changes

  • Removes the torchtitan dependency by integrating a torch-native Expert Parallelism (EP) class, inspired by torchtitan's implementation
  • Fixes a bug in the MixedPrecision policy that kept the experts' parameters in FP32 instead of casting them
  • Replaces the TP-dimension workaround by introducing a first-class EP dimension into the device mesh
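
What a first-class EP dimension gives each rank can be illustrated with a small, framework-free sketch. This is not the code from this PR: the `MeshLayout` class, the dimension names `dp` and `ep`, and the 2 x 4 sizes are hypothetical, and the row-major rank ordering (last dimension varies fastest) is an assumption made for illustration.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MeshLayout:
    """Hypothetical stand-in for a device mesh with named dimensions."""

    dim_names: tuple
    dim_sizes: tuple

    def __post_init__(self):
        if len(self.dim_names) != len(self.dim_sizes):
            raise ValueError("each mesh dimension needs exactly one size")
        if any(size < 1 for size in self.dim_sizes):
            raise ValueError("mesh dimension sizes must be positive")

    @property
    def world_size(self):
        # The mesh must cover every rank exactly once.
        return prod(self.dim_sizes)

    def coords(self, rank):
        """Map a global rank to its coordinate along each named dimension."""
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} is outside the mesh")
        remaining = rank
        found = {}
        # Assumed row-major layout: peel off the fastest-varying dimension first.
        for name, size in zip(reversed(self.dim_names), reversed(self.dim_sizes)):
            remaining, index = divmod(remaining, size)
            found[name] = index
        return {name: found[name] for name in self.dim_names}

    def group_ranks(self, rank, dim):
        """Ranks that differ from `rank` only along `dim` (its process group)."""
        if dim not in self.dim_names:
            raise ValueError(f"unknown mesh dimension: {dim}")
        fixed = self.coords(rank)
        return [
            other
            for other in range(self.world_size)
            if all(
                self.coords(other)[name] == fixed[name]
                for name in self.dim_names
                if name != dim
            )
        ]


# Hypothetical 8-rank job: 2-way data parallel, 4-way expert parallel.
layout = MeshLayout(dim_names=("dp", "ep"), dim_sizes=(2, 4))
print(layout.coords(5))
print(layout.group_ranks(5, "ep"))
```

With a named `ep` dimension, the ranks that exchange tokens for expert routing can be looked up by name rather than by reusing the TP dimension. In PyTorch itself, `torch.distributed.device_mesh.init_device_mesh` accepts a `mesh_dim_names` argument for naming dimensions in this way; whether the PR builds its mesh through that call is not stated here.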

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@gbesposito requested a review from rrutmann on June 17, 2026 07:08

@rrutmann left a comment

Collaborator

LGTM :)

@rrutmann merged commit ee2c5dc into moe on Jun 18, 2026
3 checks passed