Fix crash when running FSDP2+TP #43226

Merged
ArthurZucker merged 5 commits into huggingface:main from sywangyi:fsdp+tp on Jan 14, 2026

Conversation

@sywangyi

Contributor

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@sywangyi

Contributor Author

Running https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/nd_parallel_trainer.py with `accelerate launch --config-file configs/tp_hsdp.yaml nd_parallel_trainer.py` crashes with:
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel_trainer.py", line 82, in
[rank1]: main()
[rank1]: File "/mnt/disk3/wangyi/accelerate/examples/torch_native_parallelism/nd_parallel_trainer.py", line 77, in main
[rank1]: trainer.train()
[rank1]: File "/mnt/disk3/wangyi/transformers/src/transformers/trainer.py", line 2174, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/disk3/wangyi/transformers/src/transformers/trainer.py", line 2339, in _inner_training_loop
[rank1]: self.optimizer = self.accelerator.prepare(self.optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/mnt/disk3/wangyi/accelerate/src/accelerate/accelerator.py", line 1515, in prepare
[rank1]: raise ValueError(
[rank1]: ValueError: When using FSDP2, a model and optimizer must be passed together to Accelerator.prepare() as the optimizer needs to have its parameters modified after the model is converted.
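
For context on the rule the error message states, here is a toy sketch of it. The traceback shows the Trainer calling `self.accelerator.prepare(self.optimizer)` on its own, which FSDP2 rejects. `ToyAccelerator`, `ToyModel`, `ToyOptimizer`, `prepare_separately`, and `prepare_together` are hypothetical stand-ins written for illustration; they are not accelerate or transformers APIs and do not reproduce the real `Accelerator.prepare()` logic.

```python
# Toy illustration only: these classes are hypothetical stand-ins and do not
# reproduce accelerate's real Accelerator.prepare() implementation.


class ToyModel:
    """Placeholder for a model."""


class ToyOptimizer:
    """Placeholder for an optimizer that holds references to parameters."""

    def __init__(self, params):
        self.params = list(params)


class ToyAccelerator:
    def __init__(self, is_fsdp2):
        self.is_fsdp2 = is_fsdp2

    def prepare(self, *objs):
        has_model = any(isinstance(o, ToyModel) for o in objs)
        has_optimizer = any(isinstance(o, ToyOptimizer) for o in objs)
        # Mirror the rule stated in the traceback: under FSDP2 an optimizer
        # cannot be prepared without its model in the same call.
        if self.is_fsdp2 and has_optimizer and not has_model:
            raise ValueError(
                "When using FSDP2, a model and optimizer must be passed "
                "together to Accelerator.prepare()"
            )
        return objs[0] if len(objs) == 1 else objs


def prepare_separately(accelerator, model, optimizer):
    # The call pattern seen in the traceback: the optimizer goes in alone.
    model = accelerator.prepare(model)
    optimizer = accelerator.prepare(optimizer)
    return model, optimizer


def prepare_together(accelerator, model, optimizer):
    # The pattern the error message asks for: one call with both objects.
    return accelerator.prepare(model, optimizer)
```

With `ToyAccelerator(is_fsdp2=True)`, `prepare_separately` raises the `ValueError` and `prepare_together` returns both objects. The diff itself is not shown in this thread, so this sketch does not claim to be the change that was merged.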

@SunMarc SunMarc left a comment

Member

Thanks! Just a nit and we can merge.

Comment thread src/transformers/trainer.py Outdated
Comment thread src/transformers/trainer.py Outdated
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@3outeille

Member

Making FSDP2 + TP integration smoother is on the roadmap. Almost done with TP, going to tackle FSDP2 next.

@SunMarc SunMarc left a comment

Member

Thanks a lot!

@github-actions

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43226&sha=754498

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker left a comment

Collaborator

Thanks!

@ArthurZucker ArthurZucker merged commit 77146cc into huggingface:main Jan 14, 2026
21 of 25 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* fix crash in when running FSDP2+TP

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

* update

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

* fix tp only issue

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
@sywangyi sywangyi deleted the fsdp+tp branch May 20, 2026 01:27

5 participants