[inductor] Multi-output custom op ExternKernelOut lowering#176117
[inductor] Multi-output custom op ExternKernelOut lowering#176117tianrengao wants to merge 10 commits into
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176117
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit d8993a9 with merge base f0bae19 ( UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
eellison
left a comment
There was a problem hiding this comment.
would you mind combining the above test coverage pr with this pr
Switch the scaled_fp4_quant Python wrapper from calling .out (with pre-allocated buffers) to the functional variant (C++ allocates internally). This enables PyTorch Inductor (2.12+) to automatically discover the .out overload via to_out_variant() and manage output buffers for buffer reuse optimization. Depends on: - pytorch/pytorch#175116 (custom op out-variant lowering hook) - pytorch/pytorch#176117 (multi-output support) Signed-off-by: tianrengao <terrygao87@gmail.com>
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: a7014a2 Pull Request resolved: #176117
#175116 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: a8748a4 Pull Request resolved: #176117
#175116 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: 169dc39 Pull Request resolved: #176117
#175116 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: 0e109f6 Pull Request resolved: #176117
|
|
||
| def _codegen_input_args(node: ir.ExternKernel) -> list[str]: | ||
| """Codegen positional + constant args, excluding out args.""" | ||
| assert ir.is_node_sequence(node.inputs) | ||
| args = [x.codegen_reference() for x in node.inputs] # type: ignore[union-attr] | ||
| for const in node.constant_args: | ||
| args.append(V.graph.wrapper_code.val_to_arg_str(const)) | ||
| return args | ||
|
|
||
|
|
||
| def _codegen_kwargs(node: ir.ExternKernel, skip_names: OrderedSet[str]) -> list[str]: | ||
| """Codegen keyword args, skipping out arg names (appended separately).""" | ||
| result = [] | ||
| for k, v in node.kwargs.items(): | ||
| if k not in skip_names: | ||
| result.append(f"{k}={V.graph.wrapper_code.val_to_arg_str(v)}") | ||
| return result |
There was a problem hiding this comment.
Can we normalize this, with the rest of the codebase ? see:
pytorch/torch/_inductor/codegen/wrapper.py
Line 584 in c95cc4d
this is in a completely different place.
There was a problem hiding this comment.
Done. Added ExternKernelMultiOutLine(WrapperLine) right after ExternKernelOutLine in wrapper.py and removed the entire custom_op_out_lowering.py
| return torch.ops.mylib.split_add, torch.ops.mylib.split_add.out | ||
|
|
||
| @parametrize("device", DEVICES) | ||
| def test_multi_output_lowered_to_out(self, device): |
There was a problem hiding this comment.
Does this work with cpp wrapper ? can we either add support for this or fallback if its not supported ?
There was a problem hiding this comment.
Added cpp_wrapper guard. multioutput .out lowering is skipped under cpp wrapper, falling back to standard FallbackKernel codegen.
cpp wrapper support can be added later when needed.
| def try_lower_multi_output_to_out_variant( | ||
| kernel: OpOverload, | ||
| example_output: Any, | ||
| packed: ir.FallbackKernel, | ||
| *, | ||
| has_unaligned_input: bool = False, | ||
| ) -> Optional[Sequence[AllocatingMultiOutput]]: |
There was a problem hiding this comment.
if it doesn't make sense to refactor ExternKernelOut to work with the new changes (it may not,)can you add a new class ExternKernelMultiOut or something? and normalize that to where ExternKernel is in the codebase.
There was a problem hiding this comment.
Yes, ExternKernelOut can't be extended for multi-output because it's a single-buffer node.
Added ExternKernelMultiOut in ir.py. It inherits FallbackKernel to reuse codegen_args/unflatten_args infrastructure, and overrides codegen() to emit .out() calls. Also deleted custom_op_out_lowering.py
#175116 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo mlazos [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: 70ba508 Pull Request resolved: #176117
#175116 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo mlazos [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: 95cbcd6 Pull Request resolved: #176117
#175116 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo mlazos [ghstack-poisoned]
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: 61e5d34 Pull Request resolved: #176117
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 1 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Extends PR1 single-output lowering to handle multi-output functional custom ops. Adds CustomOpMultiOutputNode (packed node that emits the .out() call) and AllocatingMultiOutput (MultiOutput child with should_allocate=True for buffer reuse). Out arg names from the schema (e.g. out0, out1, output, output_scale) are used in codegen. ghstack-source-id: 0e1d843 Pull Request resolved: pytorch/pytorch#176117
The out_variant tag is not yet available in torch==2.10.0. See pytorch/pytorch#176117. Signed-off-by: tianrengao <terrygao87@gmail.com>
…76117) pytorch#175116 Pull Request resolved: pytorch#176117 Approved by: https://github.com/eellison
Stack from ghstack (oldest at bottom):
[inductor] Add out-variant discovery + custom op lowering entry point #175116
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos