Add bf16-f32-vcvt kernels for avx512/avx512bf16 #10023
Merged
copybara-service[bot] merged 2 commits on Apr 28, 2026
dsharlet approved these changes on Apr 22, 2026
Add AVX512 and AVX512_BF16 kernels for f32<->bf16 vcvt. The non-_BF16 f32->bf16 kernels match the scalar kernel's rounding logic, so they should be strictly correct, including for NaNs and Infs.
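For reviewers who want the rounding behavior spelled out, here is a minimal scalar sketch of a round-to-nearest-even f32->bf16 conversion and the matching bf16->f32 widening. This is not copied from the XNNPACK scalar kernel; the function names `f32_to_bf16_rne` and `bf16_to_f32` are hypothetical, and the NaN handling (truncate and set the quiet bit) is one common convention that I am assuming here for illustration.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: convert f32 to bf16 with round-to-nearest-even.
// Illustrative only; not the actual XNNPACK scalar kernel.
static uint16_t f32_to_bf16_rne(float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);  // bit-cast without aliasing issues

  // NaN: exponent all ones and a nonzero mantissa. Truncate, and force the
  // quiet bit so a payload living only in the low 16 bits cannot turn into Inf.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
    return (uint16_t)((bits >> 16) | 0x0040u);
  }

  // Add 0x7FFF plus the lowest kept bit: exact ties round to the even result.
  // Inf is unchanged, and values just below the f32 maximum round up to Inf.
  uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
  return (uint16_t)((bits + rounding_bias) >> 16);
}

// Hypothetical helper: bf16 to f32 is exact, since bf16 is the top 16 bits
// of the f32 encoding.
static float bf16_to_f32(uint16_t h) {
  uint32_t bits = (uint32_t)h << 16;
  float x;
  memcpy(&x, &bits, sizeof x);
  return x;
}
```

A vectorized version can apply the same bias-and-shift to each 32-bit lane with integer ops and blend in the NaN result under a mask, which is why matching the scalar logic lane by lane is enough for bit-exact results.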
Performance appears to hit the memory wall at larger sizes, but the native BF16 convert pulls ahead significantly (~2x) at smaller sizes. Overall, BF16 conversion throughput generally tracks FP16 on Genoa.
I verified that tests pass with CMake on x86, that the Bazel build of //:XNNPACK succeeds on x86, and that the CMake build succeeds on ARM Mac.
Benchmarks
AMD Genoa
bf16 -> f32 (GB/s)
f32 -> bf16 (GB/s)