chore: upgrade Kubernetes deps and local images to 1.35 (incl. controller-runtime v0.23 and KAI-scheduler v0.15)#603

Open
yankay wants to merge 1 commit into
ai-dynamo:mainfrom
yankay:chore/602-upgrade-k8s-1.35

Conversation

@yankay

@yankay yankay commented May 12, 2026

Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Upgrades Grove's Kubernetes dependency baseline from 1.34 to 1.35.

  • k8s.io/*: v0.34.3 -> v0.35.5
  • sigs.k8s.io/controller-runtime: v0.22.4 -> v0.23.3
  • github.com/kai-scheduler/KAI-scheduler: v0.14.0 -> v0.15.2 (required for controller-runtime v0.23)
  • kindest/node: v1.34.3 -> v1.35.1
  • rancher/k3s: v1.34.2-k3s1 -> v1.35.5-k3s1

The installed KAI-scheduler version in docs and e2e tooling (docs/installation.md, operator/e2e/README.md, operator/hack/{e2e-cluster,infra_manager}/dependencies.yaml, operator/hack/e2e-cluster/create-e2e-cluster.py) is also bumped to v0.15.2.

KAI v0.15 changes PodGroupSpec.MinMember and SubGroup.MinMember from int32 to *int32; only the e2e SubGroup verifier needed to dereference the new pointer. Generated code was refreshed via make generate and make generate-api-docs.
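For illustration, a minimal sketch of the kind of dereference an int32 to *int32 change calls for. The `SubGroup` struct and the `minMemberOrZero` helper are hypothetical stand-ins rather than the real KAI types or Grove's verifier, and treating nil as 0 is an assumption made only for this example.

```go
package main

import "fmt"

// SubGroup is a hypothetical stand-in for the KAI SubGroup type; only the
// MinMember field discussed above is modelled here.
type SubGroup struct {
	Name      string
	MinMember *int32 // a plain int32 before KAI v0.15
}

// minMemberOrZero dereferences MinMember safely.
// Falling back to 0 on nil is an assumption made for this sketch.
func minMemberOrZero(sg SubGroup) int32 {
	if sg.MinMember == nil {
		return 0
	}
	return *sg.MinMember
}

func main() {
	want := int32(3)
	sg := SubGroup{Name: "workers", MinMember: &want}

	// Comparing the dereferenced value rather than the pointer itself.
	fmt.Println(minMemberOrZero(sg) == want)
	fmt.Println(minMemberOrZero(SubGroup{Name: "unset"}))
}
```

A verifier that compares `sg.MinMember` directly against an int32 stops compiling after the type change, which is why a single dereference can be enough.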

Two incidental changes that came along with the upgrade:

  • operator/go.mod: Go toolchain bumped 1.26.1 -> 1.26.3, and the now-unused operator/client require plus the k8s.io/kubelet replace directive were dropped by go mod tidy.
  • operator/e2e/setup/k8s_clusters.go: raised client-side rate limits (QPS 50 / Burst 100) above the client-go defaults (5/10); e2e polling loops were otherwise hitting client rate limiter Wait ... context deadline exceeded under the rolling/ondelete update tests on 1.35.
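To show why the defaults can starve a polling loop, here is a rough sketch of token-bucket arithmetic, the scheme client-go's default client-side limiter follows. The `minWait` helper and the 200-request workload are hypothetical and chosen only for illustration; this is not the code in k8s_clusters.go.

```go
package main

import (
	"fmt"
	"time"
)

// minWait returns the minimum total time a token-bucket limiter makes a
// client wait when issuing n back-to-back requests, starting from a full
// bucket of `burst` tokens that refills at `qps` tokens per second.
func minWait(n int, qps float64, burst int) time.Duration {
	if n <= burst {
		return 0 // the initial burst covers every request
	}
	secs := float64(n-burst) / qps
	return time.Duration(secs * float64(time.Second))
}

func main() {
	// Hypothetical workload: 200 API calls issued by e2e polling loops.
	fmt.Println("QPS 5 / Burst 10:  ", minWait(200, 5, 10))
	fmt.Println("QPS 50 / Burst 100:", minWait(200, 50, 100))
}
```

Under these assumptions the default settings add tens of seconds of pure throttling, which can be enough to push a polling loop past its context deadline, while the raised limits cut the wait to a couple of seconds.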

Which issue(s) this PR fixes:

Fixes #602

Special notes for your reviewer:

Local: go build ./..., go build -tags e2e ./e2e/..., go test ./..., go vet ./... all pass.

Related design PR: #605

Does this PR introduce an API change?

The following dependencies are updated:
- `k8s.io/*`: `v0.34.3` -> `v0.35.5`
- `sigs.k8s.io/controller-runtime`: `v0.22.4` -> `v0.23.3`
- `github.com/kai-scheduler/KAI-scheduler`: `v0.14.0` -> `v0.15.2`
- `kindest/node`: `v1.34.3` -> `v1.35.1`
- `rancher/k3s`: `v1.34.2-k3s1` -> `v1.35.5-k3s1`

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yankay

yankay commented May 12, 2026

Contributor Author

CI status: blocked on KAI-Scheduler upstream

All 7 failing E2E jobs share the same root cause, and it is not the kind/k3d 1.35 image bump itself. The k3d cluster (rancher/k3s:v1.35.4-k3s1) comes up fine; failure happens later when make run-e2e tries to compile the Go e2e test binary:

# github.com/kai-scheduler/KAI-scheduler/pkg/apis/scheduling/v2alpha2
.../kai-scheduler@v0.14.0/pkg/apis/scheduling/v2alpha2/podgroup_webhook.go:18:34:
  not enough arguments in call to ctrl.NewWebhookManagedBy
    have (controllerruntime.Manager)
    want (manager.Manager, T)
FAIL  github.com/ai-dynamo/grove/operator/e2e/tests [build failed]
make[1]: *** [Makefile:126: run-e2e] Error 1

controller-runtime v0.23 made ctrl.NewWebhookManagedBy generic (mgr, obj). operator/e2e/tests transitively imports github.com/kai-scheduler/KAI-scheduler v0.14.0, which still uses the v0.22 signature, so the e2e binary no longer compiles once we bump controller-runtime.
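The shape of the break can be reproduced in a small sketch. `Manager`, `PodGroup`, and `WebhookBuilder` below are hypothetical stand-ins, not the real controller-runtime or KAI types, and the `any` constraint is a simplification; only the (manager, object) argument shape is taken from the compiler error above.

```go
package main

import "fmt"

// Manager and PodGroup are hypothetical stand-ins for the real types.
type Manager struct{ Name string }
type PodGroup struct{}

// WebhookBuilder records the object type the webhook is built for.
type WebhookBuilder[T any] struct {
	mgr Manager
	obj T
}

// NewWebhookManagedBy mirrors the argument shape reported in the build
// error: a manager plus a typed object, i.e. (manager.Manager, T).
func NewWebhookManagedBy[T any](mgr Manager, obj T) *WebhookBuilder[T] {
	return &WebhookBuilder[T]{mgr: mgr, obj: obj}
}

func main() {
	mgr := Manager{Name: "e2e"}

	// Old-style call with only the manager. Against the generic
	// signature it fails to compile with "not enough arguments in call":
	// b := NewWebhookManagedBy(mgr)

	// New-style call: the object is passed up front and T is inferred.
	b := NewWebhookManagedBy(mgr, &PodGroup{})
	fmt.Printf("%T\n", b)
}
```

Because the failing call sits inside the dependency's own source, it cannot be patched from the importing module; the importer needs a dependency version that already uses the two-argument form.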

KAI-Scheduler side

The fix is already merged on main of kai-scheduler/KAI-Scheduler:

But no released tag contains it yet — the latest tags v0.14.0 / v0.14.1 / v0.14.2 and the v0.14 release branch are all still on controller-runtime v0.22.3. The change is sitting in CHANGELOG.md under [Unreleased].

Proposed path

Two options, in order of preference:

  1. Wait for the next KAI-Scheduler minor release (presumably v0.15.0) and bump to that here. There is no public ETA yet — happy to ping upstream to ask.
  2. Temporarily pin to kai-scheduler/KAI-Scheduler main via go get github.com/kai-scheduler/KAI-scheduler@1b591f419a01 so this PR is not held up, with a follow-up to swap back to a proper semver once v0.15.0 is cut.

Marking the PR as pending upstream while we decide. Suggestions welcome.

@danbar2

danbar2 commented May 12, 2026

Contributor

Thanks for raising this PR!
Option 1 is the correct one; we should wait for KAI to upgrade as well.
cc: @enoodle

@renormalize renormalize left a comment

Collaborator

https://github.com/kai-scheduler/KAI-Scheduler/releases/tag/v0.15.0 is out. Can the PR be rebased with the kai-scheduler bump? Thanks.

@yankay yankay force-pushed the chore/602-upgrade-k8s-1.35 branch 2 times, most recently from b9bdefd to b72fd55 Compare May 25, 2026 12:45
@yankay

yankay commented May 25, 2026

Contributor Author

Hi @renormalize

Done. Rebased on main and bumped KAI-scheduler from v0.14.0 to v0.15.0.

@yankay yankay changed the title chore: upgrade Kubernetes dependencies and local images to 1.35 chore: upgrade Kubernetes deps and local images to 1.35 (incl. controller-runtime v0.23 and KAI-scheduler v0.15) May 25, 2026
@yankay yankay force-pushed the chore/602-upgrade-k8s-1.35 branch from b72fd55 to 8b64e6a Compare May 26, 2026 06:57
@renormalize renormalize force-pushed the chore/602-upgrade-k8s-1.35 branch from 9d9e1af to 59c8a49 Compare May 26, 2026 09:54
@renormalize

Collaborator

@yankay I took the liberty of syncing the dependencies across the multiple Go modules in this repository, as they were out of sync after your rebased commit.

We can merge once the E2E tests pass.

@yankay

yankay commented May 26, 2026

Contributor Author

Thanks @renormalize for the cross-module dep sync :-)

renormalize previously approved these changes May 26, 2026
@enoodle

enoodle commented May 27, 2026

Contributor

@yankay Do you also want to update the installed KAI version in the docs / e2e scripts to v0.15.0?

@yankay

yankay commented May 27, 2026

Contributor Author

Thanks @enoodle! Good catch: I bumped the installed KAI version to v0.15.0 in docs and e2e tooling. PTAL 🙏

Waiting for the fix to be released in KAI v0.15.1 or v0.16.0, then we can re-run and confirm this PR.

@renormalize

Collaborator

@danbar2 any idea on when KAI plans to release v0.15.1 or v0.16.0?

@enoodle

enoodle commented Jun 8, 2026

Contributor

@renormalize I will work on including this fix in the v0.15.1 patch release of KAI.
On the Grove side, we can advance #524 to disengage from KAI's logic for creating the PodGroups.

@enoodle

enoodle commented Jun 11, 2026

Contributor

@renormalize @yankay KAI v0.15.2 contains the fix you referenced. Can you update to use that?

@yankay yankay force-pushed the chore/602-upgrade-k8s-1.35 branch from 6ca589f to cd75d03 Compare June 11, 2026 08:19
@yankay yankay force-pushed the chore/602-upgrade-k8s-1.35 branch from cd75d03 to 24ffe87 Compare June 11, 2026 08:27
Update Grove's Kubernetes dependency baseline from 1.34 to 1.35.

- Bump k8s.io/* to v0.35.5 and controller-runtime to v0.23.3.
- Bump KAI Scheduler to v0.15.2 for controller-runtime v0.23 compatibility.
- Bump local kind and k3s images to the 1.35 line.
- Raise the operator module Go directive to 1.26.3 for the updated KAI Scheduler dependency.
- Refresh generated clients, CRDs, docs, and e2e install pins.

Co-authored-by: Saketh Kalaga <saketh.kalaga@sap.com>
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
@yankay yankay force-pushed the chore/602-upgrade-k8s-1.35 branch 2 times, most recently from f8b9ea1 to c754169 Compare June 11, 2026 10:00
Development

Successfully merging this pull request may close these issues.

Upgrade Kubernetes dependencies and kind image to 1.35