TL;DR
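Pod replacement policy for Jobs is GA in Kubernetes v1.34. The Job field podReplacementPolicy takes one of two values: TerminatingOrFailed (the default), which creates a replacement as soon as a Pod starts terminating, or Failed, which waits until the Pod has fully terminated and failed. The Failed setting helps workloads that cannot tolerate two Pods for the same task at once, such as TensorFlow and JAX jobs.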
What happened
In Kubernetes v1.34, the Pod Replacement Policy feature reaches general availability. A Job can now set a policy that controls when the Job controller creates a replacement Pod: either as soon as an existing Pod begins terminating, or only after that Pod has reached the Failed phase.
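The difference between the two policies can be sketched with a small model. This is a simplified illustration, not the Job controller's actual logic: the names PodCounts and replacements_needed are hypothetical, and the model ignores completions, backoff limits, and other factors the real controller considers.

```python
from dataclasses import dataclass


@dataclass
class PodCounts:
    """Hypothetical snapshot of a Job's Pods (illustration only)."""
    active: int       # Pods that are pending or running and not being deleted
    terminating: int  # Pods being deleted that have not yet reached a terminal phase


def replacements_needed(parallelism: int, counts: PodCounts, policy: str) -> int:
    """Return how many new Pods this simplified model would create."""
    if policy == "TerminatingOrFailed":
        # Terminating Pods no longer hold a slot, so they are replaced right away.
        occupied = counts.active
    elif policy == "Failed":
        # Terminating Pods still hold a slot until they have fully terminated.
        occupied = counts.active + counts.terminating
    else:
        raise ValueError(f"unknown policy: {policy}")
    return max(0, parallelism - occupied)


# One of three Pods has started terminating.
snapshot = PodCounts(active=2, terminating=1)
eager = replacements_needed(3, snapshot, "TerminatingOrFailed")
patient = replacements_needed(3, snapshot, "Failed")
print(eager, patient)
```

In this model the default policy starts a replacement while the old Pod is still shutting down, whereas Failed waits, so the Job never runs more Pods than its parallelism.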
Why it matters for ops
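Under the default policy, a terminating Pod and its replacement can run at the same time. That overlap can cause duplicate task registrations in frameworks such as TensorFlow and JAX, and the extra Pod can trigger an unnecessary cluster scale-up. Setting the policy to Failed avoids the overlap, which makes Job execution more predictable for workloads with strict one-worker-per-task requirements.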
Action items
- Review the Kubernetes documentation on Pod Replacement Policy to understand how it can improve your Job management strategies.
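- Consider setting podReplacementPolicy to Failed in Jobs where exactly one worker per index is required, to prevent task registration conflicts.
- Monitor cluster resource usage and the Job's status.terminating count to confirm that the chosen policy fits your workload; with Failed, a replacement waits for the old Pod to finish terminating, so recovery can take longer.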
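As a minimal sketch of the second action item, the snippet below builds an Indexed Job manifest with podReplacementPolicy set to Failed and prints it as JSON, which kubectl accepts in place of YAML. The helper build_job_manifest, the Job name, and the image reference are hypothetical placeholders, and the parallelism and completions values are illustrative assumptions.

```python
import json


def build_job_manifest(name: str, image: str, workers: int, policy: str = "Failed") -> dict:
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    if policy not in ("TerminatingOrFailed", "Failed"):
        raise ValueError(f"unsupported podReplacementPolicy: {policy}")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",     # each Pod gets a stable completion index
            "completions": workers,
            "parallelism": workers,
            "podReplacementPolicy": policy,  # replace only after the Pod has failed
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


# Placeholder name and image; substitute your own values.
manifest = build_job_manifest("example-training-job", "registry.example.com/trainer:latest", 4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, after reviewing it against the Job API reference for your cluster version.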
Source link
https://kubernetes.io/blog/2025/09/05/kubernetes-v1-34-pod-replacement-policy-for-jobs-goes-ga/