Disclaimer: As always, the opinions in this post are my own and do not necessarily reflect those of my employer. At work we are currently running a large number of clusters in production and the overall experience has been positive. However, there are some drawbacks and limitations that should be considered carefully before moving production workloads to the IBM-Cloud Kubernetes Service.
A Quick Look at the Good & Bad
The following is a first overview of positive and negative experiences when working with the IBM-Cloud Kubernetes Service. Each of these points is explained and elaborated on in detail below, to avoid any confusion.
Positive 👍
- fairly stable data-plane
- managed masters and workers
- multi-zone worker pools
- and a lot more…
Negative / Limitations ⚠️
- some components missing high availability
- control plane issues (e.g. IBM-Cloud dashboard, Kubernetes dashboard)
- IBM-Cloud IAM not completely reflected in Kubernetes RBAC
- ingress controller limitations
So let's start off with the positive things!
A Fairly Stable Data-Plane
End-to-end tests confirm an uptime of almost 100% for running workloads across all clusters. This has been measured with an external monitoring solution that pings public ingress endpoints exposing an application with three replicas. In the past months there was only one short interruption in the eu-de region, which impacted a limited number of workloads for a few minutes.
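For reference, the monitored setup looks roughly like the following minimal sketch; the names, image and hostname are placeholders and do not reflect the actual production configuration:

```yaml
# Minimal sketch of a monitored workload: three replicas behind a public ingress.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: uptime-sample
spec:
  replicas: 3                       # the three replicas checked by the external monitoring
  selector:
    matchLabels:
      app: uptime-sample
  template:
    metadata:
      labels:
        app: uptime-sample
    spec:
      containers:
      - name: web
        image: nginx:1.15
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: uptime-sample
spec:
  selector:
    app: uptime-sample
  ports:
  - port: 80
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: uptime-sample
spec:
  rules:
  - host: sample.example.com        # placeholder hostname pinged by the external checks
    http:
      paths:
      - path: /
        backend:
          serviceName: uptime-sample
          servicePort: 80
```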
Managed Masters and Workers
The masters are completely managed by IBM and do not have to be configured in any way. Updates of the masters, however, have to be triggered manually by the user via the API. The number and size of worker nodes is specified by the user, and updates are handled in the same way as for the masters. The provisioning process usually takes about 3 minutes per node and does not require any user intervention or operations at all. Nevertheless, it is recommended to monitor the worker nodes closely for operating-system metrics like disk usage or CPU and memory consumption, as worker scaling is not done automatically yet.
Multi-Zone Worker Pools
Multi-zone worker pools have been supported for about a month now. They enable the user to specify a desired number of workers of a specific size across multiple zones. This simplifies the provisioning of large numbers of workers in different zones and ensures high availability of workloads, even in case of a disaster in a specific zone. Public connections from the outside are automatically routed to healthy zones via Kubernetes ingresses and load balancers powered by CloudFlare.
All these things show that IBM-Cloud Kubernetes is ready for production; however, the following limitations should be considered before moving any workloads.
Components Missing High Availability
While the public ingress is highly available (HA), the same does not apply to private ingresses, which are only reachable within the private network. Multiple load balancers for private ingresses can be enabled, but then the user has to update the DNS entries for routing to the “live” region. Using the external-dns project can help with updating DNS entries, however this will not enable a completely instant and automated fail-over. Additionally, it seems like the masters are not yet fully HA, which would explain some issues when interacting with the Kubernetes API server during maintenance windows. This is of course not highly critical, as a downtime of the master generally does not impact running workloads.
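As a rough illustration of the external-dns approach, a load-balancer service can be annotated so that external-dns keeps the corresponding DNS record pointed at it. The hostname and names below are placeholders, and this sketch does not reflect the exact IBM-Cloud private ALB setup; it only shows the annotation mechanism:

```yaml
# Sketch: external-dns picks up the annotation and manages the DNS record
# for the load-balancer address, but the fail-over is still not instant.
apiVersion: v1
kind: Service
metadata:
  name: internal-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.internal.example.com
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
  - port: 443
    targetPort: 8443
```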
Mixed Experience with the Control-Plane
Most of the issues with the control-plane are related to the IBM-Cloud dashboard in general. The interaction feels slow, the token lifetime is often very short, and the multi-factor authentication is not integrated seamlessly between the infrastructure and service components of the IBM-Cloud. Additionally, the Kubernetes dashboard can be accessed directly from the IBM-Cloud dashboard without a proxy. However, accessing the Kubernetes dashboard this way feels very slow as well, and the information does not always reflect the actual state retrieved via the Kubernetes API. A workaround is accessing the Kubernetes dashboard via kubectl proxy, which results in a nice and instant user experience. Unfortunately this currently only works for cluster admins, as the regular developer group is missing the necessary permissions.
In contrast to the experience with the IBM-Cloud dashboard, interacting with the Kubernetes API feels instant. Retrieving information via the Kubernetes CLI works great. Therefore, the bad experience mostly stems from interacting with the IBM-Cloud API.
IBM-Cloud IAM and Kubernetes RBAC
IBM-Cloud has some pre-defined IAM groups and roles, which are assigned specific permissions inside Kubernetes. However, these pre-defined roles can currently not be used for assigning custom RBAC permissions, as the groups are not visible inside Kubernetes. At the moment, access is granted to a list of individual users via a RoleBinding, so the IBM-Cloud groups are not reusable.
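In practice this means RoleBindings end up looking roughly like the following sketch, where every user is listed individually instead of referencing a single group; the namespace, role and e-mail addresses are placeholders:

```yaml
# Sketch of the current situation: individual users have to be listed,
# because IBM-Cloud IAM groups are not visible as Kubernetes groups.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-developers
  namespace: my-app
subjects:
- kind: User
  name: alice@example.com
  apiGroup: rbac.authorization.k8s.io
- kind: User
  name: bob@example.com
  apiGroup: rbac.authorization.k8s.io
# With group support, a single Group subject could replace the user list:
# - kind: Group
#   name: ibm-developers
#   apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```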
Ingress Controller Limitations
Apart from the private ingress not being HA, there is a limitation when using multiple sub-paths on a single hostname. Currently, all sub-paths have to be specified in a single Kubernetes ingress resource, which can be an issue when multiple applications or teams share the same hostname. Also, some options (e.g. compression) are missing compared to the default nginx ingress controller, so checking the list of available ingress annotations is mandatory before moving to IBM-Cloud Kubernetes. The only workaround to resolve these issues is currently to replace the IBM-Cloud controller with one of the community ingress controllers.
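To illustrate the sub-path limitation, all paths for a shared hostname currently have to be collected in one ingress resource, roughly like this sketch; the hostname, paths and service names are placeholders:

```yaml
# Sketch: all sub-paths for one hostname in a single ingress resource,
# even when the backends belong to different applications or teams.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: shared-hostname
spec:
  rules:
  - host: apps.example.com           # placeholder hostname shared by several teams
    http:
      paths:
      - path: /team-a
        backend:
          serviceName: team-a-service
          servicePort: 80
      - path: /team-b
        backend:
          serviceName: team-b-service
          servicePort: 80
```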
From the time of testing the service back in 2017 until now, a lot of improvements have already been made. Even now, new features are added regularly, which helps mature the service. This, of course, also shows that using the service is not yet a fully polished experience.
Hopefully this overview of some of the positive and negative aspects can help with the decision to go for the IBM-Cloud Kubernetes Service, or to still hold back on deploying critical workloads there. Most of the limitations are minor, but they can have a bigger impact depending on individual use-cases.