A quick overview of GKE Autopilot
Introduction
By now, most folks in the DevOps space know what Kubernetes is. It’s become pretty much the de facto standard for container orchestration and is a much-loved workhorse for IT teams all over the world. If you’ve ever worked with Kubernetes though, you’ll know that learning to manage it properly, especially at scale, can be a bit of a dark art.
Google Kubernetes Engine (GKE) has become a tried and tested, go-to solution for Kubernetes implementations because Google handles much of the hassle of setting up the control plane and managing node pools for you. And who wouldn’t want the people who created Kubernetes managing the back end for you, right?
That said, there is still a bit of a learning curve with GKE, which can leave organizations apprehensive about taking the plunge. Sure, GKE is a managed service, but there’s still a lot to get your head around when it comes to setting up, running, and scaling your clusters.
Yes, Cloud Run is a thing. And yes, Cloud Run is awesome. But Cloud Run also has limitations (which we won’t go into here) that mean it doesn’t offer the capabilities of a fully fledged Kubernetes cluster because, well, that’s not what it’s designed for.
So what other options are there for people who want the power of Kubernetes but not the hassle? Enter GKE Autopilot.
What is GKE Autopilot?
GKE Autopilot is what Google is calling “a revolutionary mode of operations for managed Kubernetes that lets you focus on your software, while GKE Autopilot manages the infrastructure”.
In short, Google is going to manage and optimize your Kubernetes cluster for you.
Rejoice! That sounds pretty awesome! For companies where most of the technical team are developers with little infrastructure experience, and even less time or desire to gain it, this is an amazing option for running containerized workloads on GKE.
There are, of course, some caveats which we’ll talk about later.
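If you want to try it out, spinning up an Autopilot cluster is a single command. Here’s a minimal sketch (the cluster name and region are placeholders):

```sh
# Create an Autopilot cluster; Google manages the control plane and nodes.
gcloud container clusters create-auto my-autopilot-cluster \
    --region=us-central1

# Fetch credentials so kubectl can talk to the new cluster.
gcloud container clusters get-credentials my-autopilot-cluster \
    --region=us-central1
```

Note there are no machine types, node counts, or node pools to choose: that’s the point.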
What’s good about GKE Autopilot?
First, let’s talk about why GKE Autopilot is a really exciting feature. Here are five reasons why you should consider running your workload using GKE Autopilot:
1. Optimization
If you’re inexperienced with Kubernetes, optimizing your cluster can be a huge drain on your time (and patience). Kubernetes can get complex very quickly when running workloads at scale and managing that complexity can be a full-time job for some engineers.
GKE Autopilot applies lessons and best practices from real Google SRE experience. All you need to do is provide the specifications for your workload, and GKE Autopilot will manage and maintain the nodes for you, applying the optimal configuration for your application’s requirements.
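To make “providing the specifications” concrete, here’s a sketch of a Deployment whose CPU and memory requests are what Autopilot uses to size and provision the underlying capacity (the app name and request values are illustrative; the image is Google’s public hello-app sample):

```sh
# Autopilot provisions and scales nodes to fit the resource requests
# you declare, so the requests effectively are the specification.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-app
  template:
    metadata:
      labels:
        app: hello-app
    spec:
      containers:
      - name: hello-app
        image: us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
EOF
```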
2. Security
If you work in an environment where compliance determines whether your company can legally trade, you’ll know that taking security seriously is a critical priority. Securing your cloud environment is a subject unto itself, but the point is that security is a complex and often very time-consuming part of building any cloud infrastructure solution.
GKE Autopilot takes away a huge chunk of that hassle for you. As well as GKE’s built-in security features and hardening standards, Autopilot implements things like Shielded GKE Nodes and Workload Identity. Shielded GKE Nodes are built on Shielded VMs and prevent attackers from exploiting a compromised Pod to impersonate nodes in your cluster. Workload Identity allows you to configure a Kubernetes service account to act as a GCP service account, authenticating against GCP APIs so that you have fine-grained control over individual applications and their permissions on the platform.
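As a rough sketch of what the Workload Identity setup looks like (the project, namespace, and service account names below are all placeholders):

```sh
# Allow the Kubernetes service account to impersonate the GCP service account.
gcloud iam service-accounts add-iam-policy-binding \
    my-gcp-sa@my-project.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"

# Annotate the Kubernetes service account so GKE knows which GCP identity to use.
kubectl annotate serviceaccount my-ksa \
    --namespace my-namespace \
    iam.gke.io/gcp-service-account=my-gcp-sa@my-project.iam.gserviceaccount.com
```

Any Pod running under that Kubernetes service account then authenticates to GCP APIs as the bound service account, with only the IAM permissions you’ve granted it.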
On top of all this, Autopilot will configure, enable, and disable various settings for you in accordance with Kubernetes best practices.
3. Management
With GKE Autopilot, the management of both your nodes and control plane is done for you. This is a very cool feature because control plane and node management are two of the primary reasons people avoid Kubernetes, and probably what earns it the reputation of being “hard”.
What’s even better is that the management of your cluster is done by Google SREs who literally run massive-scale clusters for, well, Google! With that in mind, they probably know a thing or two about what they’re doing. Because of this, Google can now even offer a 99.9% SLA on Autopilot Pods running in multiple zones.
Still want to control maintenance windows and Pod disruption budgets? That’s not a problem: you can still define these yourself, so you get to decide which updates get applied to your cluster and when.
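For example, a PodDisruptionBudget works exactly as it would on a standard cluster, and a recurring maintenance window can be set with gcloud. A quick sketch, reusing the hypothetical names from earlier:

```sh
# Keep at least two replicas of the app available during maintenance.
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: hello-app
EOF

# Restrict automatic maintenance to early weekend mornings.
gcloud container clusters update my-autopilot-cluster \
    --region=us-central1 \
    --maintenance-window-start=2021-03-06T02:00:00Z \
    --maintenance-window-end=2021-03-06T06:00:00Z \
    --maintenance-window-recurrence='FREQ=WEEKLY;BYDAY=SA,SU'
```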
4. Cost
One of the key stumbling blocks of adopting Kubernetes is that costs can quickly spiral as you scale to production, and cost control is one of the main pillars of a successful cloud operating strategy. So one of the best features of GKE Autopilot is that, because it only provisions the resources your applications need, you are only billed for the resources you actually use. This is referred to as per-pod billing.
This takes out all the time-consuming guesswork in terms of trying to figure out the balance between cost and performance optimization for your workloads. You can let Autopilot spin up a Kubernetes cluster for you and know that the cost of that cluster translates directly to the requirements of your application.
If you have a workload with fairly elastic scalability requirements, your billing will flex along with it. No more nerve-wracking presentations to the CTO to justify the overprovisioned capacity on your clusters!
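For instance, if you pair a workload with a HorizontalPodAutoscaler, the Pod count (and therefore your bill) rises and falls with demand. A one-line sketch, reusing the hypothetical hello-app Deployment from earlier:

```sh
# Scale hello-app between 2 and 10 replicas based on CPU usage;
# with per-pod billing, cost tracks the replica count automatically.
kubectl autoscale deployment hello-app --min=2 --max=10 --cpu-percent=70
```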
5. It’s still Kubernetes under the hood
If you’re a bit more experienced in the world of Kubernetes, or you know that your workload has some complex requirements, don’t panic. Google has claimed from the outset that “Autopilot is GKE” and fully believes you shouldn’t have to compromise. GKE Autopilot still supports things like StatefulSets, DaemonSets, third-party monitoring, and Helm charts.
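For example, installing a standard Helm chart works the same way it would on any other cluster. A quick sketch using a public chart (the release name is arbitrary):

```sh
# Add a public chart repository and install a chart onto the
# Autopilot cluster, exactly as you would on Standard GKE.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install my-nginx bitnami/nginx
```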
There’s lots more on the roadmap too so if something you need isn’t supported yet there’s a good chance it will be in the future.
So what’s the catch?
Make no mistake, GKE Autopilot looks, on paper, to be a superb choice for anyone wanting to just focus on their applications and not have to worry about managing the infrastructure to run them. However, there are limitations to this freedom.
The key takeaway from the material I’ve read is that GKE Autopilot, by its very nature, takes away most of the control you have over your clusters.
“Hold on, I thought that was the point?” I hear you say. Sure, that’s exactly what it’s for. However, if you’ve got a well-established DevOps capability where you’re using things like Terraform and Helm to configure your clusters, Autopilot is probably not going to be a good fit for you.
For a start, you can’t deploy it via Terraform (at least not at the time of writing this article). That might not be such a big deal, but as I said before, if you have a well-established, fully automated Infrastructure as Code pipeline, you may now have a somewhat manual deployment for your new Kubernetes cluster.
There are also quite a few features that are not currently supported on GKE Autopilot clusters. Some of these make sense given that Google SREs are managing the clusters for you, but you might still need to consider whether this affects your current processes and governance.
Unsupported features
Below are some of the features that are not supported.
Application-layer secrets encryption
This is an additional layer of encryption that can be used for encrypting secrets in etcd using Cloud KMS.
Binary authorization
Allows you to automatically check the quality and integrity of software in your supply chain during image deployment.
Customer-managed encryption keys (CMEK)
You can take ownership and control of the provisioning and rotation of encryption keys in GCP instead of allowing Google to manage this for you. Some organizations require this as part of their wider security and governance program.
Google Group RBAC
Previously with Role-Based Access Control, you could only use Google user accounts or GCP service accounts. Google Groups for GKE allows you to use Google Groups for RBAC, allowing you to simplify your IAM implementation.
Kubernetes Alpha APIs
Some of the most experimental Kubernetes features are released through the alpha APIs ahead of general availability. These allow you to experiment with roadmap features but are never recommended for use in production.
Legacy authentication options
Legacy authentication methods are not recommended for GKE workloads and are disabled by default from GKE version 1.12. These methods include x509 client certificates and static passwords.
Container Threat Detection
Container Threat Detection is a built-in service for Security Command Center that monitors the state of container images and evaluates changes to them to detect threats and malicious behavior.
Privileged Pods
Privileged Pods are typically used for making administrative changes to nodes, and Autopilot does not allow changes to nodes (see the sketch after this list).
Pod Security Policies
Because GKE Autopilot enforces some security settings for you, PodSecurityPolicy, OPA Gatekeeper, and Policy Controller are not supported on GKE Autopilot clusters.
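To illustrate the privileged Pods restriction, a spec like the sketch below, which requests privileged access to the host, would be rejected by Autopilot’s admission controls (the names are illustrative):

```sh
# Autopilot rejects workloads that request privileged mode,
# since a privileged container could modify the underlying node.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
spec:
  containers:
  - name: admin-tool
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
EOF
```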
Unsupported Add-ons
In addition to the above, GKE Autopilot does not support the following Add-ons:
- Cloud Build
- Cloud Run
- Cloud TPU
- Config Connector
- Istio
- Kalm
- Usage Metering
That’s quite a lot of caveats, and in all honesty there are a whole lot more, such as no SSH access to your nodes, no certificate signing requests, and no ability to convert an Autopilot cluster into a Standard GKE cluster.
Conclusion
So is GKE Autopilot right for you? Well, it depends. I see this as a really great solution for PoC application builds where you just want to get the thing running in a dev environment but you know that Cloud Run isn’t going to cut it for your requirements.
It’s also a great option for companies where the developers have little to no experience with infrastructure and want a secure, optimized, and production-ready environment to deploy their applications to without the overhead of managing it.
All in all, I’m glad there’s another option besides GKE Standard and Cloud Run for container workloads in GCP and I look forward to seeing where this new “serverless” option will take us with cloud-native, containerized application development in the future.