A Summary of KubeCon for Busy People
By Paul Burt
Published 25th November 2024
Looking for a summary of a few of the topics covered during KubeCon NA 2024 in SLC? Consider this post a quick and easy way to catch up on the news at and surrounding KubeCon.
KubeCon was split across three days, with a major focus area for each day. Day one went to artificial intelligence and platform engineering. Day two to security. And finally, day three was about the community looking back and looking forward.
AI and the unique attributes of those types of workloads were a major overarching theme. This theme has persisted from past KubeCons and is likely to continue at future events.
If you’re looking for a single quick takeaway, it’s that AI/ML workloads are unique, and we need to modify our solutions and approaches to fit those distinct characteristics. The ideology applied to cloud native and most solutions still works well. The techniques and technologies just need to be adapted to new requirements. For instance, consider the longer-lived requests for an answer from an LLM. Best practices are still being discovered and shared. Other important and buzzy topics from KubeCons past also carried over, including security, networking, multi-tenancy, eBPF, developer experience, and Wasm.
Anyways, here are the highlights and announcements from KubeCon NA 2024 SLC!
Day one, AI and platform engineering
KubeCon day one kicked off with a non-AI and non-platform engineering topic, but one that’s still very important. That is, fighting patent trolls. The CNCF announced an initiative to rally the community to help find prior art and other info that’s critical for fighting off patent trolls.
The initiative is called the Cloud Native Heroes Challenge. Read more about how to get involved at the link.
Other highlights from day one include:
- In 2020 CERN processed 100+ petabytes of data, and they expect to 10x that by 2030.
- Kueue is the project CERN expects to help them meet that challenge.
- A 5G satellite network is running on cloud native tech like k3s.
- Close to 80% of interruptions tracked across 50 days of training Meta’s Llama model were due to hardware issues.
- CoreWeave highlighted how they perform additional checks and collect additional metrics to cut failure rates for jobs in half. For improved reliability, metrics like temperature and hardware health need to be collected in addition to the usual suspects.
- Lunar shared how platform engineering principles translate nicely when working with GenAI. They are a bank, and now see 60% of their customer interactions resolved with AI.
The days leading up to KubeCon also saw a number of announcements and donations to the CNCF.
- Intel started Open Platform for Enterprise AI, or OPEA. It’s now being donated to the Linux Foundation (the organisation that governs the CNCF).
- Think of OPEA like a reference architecture. It’s a set of patterns for building microservices and other infra that supports GenAI.
- wasmCloud was accepted as an incubating project.
- Ever wanted to run Wasm at scale instead of containers? wasmCloud is worth checking out.
- Karpenter’s beta for 1.0.0 is here.
- Karpenter is a project from AWS for managing compute capacity, and works by watching for pods that are unschedulable.
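To make "watching for pods that are unschedulable" a little more concrete, here is a minimal Python sketch. It is not Karpenter’s code (Karpenter is written in Go and does far more). It only shows the signal involved: the Kubernetes scheduler sets a `PodScheduled` condition with status `False` and reason `Unschedulable` on pods it cannot place. The pod dictionaries and the `unschedulable_pod_names` helper are made up for illustration.

```python
from typing import Any


def is_unschedulable(pod: dict[str, Any]) -> bool:
    """Return True if the scheduler has marked this pod as unschedulable."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "PodScheduled"
        and c.get("status") == "False"
        and c.get("reason") == "Unschedulable"
        for c in conditions
    )


def unschedulable_pod_names(pods: list[dict[str, Any]]) -> list[str]:
    """Names of pods that are waiting for capacity (hypothetical helper)."""
    return [p["metadata"]["name"] for p in pods if is_unschedulable(p)]


if __name__ == "__main__":
    # Hand-written sample data shaped like trimmed-down Pod objects.
    sample_pods = [
        {
            "metadata": {"name": "web-1"},
            "status": {"conditions": [{"type": "PodScheduled", "status": "True"}]},
        },
        {
            "metadata": {"name": "train-7"},
            "status": {
                "conditions": [
                    {
                        "type": "PodScheduled",
                        "status": "False",
                        "reason": "Unschedulable",
                    }
                ]
            },
        },
    ]
    print(unschedulable_pod_names(sample_pods))
```

A capacity manager would then decide what compute to add for the pods in that list.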
If I had to recommend just one video to watch related to the themes of AI and platform engineering, I would check out Idit Levine’s presentation. It’s on the many ways that GenAI traffic and web traffic differ, and why it’s smart to look to Envoy as an LLM Gateway.
What is an LLM Gateway? The goal is similar to a service mesh, where there’s a lot of logic needed for managing traffic, so you want to pull the logic for handling that outside of your actual apps. Idit points out that traffic for GenAI is quite different from the usual web traffic we see. Web traffic might send a response within milliseconds, while an LLM might need seconds to minutes to send a response. As a result, it’s clear that LLM usage is different enough to warrant a new product category like LLM Gateways.
You might find you want to switch which LLM is providing answers for you based on current costs and time to respond. Traditionally, changing that might mean you need to re-deploy all of your apps to change your LLM target. Contrast that with a solution like Envoy, where you can push configuration changes to the proxies all at once.
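As a rough illustration of that routing decision, here is a small Python sketch. It is not Envoy configuration and not anything shown in the talk. It only models the choice a gateway could centralise: pick the cheapest LLM backend that fits a latency budget. The `Backend` type, the backend names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    cost_per_1k_tokens: float  # hypothetical price
    typical_latency_s: float   # hypothetical observed time to respond


def pick_backend(backends: list[Backend], max_latency_s: float) -> Backend:
    """Cheapest backend within the latency budget, else the fastest one."""
    if not backends:
        raise ValueError("at least one backend is required")
    eligible = [b for b in backends if b.typical_latency_s <= max_latency_s]
    if eligible:
        return min(eligible, key=lambda b: b.cost_per_1k_tokens)
    # Nothing meets the budget, so fall back to the fastest backend.
    return min(backends, key=lambda b: b.typical_latency_s)


if __name__ == "__main__":
    backends = [
        Backend("model-a", cost_per_1k_tokens=0.5, typical_latency_s=2.0),
        Backend("model-b", cost_per_1k_tokens=0.1, typical_latency_s=8.0),
        Backend("model-c", cost_per_1k_tokens=0.3, typical_latency_s=3.0),
    ]
    print(pick_backend(backends, max_latency_s=4.0).name)
```

When this logic lives in the gateway, changing prices or latency budgets means updating one policy rather than redeploying every app that calls an LLM.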
Day two, Security
Similar to day one, the security day had a few additional themes. Contributors were celebrated with an award ceremony, with Tim Hockin landing a lifetime achievement award.
Other highlights from day two include:
- A closer look at the history of in-toto, a supply chain security tool.
- We’re all invited to get involved with the CNCF security community by checking out the OpenSSF or the security TAG.
- The GUAC project was highlighted as a great way to visualise points of failure in your software supply chain.
- Predictions were made about AI BOMs becoming a standard in the future.
- Reference architectures were announced by the End-user TAB.
- Currently, you can view architectures from Adobe and Allianz-Direct.
- Bringing back the CNCF technology landscape radar.
- Think of it like a report on the maturity and usage of cloud native tech.
Similarly, some of the announcements at or leading up to KubeCon got some stage time or were announced during speaking sessions.
- Red Hat promised to donate Podman and bootc to the CNCF.
- A collaboration between engineers at Bloomberg and Tetrate produced Envoy AI Gateway.
- Solo.io’s Gloo project has been donated to the CNCF under a new name, K8sGateway.
- Solo.io suggested it's best thought of as a way to create consistency and unify approaches for ingress, egress, and east-west traffic.
- Prometheus released v3.0. The announcement and highlights are on their blog and in a recorded talk.
- Prometheus is one of the most popular ways to collect and query metrics about cloud native workloads.
If I had to recommend just one video to watch related to security, it would be Mish-mesh: Abusing the service mesh to compromise Kubernetes environments by H. Ben-Sasson and N. Ohfeld. This presentation is about legitimate features that attackers might use to escalate privileges. In other words, it’s nothing like exploiting out-of-date software. The talk showcased one compromise of Linkerd and one of Istio.
For the Linkerd case, the Wiz researchers made some assumptions about how Azure’s ML infrastructure works. They assumed it was likely Kubernetes, and their first goal was to learn more about what they were allowed to interact with, and what they were not. They focused on a part of the Azure ML system that allows you to input a URL for the system to analyze. They found the URL was not sanitized, which they smartly saw as a Server-Side Request Forgery (SSRF) opportunity.
Armed with this new knowledge, Ben-Sasson and Ohfeld decided to use the form to scan the entire port range of localhost (127.0.0.1). After filtering the results, they found that port 4191 was open. That’s the Linkerd sidecar container port. Researching Linkerd, they found endpoints for /shutdown, /env.json, and /metrics. Is the metrics endpoint useful? Sure is! It gives IP addresses for internal hosts, ports, and service account names.
Fast forwarding, this eventually led to them gaining access to Prometheus, GoldPinger, Nginx Ingress Controller, and Secret Store Metrics. For details on how they got there, I highly recommend checking out the talk.
Ben-Sasson and Ohfeld closed by noting that both Linkerd and Istio are robust solutions. Any solution will have some kind of attack surface area. The best way to deal with that threat is defence in depth. They had more specific recommendations, as well.
- Observability features are valuable, but they should only be accessible from trusted environments.
- Assess new Kubernetes components with an offensive outlook.
- Properly segment your Kubernetes networks.
- Separate the data plane from the control plane.
- Enforce critical rules at the Kubernetes level in addition to the service mesh.
- Use multiple security barriers.
- Assume the first line of defence will always be bypassed. What’s beyond that?
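As one small example of an extra barrier, here is a Python sketch of the kind of check that was missing from the URL form in the Linkerd case: refusing user-supplied URLs that point at loopback, private, or link-local addresses. This is a generic illustration, not what Azure or the Wiz researchers implemented, and `is_safe_url` is a hypothetical helper name. It is also only one layer. A hostname can resolve differently between the check and the actual fetch, so network-level controls are still needed.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_safe_url(url: str) -> bool:
    """True only for http(s) URLs whose host resolves to public addresses."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except (OSError, UnicodeError):
        return False  # could not resolve the host
    for info in infos:
        # Drop any IPv6 scope suffix (for example "%eth0") before parsing.
        address = ipaddress.ip_address(info[4][0].split("%")[0])
        # Reject loopback, private, link-local, and other non-public ranges.
        if not address.is_global:
            return False
    return bool(infos)


if __name__ == "__main__":
    for candidate in (
        "http://127.0.0.1:4191/metrics",
        "http://10.0.0.5/",
        "file:///etc/passwd",
    ):
        print(candidate, is_safe_url(candidate))
```

With a check like this in front of the form, the localhost port scan described above would have been rejected at the first step, though the recommendations in the list still apply behind it.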
To hear more about how the Wiz team found issues with Linkerd and Istio, check out the full talk.
Day three, looking forward and back
Day three kicked off with a presentation about the 12 Factor app and the historical context of where the manifesto came from. Undoubtedly the most fun of the day three keynotes was the Family Feud style game show, pitting Kubernetes aficionados against each other.
Other highlights from day three include:
- A review of the traits that make a successful project in the CNCF by the technical oversight committee (TOC).
- These include clear governance, a regular release cadence, a clearly defined roadmap, and a well-defined scope.
- The most successful projects are simple to adopt and abstract away complexity from the practitioner.
- Congratulations to the projects that graduated this year. Those are cert-manager, Dapr, KubeEdge, and Falco.
- The observation that when Kubernetes was released, it was unfinished. Kubernetes is still not “done” yet. The tech we use is always going to be evolving, and that’s part of what makes the cloud native community fun to be a part of.
Similar to previous days, some of what was announced leading up to KubeCon got some stage time, like:
- Heroku open sourcing the 12 Factor App.
- The 12 Factor app was a set of principles proposed as best practices for developing SaaS apps in the early days.
- Microsoft donating Hyperlight to the CNCF as a sandbox project.
- Hyperlight creates microVMs in as little as one to two milliseconds. The use case is to run individual functions in these tiny VMs, just for the lifecycle of the function.
If I had to recommend just one video to watch related to looking forward and looking back, I’d check out the presentation by Andrew L'Ecuyer of Crunchy Data, Engineering a Kubernetes Operator: Lessons Learned from Versions 1 to 5. Operators are essential for running any kind of stateful workload on Kubernetes. This was a look at Crunchy Data’s Postgres operator, which goes by the short name PGO.
Andrew’s talk covered three major areas. They are high availability (HA), upgrades, and disaster recovery.
High availability for an Operator is complex, because you’re managing both the availability of the Operator itself and the availability of whatever it happens to be managing. Additionally, at every step of making the Operator they were faced with a design choice between doing things the Kubernetes way or by using standard Postgres tools.
Andrew noted that versions 1 through 3 of the Operator worked well, but there was definitely some room for improvement. Specifically, there was only a single instance of the Operator deployed in these early versions. So, the Operator crashing could result in your Postgres instances being unmanaged. Additionally, he noted that all operators use a queuing mechanism to capture and respond to events in a K8s cluster. This is problematic if multiple DBs crash at the same time. You don’t want to be stuck waiting in line for the Operator to take action.
For upgrades, Andrew found that different strategies worked best for minor and major version updates. Minor updates were relatively simple, since there are usually API compatibility guarantees. A rolling update strategy works well in that case.
For major updates, Crunchy Data needed to use the PGUpgrade API. This is a process that can potentially result in brief downtime, so a design decision was made to ask engineers to annotate the instances they wanted to upgrade.
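The opt-in idea can be sketched in a few lines of Python. This is not PGO’s code or API. The annotation key, the `postgresVersion` field, and the `clusters_to_upgrade` helper are all assumptions made up for illustration. The point is only that a major upgrade is skipped unless someone has explicitly annotated the instance.

```python
from typing import Any

# Hypothetical annotation key, not the one PGO actually uses.
UPGRADE_ANNOTATION = "example.com/allow-major-upgrade"


def clusters_to_upgrade(clusters: list[dict[str, Any]], target_version: int) -> list[str]:
    """Names of clusters that opted in to an upgrade to target_version."""
    selected = []
    for cluster in clusters:
        metadata = cluster.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        current = cluster["spec"]["postgresVersion"]  # assumed field name
        opted_in = annotations.get(UPGRADE_ANNOTATION) == str(target_version)
        # Only upgrade when the engineer opted in and the cluster is behind.
        if opted_in and current < target_version:
            selected.append(metadata["name"])
    return selected


if __name__ == "__main__":
    sample = [
        {
            "metadata": {"name": "db-a", "annotations": {UPGRADE_ANNOTATION: "17"}},
            "spec": {"postgresVersion": 16},
        },
        {"metadata": {"name": "db-b"}, "spec": {"postgresVersion": 16}},
    ]
    print(clusters_to_upgrade(sample, target_version=17))
```

Making the annotation carry the target version means a stale opt-in from an earlier upgrade cannot trigger a later one by accident.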
For disaster recovery, they realised these features might serve additional purposes. For example, PGO might need to cover scenarios that involved crossing k8s cluster and potentially even cloud boundaries. That meant that the DR features also had the potential to assist with data mobility.
Andrew wrapped up the talk with some lessons learned. The two that stood out to me were:
- Prevention is better than preparedness (although, you certainly need to do both).
- Operators will ultimately want to combine both Postgres native solutions and Kubernetes native solutions. For example, pgBackRest and k8s volume snapshots working in concert can help to prevent corruption.
List of all KubeCon keynotes & talks
If you’re curious to hear and see more about KubeCon NA 2024 in SLC, you can find a list of all of the keynotes and sessions in this CNCF playlist. As of this writing, there are 373 videos in the playlist. So, if you are browsing, I recommend checking the Sched app for better descriptions of sessions, and tools for filtering by topic.
I would also be remiss not to mention that Northflank’s CEO, Will Stewart, gave a lightning talk at Platform Engineering Day. You can catch his ~6 minute talk on creating better abstractions for platform engineering.