3 Important Optimizations Of Kubernetes For AI Workloads

  • Key takeaway: vendors with experience in distributed cloud computing, high-volume ingress and egress control (for example, telecoms) and an API security strategy are critical to mitigating risk when updating a network infrastructure for AI.


With the excitement and energy around AI’s potential to revolutionize the way we communicate and do business, this is an ideal window to remind network managers of the critical role Kubernetes plays as the underlying container orchestration platform used throughout typical AI workflows. Many businesses and organizations may have already shifted some workloads from traditional or virtual infrastructure (where applications and their operating systems are limited by the assets and networking of a physical server) to a Kubernetes-based containerized one.

Because containers in a Kubernetes infrastructure are decoupled from the server, they are more agile: each container runs independently with its own ready-to-run software package built in, enabling more granular orchestration. Containers are abstracted at the OS layer, are portable and can be spun up and down in seconds or milliseconds.

Until now, Kubernetes deployments have been driven by both IT performance and revenue outcomes. AI models and applications are pushing those functional Kubernetes needs even further.

Preparing for more data volume than ever

AI data movement and compute – which can reach up to 100x today’s volumes – will also amplify both known and unknown problems in operations, functionality and security at a similar magnitude on your legacy infrastructure. Planning with a prevention mindset is key, especially if you plan to deploy AI in greenfield areas like the edge.


There are many complex IT issues that can be overwhelming, but here are three key Kubernetes priorities that we believe should be optimized for any successful AI deployment – whether your AI program is managed from Network Operations, Machine Learning Operations, AI Operations or even Security Operations.
 

1. Kubernetes is a necessity for distributed computing infrastructure that delivers high-volume AI workloads

Distributed computing systems comprise multiple computers, databases and devices, connected by a network, that work together on separate tasks. Workloads are split and run concurrently, adapting rapidly to deliver more than any single server could – including backup and security functions. Distributed computing infrastructure is critical for high-performance computing (HPC) acceleration of AI workloads, with features such as automated updates, self-healing, on-demand scaling, accelerated load balancing and dynamic storage. Plan your distributed computing networks for maximum security as well as resilience, and leverage the managed Kubernetes services of your multicloud hyperscalers (such as Amazon EKS or Microsoft AKS).
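To ground this, here is a minimal sketch of on-demand scaling using the official Kubernetes Python client. The deployment name ai-inference, the namespace and the scaling thresholds are illustrative assumptions, not a definitive setup:

```python
# Sketch: on-demand scaling for an AI-serving deployment via a
# HorizontalPodAutoscaler, using the official Kubernetes Python client.
# The deployment name "ai-inference" and the thresholds are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="ai-inference-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="ai-inference"
        ),
        min_replicas=2,                        # keep a warm baseline
        max_replicas=20,                       # cap burst scale-out
        target_cpu_utilization_percentage=70,  # add replicas above 70% CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```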

Kubernetes’ basic resource management functionality is elevated even further for AI: managing pods for CPU utilization, dynamically resizing clusters and defining the CPU and memory each container needs to run. Kubernetes also already supports GPU-equipped nodes, exposing accelerators to containers through device plugins.
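As an illustration of per-container resource definitions, the following hedged sketch (Kubernetes Python client again; the image name and the CPU, memory and GPU figures are placeholders) requests CPU, memory and one GPU for a training pod. It assumes the cluster’s GPU nodes expose nvidia.com/gpu through the NVIDIA device plugin:

```python
# Sketch: declaring CPU/memory requests and limits plus one GPU for a
# training container. The image and figures are placeholders; GPU scheduling
# assumes the NVIDIA device plugin is installed on the cluster's GPU nodes.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="trainer"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/ai/trainer:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi"},
                    limits={"cpu": "8", "memory": "32Gi",
                            "nvidia.com/gpu": "1"},  # one GPU per pod
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```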
 

2. Kubernetes ingress adds new challenges to the data center


A functional shift often overlooked is that Kubernetes requires specific configuration to maximize container performance. In a containerized environment, connectivity within a cluster happens horizontally (east-west) between pods. But Kubernetes network design requires an infrastructure component to provide north-south traffic ingress for clustered applications: the Kubernetes ingress controller. Ingress controllers are a point of control and load balancing that routes layer 7 traffic from the internet to the internal clusters.


Kubernetes ingress control is not a default network function; it is provided by third parties, and how incoming service traffic to the Kubernetes cluster is managed and integrated is determined by the infrastructure provider. Depending on your configuration and service partners, customization may be needed across clusters to achieve your unique scale and performance targets. When you consider that some organizations currently manage up to 50 production clusters,* it’s important to plan the customization and vendor-partnership capabilities needed to run Kubernetes at the supercomputing scale that AI workloads and apps require.
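For illustration, here is a hedged sketch of an Ingress object that routes layer 7 traffic to an in-cluster inference service. The hostname, service name and path are assumptions, and the object does nothing by itself: a third-party ingress controller (for example, NGINX) must be running in the cluster to act on it:

```python
# Sketch: an Ingress routing layer 7 traffic from the internet to an
# in-cluster inference service. Host, service and path are placeholders;
# a third-party ingress controller must be installed to honor this object.
from kubernetes import client, config

config.load_kube_config()

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="inference-ingress"),
    spec=client.V1IngressSpec(
        rules=[
            client.V1IngressRule(
                host="inference.example.com",  # placeholder hostname
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/predict",
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="inference-svc",
                                    port=client.V1ServiceBackendPort(number=80),
                                )
                            ),
                        )
                    ]
                ),
            )
        ]
    ),
)

client.NetworkingV1Api().create_namespaced_ingress(
    namespace="default", body=ingress
)
```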
 

3. AI models rely on APIs—which increase security risks

As discussed above, Kubernetes ingress (and egress) control not only offers a management point between the internet and the clusters, it is also an ideal place to introduce incremental security – such as authentication and authorization – at these new-to-the-network vulnerability points.
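As one hedged illustration, many ingress controllers accept authentication settings through annotations. The snippet below assumes the NGINX ingress controller and a pre-existing basic-auth Secret named api-users; other controllers use different annotation keys:

```python
# Sketch: adding basic authentication at the ingress point via annotations.
# These annotation keys are specific to the NGINX ingress controller and
# assume a Secret named "api-users" already holds the credentials. This
# metadata would replace the V1ObjectMeta in the Ingress sketch above.
from kubernetes import client

auth_metadata = client.V1ObjectMeta(
    name="inference-ingress",
    annotations={
        "nginx.ingress.kubernetes.io/auth-type": "basic",
        "nginx.ingress.kubernetes.io/auth-secret": "api-users",
        "nginx.ingress.kubernetes.io/auth-realm": "Authentication Required",
    },
)
```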

However, the bigger risks are application programming interfaces (APIs). APIs are sets of prebuilt rules and protocols that connect two different software systems at their endpoints, avoiding custom coding for connectivity. AI applications communicate with AI models almost exclusively via APIs. Different types of APIs are deployed across the different model categories (LLMs, generative AI, etc.), all intended to simplify and speed deployment. Their ubiquity, however, also adds functional complexity and even more security risk. Given their sheer quantity, their locations throughout the ecosystem and the need to discover and manage them, API security now has its own OWASP Top 10 list. API security risks include visibility and management of all APIs, API authorization and authentication, and access to sensitive business data.
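To sketch the authentication piece of that list (FastAPI is chosen here for brevity; the endpoint path, token handling and response are illustrative placeholders, not a production pattern), an inference API should at minimum verify caller credentials before a request ever reaches the model:

```python
# Sketch: a minimal bearer-token check in front of a model-inference
# endpoint. The framework choice and token logic are illustrative only;
# production APIs would use a real identity provider and scoped authorization.
import os

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
API_TOKEN = os.environ["INFERENCE_API_TOKEN"]  # injected, never hard-coded

@app.post("/v1/predict")
def predict(payload: dict, authorization: str = Header(default="")):
    # Reject any caller that does not present the expected bearer token.
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid credentials")
    # Placeholder for the actual model call.
    return {"result": "ok"}
```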

To recap: when building for new AI models and applications, Kubernetes management and optimization will matter even more than they do for current IT and business functionality. Experience with distributed cloud computing, high-volume ingress control for telecoms and an API security strategy will be instrumental in reducing risk when updating a network infrastructure for AI. As a leader in multicloud application security and delivery for global telecoms and enterprises, F5 is committed to maximizing the benefits of Kubernetes and cloud-native infrastructure for successful AI deployments.

* Cloud Native Computing Foundation 2023 annual user survey dataset
