Optimizing the use of resources is a universal aspiration, but achieving it is far more complex than mere words might suggest. The process requires extensive performance testing, accurate server sizing, and numerous adjustments to resource specifications. These challenges persist and, indeed, become more nuanced within a Kubernetes environment than in traditional systems. At the heart of building a high-performance and cost-effective Kubernetes cluster is the art of efficiently managing resources by customizing your Kubernetes workloads.
To understand the intricacies of Kubernetes, it is essential to know the various components that interact when you deploy applications on a k8s cluster. While researching this article, I came across an enlightening LinkedIn article highlighting the tendency of enterprises to overprovision their Kubernetes clusters. In this piece, I propose ways companies can improve cluster efficiency and reduce costs.
Before proceeding, it is crucial to familiarize yourself with the terminology that will be prevalent in this article. This foundational section is designed to equip the reader with the necessary knowledge for further in-depth research.
Understanding the basics
- Pods: A Pod is the smallest deployable and manageable unit in Kubernetes, consisting of one or more containers that share storage, a network, and a specification for how to run those containers (see the minimal manifest after this list).
- Replicas: Replicas are multiple identical Pod instances maintained by a controller for redundancy and scalability, ensuring that the observed state matches the desired state.
- Deployments: A Deployment is a higher-level abstraction that manages the lifecycle of Pods and ensures that the specified number of replicas are running and up to date.
- Nodes: Nodes are physical or virtual machines that make up a Kubernetes cluster, each responsible for running Pods or workloads and providing the necessary server resources.
- Kube-scheduler: The kube-scheduler is a critical control-plane component that selects the most appropriate node to run each Pod based on resource availability and other scheduling criteria.
- Kubelet: The kubelet runs on every node in the cluster and ensures that the containers described in each Pod spec are running. It also manages the container lifecycle, monitors container health, and carries out start and stop instructions from the control plane.
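To make these definitions concrete, here is a minimal sketch of a Pod manifest; the name and image are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # illustrative name
spec:
  containers:
    - name: web                # a single container sharing the Pod's network and storage
      image: nginx:1.17
      ports:
        - containerPort: 80

In practice you rarely create bare Pods like this; they are usually managed by a higher-level object such as a Deployment, as shown later in this article.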
Understanding Kubernetes resource management
In Kubernetes, when you deploy a pod you can specify the CPU and memory it needs, a decision that shapes the performance and stability of your applications. The kube-scheduler uses the resource requests you set to determine the optimal node for your pod, while the kubelet enforces resource limits, ensuring that containers operate within their allocated share.
- Resource requests: A request guarantees that a container will have at least the specified amount of CPU or memory available. The kube-scheduler uses these requests to find a node with enough free capacity to host the pod, with the goal of distributing workloads evenly.
- Resource limits: Limits, on the other hand, act as a safeguard against overuse. If a container exceeds them, it faces consequences such as CPU throttling or, in the case of memory, termination (an OOM kill) to prevent resource starvation on the node.
The Deployment below sets both requests and limits for an nginx container:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example-container
          image: nginx:1.17
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"
Let’s break these concepts down with two illustrative cases:
Case 1 (requests without limits)
Imagine a pod with a memory request of 64Mi and a CPU request of 250m on a well-resourced node with 4GB of memory and 4 CPUs. Without defined limits, this pod can use more resources than it requested, borrowing from the node's spare capacity. However, this freedom comes with possible side effects: it can affect the availability of resources for other pods and, in extreme cases, cause system components such as the kubelet to become unresponsive.
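As a minimal sketch (the container name is illustrative), a requests-only spec looks like the fragment below and can be dropped into the Deployment template shown earlier:

spec:
  containers:
    - name: burstable-container   # illustrative name
      image: nginx:1.17
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        # no limits section: the container may burst into the node's spare capacity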
Case 2 (with defined requests and limits)
In another scenario, a pod with a memory request of 64Mi and a limit of 128Mi, together with a CPU request of 250m and a limit of 500m, runs on the same resource-rich node (exactly the configuration of the Deployment manifest above). Kubernetes reserves the requested resources for this pod but strictly enforces the limits. If the container exceeds its memory limit, the kubelet may terminate and restart it (an OOM kill); if it exceeds its CPU limit, it is throttled, maintaining a harmonious balance on the node.
The double-edged sword of CPU limits
CPU limits are designed to protect a node from overuse, but they can be a mixed blessing: they trigger CPU throttling, affecting container performance and response time. Buffer noted this, observing containers that experienced throttling even when CPU usage was below the defined limits. To address it, they isolated "No CPU Limits" services on specific nodes and fine-tuned CPU and memory requests through vigilant monitoring. While this strategy reduced container density, it also improved service latency and performance, a delicate trade-off in the pursuit of optimal resource utilization.
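One way to implement this kind of isolation, as a hedged sketch, is to label a dedicated node pool yourself and steer the limit-free services onto it with a nodeSelector; the label key and value here are assumptions, not anything Buffer published:

spec:
  nodeSelector:
    workload-class: no-cpu-limits      # custom label you apply to the dedicated nodes
  containers:
    - name: latency-sensitive-service  # illustrative name
      image: nginx:1.17
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          memory: "512Mi"              # memory limit kept; CPU limit deliberately omitted

Adding a matching taint to those nodes (and a toleration here) would additionally keep other workloads off them.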
Understanding Kubernetes scaling
Now that we’ve covered the critical roles of requests and limits in workload deployment, let’s explore their impact on Kubernetes’ automated scaling. Kubernetes offers two primary scaling mechanisms, one for pod replicas and one for cluster nodes, both critical to maximizing resource utilization, cost efficiency, and performance.
Horizontal Pod Autoscaling (HPA)
Horizontal Pod Autoscaling (HPA) dynamically adjusts the number of pod replicas in a Deployment or ReplicaSet based on observed CPU utilization, memory utilization, or other specified metrics. It scales the number of pods horizontally, not to be confused with vertical scaling, which increases the resources of existing pods. HPA operates within defined minimum and maximum replica counts and relies on metrics provided by the cluster’s metrics server to make scaling decisions. It is important to specify CPU and memory requests in your pod specifications, as they inform HPA’s understanding of each pod’s resource usage and guide its scaling actions. HPA evaluates resource usage at regular intervals, scaling the number of replicas up or down to meet the target metrics efficiently. This ensures that your application maintains performance and availability even as workload requirements fluctuate.
The example below automatically adjusts the number of pod replicas within a range of 1 to 10 based on CPU utilization, with the goal of maintaining an average CPU utilization of 50% across all pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
Automatic cluster scaling
The Cluster Autoscaler automatically resizes the cluster’s node pool so that every pod has a place to run and there are no unnecessary nodes. It adds nodes during periods of high demand, when pods fail to start due to insufficient resources, and removes nodes when they are underutilized. The autoscaler evaluates scaling needs based on pod resource requests: pods that cannot be scheduled for lack of resources trigger it to add nodes, while nodes that are underutilized over a period of time, and whose pods can comfortably be moved elsewhere, are considered for removal. This keeps cluster operation economical while preserving performance.
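The Cluster Autoscaler typically runs as a workload inside the cluster and is configured through command-line flags on its container; the exact flags, image version, and node-group names depend on your cloud provider, so treat the fragment below as an illustrative sketch rather than a ready-to-use configuration:

spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # use the release matching your cluster version
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws                      # assumption: adjust to your provider
        - --nodes=2:10:my-node-group                # min:max:node-group-name (hypothetical node group)
        - --scale-down-utilization-threshold=0.5    # nodes below 50% utilization become scale-down candidates
        - --scale-down-unneeded-time=10m            # how long a node must stay unneeded before removal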
Conclusion
Optimization is not a one-time event but a continuous process. Rigorous load testing is critical to understanding application performance under varying levels of demand, and monitoring tools such as New Relic, Dynatrace, or Grafana can reveal resource consumption patterns. Take the average resource utilization from several load tests and add a buffer of 10-15% to accommodate unexpected spikes, adjusting as needed for your specific application. For example, if a service averages 200m of CPU across several load tests, a 15% buffer suggests a request of roughly 230m.
Once you have determined baseline resource needs, deploy workloads with appropriately configured requests and limits. Ongoing monitoring remains paramount to ensure that resources are used effectively: set up comprehensive alerting to notify you of both underutilization and potential performance issues, such as slowdowns. This vigilance ensures that your workloads not only run, but run optimally.
Finally, organize your infrastructure by creating different node groups tailored to different types of applications, such as those that require GPUs or large amounts of memory. In cloud environments, smart use of spot instances can lead to significant cost savings; however, always reserve these instances for non-critical applications to protect business continuity if the cloud provider reclaims the capacity. A sketch of how to steer such workloads onto spot nodes follows below.
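As a closing sketch, assuming you label your spot node pool with a custom label and taint it so that only tolerant workloads land on it (the label and taint names here are illustrative, not provider defaults), a non-critical Deployment’s pod template might include:

spec:
  nodeSelector:
    node-lifecycle: spot            # custom label applied to your spot node pool (assumption)
  tolerations:
    - key: "node-lifecycle"         # matching taint placed on the spot nodes (assumption)
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  containers:
    - name: batch-worker            # illustrative non-critical workload
      image: nginx:1.17
      resources:
        requests:
          cpu: "250m"
          memory: "128Mi"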