How Salesforce Migrated from Cluster Autoscaler to Karpenter Across Their Fleet of 1,000 EKS Clusters
These articles are AI-generated summaries. Please check the original sources for full details.
Salesforce operates one of the world’s most complex Kubernetes platforms, managing over 1,000 Amazon EKS clusters. Facing scalability and efficiency challenges with Cluster Autoscaler, its previous auto scaling approach, Salesforce migrated to Karpenter, an open-source Kubernetes autoscaler built by AWS. This migration reduced scaling latency from minutes to seconds and improved node utilization.
Why This Matters
Traditional Kubernetes cluster scaling often relies on manually configured node groups and auto scaling settings, which becomes unsustainable at scale. Inefficient bin-packing and slow response to demand spikes can lead to wasted resources and degraded performance. Salesforce’s previous system suffered from these inefficiencies, which created operational bottlenecks, hindered innovation, and risked significant cost overruns.
Key Insights
- 1,000+ EKS clusters: Salesforce manages over 1,000 Amazon EKS clusters.
- Karpenter transition tool: Salesforce developed an in-house tool for safe and consistent migration to Karpenter.
- 5% cost savings: Salesforce achieved 5% cost savings in FY2026 through improved bin-packing and reduced idle capacity.
Working Example
metadata:
  name: m5.8xlarge-min-300-max-2500
data:
  k8s_instance_type: m6i.8xlarge
  k8s_root_volume_size: '100'
  k8s_root_volume_iops: '3000'
  k8s_root_volume_type: 'gp3'
  k8s_root_volume_throughput: '125'
  k8s_min_node_number: '300'
  k8s_max_node_number: '2500'
  multi_az_provisioned_workers: 'false'
  asg_launch_type: 'launch_template'
  gpu_enabled: 'false'
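For comparison, the legacy node group settings above could map onto Karpenter resources roughly as follows. This is a minimal sketch, not Salesforce's actual configuration: the resource name `general-m6i`, the IAM role, the `example-cluster` discovery tag, the AMI alias, and the consolidation settings are all assumptions made for illustration, and the root volume size is assumed to be in GiB. The CPU limit approximates the 2,500-node maximum (2,500 nodes × 32 vCPUs per m6i.8xlarge = 80,000 vCPUs). The sketch has no counterpart for `k8s_min_node_number`, since Karpenter provisions nodes in response to pending pods rather than holding a fixed floor.

```yaml
# Hypothetical translation of the legacy config; all names and tags are placeholders.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: general-m6i                      # assumed name
spec:
  role: KarpenterNodeRole-example        # assumed IAM role name
  amiSelectorTerms:
    - alias: al2023@latest               # assumed AMI family
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster   # assumed discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: example-cluster   # assumed discovery tag
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi                # k8s_root_volume_size (unit assumed)
        volumeType: gp3                  # k8s_root_volume_type
        iops: 3000                       # k8s_root_volume_iops
        throughput: 125                  # k8s_root_volume_throughput
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-m6i
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: general-m6i
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.8xlarge"]        # k8s_instance_type
  limits:
    cpu: "80000"                         # ~2,500 nodes x 32 vCPUs (k8s_max_node_number)
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m                 # assumed value
```

Pinning a single instance type mirrors the legacy setup; listing several types in `values` would give Karpenter more room for bin-packing.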
Practical Applications
- Use Case: Salesforce enabled developers to self-define node pool requirements, accelerating infrastructure provisioning.
- Pitfall: Overly restrictive Pod Disruption Budgets (PDBs) can block node replacements during migration; proper PDB configuration is essential.
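To illustrate the PDB pitfall: a budget with `maxUnavailable: 0`, or with `minAvailable` equal to the replica count, permits no voluntary evictions, so a node hosting those pods cannot be drained and replaced. A minimal sketch of a budget that still allows nodes to be drained one pod at a time might look like this; the name and the `app: example-api` label are hypothetical.

```yaml
# Hypothetical PDB; allows one pod at a time to be evicted during node replacement.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api-pdb        # assumed name
spec:
  maxUnavailable: 1            # 0 here would block every voluntary eviction
  selector:
    matchLabels:
      app: example-api         # assumed workload label
```

This only helps if the workload runs more than one replica; with a single replica, evicting that pod still takes the service down briefly.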