
Karpenter v1: Smart Kubernetes Node Autoscaling Guide
Summary
Production guide to Karpenter v1: NodePools, EC2NodeClass, Spot, and consolidation on EKS.
Cluster Autoscaler served Kubernetes well for almost a decade, but the way it works — add a node by scaling an Auto Scaling Group with a fixed instance type — has been showing its age for years. Karpenter takes a different bet: forget node groups, forget pre-baked instance types, and instead let the autoscaler look at pending pods and pick the right EC2 instance in real time. In August 2024, AWS shipped Karpenter v1.0, and by 2026 it has become the de facto default node autoscaler on EKS; Salesforce alone migrated more than 1,000 EKS clusters to it in production. If you run Kubernetes on AWS and have not moved yet, this guide gets you to a working v1 setup with NodePools, EC2NodeClasses, Spot, and consolidation.
This is a hands-on production guide, not a marketing tour. We will install Karpenter on an existing EKS cluster, configure a NodePool that mixes Spot and On-Demand, watch consolidation kick in, wire up Spot interruption handling through SQS and EventBridge, and walk through the gotchas that bite teams when they roll Karpenter to production. Estimated time: 30–45 minutes if your cluster is already up.
Prerequisites
- An EKS cluster on Kubernetes 1.29 or newer with an OIDC provider configured.
- kubectl, helm 3.13+, eksctl 0.180+, and the AWS CLI configured with admin credentials.
- Permission to create IAM roles, SQS queues, and EventBridge rules in the target account.
- An existing node group (managed or self-managed) for the Karpenter controller itself — you cannot run Karpenter on the nodes it provisions.
- Familiarity with Kubernetes scheduling concepts: requests, limits, taints, node selectors.
Why Karpenter v1 changes the autoscaling story
The classic Cluster Autoscaler model is: you pre-define a set of Auto Scaling Groups, each with a single instance type and AZ, and the autoscaler decides how many of each to run. If a pod needs a different shape, you have a fleet of ASGs to manage. Karpenter inverts this. There is no ASG. The Karpenter controller watches unschedulable pods, computes their requirements (CPU, memory, GPU, architecture, AZ, capacity type), and launches a single EC2 instance that fits using the EC2 Fleet API. When the pod is gone, the node is gone too. There is no warm pool, no minimum size for a particular shape, and no scale-to-zero hack — zero is the default.
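To make that concrete, here is a minimal, hypothetical pod whose constraints Karpenter would read: the resource requests size the instance, and the node selectors (well-known Karpenter and Kubernetes labels) narrow the choice to Spot Graviton capacity.
apiVersion: v1
kind: Pod
metadata:
  name: demo                             # hypothetical workload, for illustration only
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot     # only launch Spot capacity for this pod
    kubernetes.io/arch: arm64            # Graviton instances only
  containers:
    - name: app
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
      resources:
        requests:
          cpu: "2"                       # Karpenter sizes the instance from these requests
          memory: 4Gi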
In v1.0 the API became stable: v1 for both NodePool and EC2NodeClass. Two breaking changes from beta matter for upgrades. First, the old karpenter.sh/do-not-evict and karpenter.sh/do-not-consolidate annotations are gone, replaced by a single karpenter.sh/do-not-disrupt annotation. Second, amiSelectorTerms now requires explicit aliases, tags, or IDs — there is no implicit "latest AL2023" default. If you are still on beta, audit your NodePools, EC2NodeClasses, and workload annotations for both before upgrading.
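If your manifests still carry the old annotations, the v1 equivalent is a single pod annotation; a minimal example of opting a pod out of voluntary disruption (the pod name is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                       # illustrative name
  annotations:
    karpenter.sh/do-not-disrupt: "true"    # v1 replacement for do-not-evict / do-not-consolidate
spec:
  containers:
    - name: worker
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7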
Step 1 — Provision IAM and the interruption queue
Karpenter needs two IAM roles — the controller role (manages EC2) and the node role (assumed by the instances it launches) — plus an SQS queue that receives Spot interruption and instance state-change events from EventBridge. The official Helm chart does not create these for you, so we use a small CloudFormation template shipped by the project. Set your cluster name once and reuse it as the Karpenter discovery tag everywhere.
export CLUSTER_NAME=prod-eks
export AWS_REGION=us-east-1
export KARPENTER_VERSION=1.7.0
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Download the official CFN template for v1
curl -fsSL "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml" \
> karpenter-cfn.yaml
aws cloudformation deploy \
--stack-name "Karpenter-${CLUSTER_NAME}" \
--template-file karpenter-cfn.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides "ClusterName=${CLUSTER_NAME}"
This creates KarpenterControllerRole-${CLUSTER_NAME}, KarpenterNodeRole-${CLUSTER_NAME}, and an SQS queue named ${CLUSTER_NAME}. The EventBridge rules forward four events to the queue: Spot Instance Interruption Warning, Instance Rebalance Recommendation, Instance State-change Notification, and Scheduled Change. Karpenter consumes these and proactively drains nodes before the two-minute Spot termination window closes.
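Before moving on, it is worth confirming the stack created what later steps reference. The names below follow the template's convention for this guide's cluster name; adjust if you changed parameters.
# Both roles and the interruption queue should resolve without errors
aws iam get-role --role-name "KarpenterControllerRole-${CLUSTER_NAME}" --query 'Role.Arn' --output text
aws iam get-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --query 'Role.Arn' --output text
aws sqs get-queue-url --queue-name "${CLUSTER_NAME}" --output text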
Step 2 — Authorize the node role on the cluster
The nodes that Karpenter launches need to register with the cluster, so you have to add the node role to the aws-auth ConfigMap (or, on clusters using EKS access entries, create an access entry of type EC2_LINUX). Skipping this is the single most common reason a Karpenter-launched instance never joins the cluster as a node.
eksctl create iamidentitymapping \
--cluster "${CLUSTER_NAME}" \
--region "${AWS_REGION}" \
--arn "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}" \
--username "system:node:{{EC2PrivateDNSName}}" \
--group "system:bootstrappers" \
--group "system:nodes"
On clusters that use access entries instead of aws-auth, replace the command above with aws eks create-access-entry for the node role, using type EC2_LINUX; no access policy association is needed for that entry type (a sketch follows below). Either way, verify the mapping before continuing, with kubectl describe configmap aws-auth -n kube-system or aws eks list-access-entries.
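A sketch of the access-entry path, assuming the same role naming as Step 1:
# Register the Karpenter node role as an EC2_LINUX access entry
aws eks create-access-entry \
--cluster-name "${CLUSTER_NAME}" \
--principal-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}" \
--type EC2_LINUX
# Confirm it landed
aws eks list-access-entries --cluster-name "${CLUSTER_NAME}"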
Step 3 — Install the Karpenter controller via Helm
Karpenter ships an OCI Helm chart. We install it into the kube-system namespace because cluster operators typically have stricter PodSecurity rules there, and the controller needs a stable home that is not itself managed by Karpenter. Crucially, you must pin the controller to your existing managed node group — not to nodes Karpenter provisions — with a node selector and tolerations.
helm registry logout public.ecr.aws 2>/dev/null || true
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--version "${KARPENTER_VERSION}" \
--namespace kube-system \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi \
--set "nodeSelector.karpenter\.sh/controller=true" \
--wait
kubectl -n kube-system rollout status deployment karpenter
Label the existing managed node group nodes with karpenter.sh/controller=true so the deployment lands on them (commands below). With the node selector in place, the controller pods will sit Pending until that label exists; if you drop the selector instead, Karpenter can end up hosting its own controller on a node it provisioned, which creates a chicken-and-egg startup problem after a cold cluster restart.
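Two ways to get that label in place; the node group name is a placeholder for your own. Labels added with kubectl do not survive node replacement, so prefer putting the label on the managed node group itself.
# Quick but non-durable: label the current controller nodes directly
kubectl label nodes -l "eks.amazonaws.com/nodegroup=<your-mng-name>" karpenter.sh/controller=true
# Durable: add the label to the managed node group so replacement nodes inherit it
aws eks update-nodegroup-config \
--cluster-name "${CLUSTER_NAME}" \
--nodegroup-name <your-mng-name> \
--labels 'addOrUpdateLabels={karpenter.sh/controller=true}'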
Step 4 — Define your first NodePool
A NodePool describes how Karpenter is allowed to provision nodes — which architectures, instance categories, capacity types, AZs, and which workloads it serves. Think of it as a set of constraints, not a fixed shape. Karpenter will pick the cheapest instance from the EC2 Fleet API that satisfies the constraints and your pending pods' requirements. Below is a sane production starting point: amd64 + arm64, m/c/r families, generation 6 or newer, no tiny or metal sizes, Spot preferred over On-Demand, with consolidation enabled.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small", "metal"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000
    memory: 1000Gi
  weight: 10
A few things deserve attention. expireAfter: 720h recycles every node every 30 days, which is how Karpenter rolls AMI upgrades into your fleet without you running a chaos experiment. consolidationPolicy: WhenEmptyOrUnderutilized is new in v1 — it is the union of the old WhenEmpty and WhenUnderutilized policies. limits is your safety net: in a runaway pod loop, Karpenter stops launching new nodes once the capacity this pool has provisioned reaches 1000 CPUs or 1000 Gi of memory. Do not deploy without these.
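Once the pool is live, you can watch how much capacity it has provisioned against those limits; the NodePool status aggregates the resources of its nodes.
# Compare against spec.limits; Karpenter stops launching when these meet the caps
kubectl get nodepool default -o jsonpath='{.status.resources}{"\n"}'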
Step 5 — Define an EC2NodeClass
Where the NodePool describes the contract with workloads, the EC2NodeClass describes the AWS-side configuration: which AMI, subnets, security groups, instance profile, and user data. This separation is the single biggest reason v1 is cleaner than beta — you can swap AMIs across many NodePools by editing one EC2NodeClass.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@v20260415 # pin a specific AMI version!
  role: KarpenterNodeRole-prod-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 1 # 1, not 2 — forces IRSA
    httpTokens: required
  detailedMonitoring: false
  tags:
    Cluster: prod-eks
    ManagedBy: Karpenter
The most important line in the entire guide: alias: al2023@v20260415. Earlier examples used alias: al2023@latest — do not do that in production. AWS published a best-practice memo after a 2025 incident where an AL2023 update changed kernel behavior and rolled production-wide via Karpenter's drift detection in under an hour. Pin a known-good AMI, test newer versions in a staging cluster with a separate EC2NodeClass, and bump the alias on a schedule.
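One way to structure that staging test is a second EC2NodeClass pinned to the candidate version and referenced only by a staging NodePool. The alias below is a hypothetical newer build used purely for illustration.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ami-canary              # referenced by a staging NodePool only
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@v20260501   # candidate AMI version under test (hypothetical)
  role: KarpenterNodeRole-prod-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks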
Step 6 — Tag your subnets and security groups
Karpenter discovers subnets and SGs via tag selectors, which means anything not tagged is invisible to it. This is good — it forces you to be explicit — but it surprises people who expect autodiscovery.
# Tag the private subnets where Karpenter is allowed to launch nodes
aws ec2 create-tags \
--resources subnet-aaaa subnet-bbbb subnet-cccc \
--tags Key=karpenter.sh/discovery,Value=prod-eks
# Tag the cluster security group
CLUSTER_SG=$(aws eks describe-cluster --name prod-eks \
--query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)
aws ec2 create-tags \
--resources "${CLUSTER_SG}" \
--tags Key=karpenter.sh/discovery,Value=prod-eks
Apply the manifests and watch them reconcile.
kubectl apply -f nodepool.yaml
kubectl apply -f ec2nodeclass.yaml
kubectl get nodepool default -o yaml | grep -A5 status:
kubectl get ec2nodeclass default -o yaml | grep -A20 status:
Step 7 — Trigger your first scale-up
Deploy a deliberately oversized inflate workload to force a scale event. Watch Karpenter pick an instance type from EC2 Fleet, label the node, and bind the pods.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector: { matchLabels: { app: inflate } }
  template:
    metadata: { labels: { app: inflate } }
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests: { cpu: "1", memory: "1.5Gi" }
EOF
kubectl scale deployment inflate --replicas=20
# Watch Karpenter logs and node creation
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter -f &
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type -w
Within 30–60 seconds you should see one or two new nodes attach, typically large Spot instances from the c or m families, rather than twenty small ones. Karpenter bin-packs the 20 pods onto whatever it found cheapest — it does not blindly launch 20 nodes. Scale the Deployment back to 0 and within ~30 seconds (your consolidateAfter) the nodes are gone.
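To see exactly what Karpenter decided, inspect the NodeClaim it created; the instance type, capacity type, and zone show up as labels:
kubectl get nodeclaims
kubectl describe nodeclaim <nodeclaim-name> | grep -iE "instance-type|capacity-type|zone"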
Step 8 — Verify Spot interruption handling
The SQS queue created in Step 1 only matters if events actually flow into it. The cheapest test is to use the AWS Fault Injection Service to simulate a Spot interruption on one of your Karpenter nodes; if FIS is not available, watch the queue while real Spot churn happens.
# Confirm Karpenter is consuming events
aws sqs get-queue-attributes \
--queue-url "https://sqs.${AWS_REGION}.amazonaws.com/${AWS_ACCOUNT_ID}/${CLUSTER_NAME}" \
--attribute-names ApproximateNumberOfMessages \
ApproximateNumberOfMessagesNotVisible
# Force a Spot interruption with FIS (one node)
aws fis start-experiment --experiment-template-id <your-template-id>
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter \
| grep -iE "spot interruption|drain|terminat"
Karpenter's expected behavior: receive the interruption warning, cordon the node, evict pods (respecting PDBs), launch a replacement, then delete the node before the two-minute window closes. If the queue is filling up but Karpenter is not draining, double-check that the controller IAM role has sqs:ReceiveMessage and sqs:DeleteMessage on the queue; one way to check is shown below.
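The inline policy name comes from the CloudFormation stack, so list it first rather than guessing; then grep the policy document for the SQS actions:
aws iam list-role-policies --role-name "KarpenterControllerRole-${CLUSTER_NAME}"
aws iam get-role-policy \
--role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
--policy-name <policy-name-from-previous-command> \
--output json | grep -iE "sqs:(ReceiveMessage|DeleteMessage|GetQueueAttributes|GetQueueUrl)"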
Common pitfalls and how to avoid them
- Pinning AMIs is non-negotiable. Use amiSelectorTerms.alias with a dated version, never @latest. AMI drift in Karpenter is fast — minutes, not hours.
- Set limits on every NodePool. A bad HPA + Karpenter combination can spend thousands of dollars in an hour. Treat the limits block as a billing safety net, not an optimization.
- Do not run Karpenter on Karpenter-provisioned nodes. Pin the controller to a small managed node group via nodeSelector. After a cluster restart, the controller has to come up before it can provision nodes — if it is hosted on its own provisioned nodes, you have a deadlock.
- PodDisruptionBudgets matter more than ever. Consolidation is aggressive: a NodePool set to WhenEmptyOrUnderutilized will move pods around constantly. Without PDBs, your stateful workloads will get drained mid-flush (see the example after this list).
- Kube-proxy and the CNI must tolerate your node taints. If you bootstrap nodes with custom taints or startupTaints, make sure every DaemonSet your nodes need at boot tolerates them, or nodes will sit in NotReady.
- Avoid mixing instance generations 5 and 6 in the same NodePool for IO-bound workloads. The EBS attachment limits differ, and Karpenter does not factor that into bin packing for non-CPU/memory resources.
- Watch karpenter_nodes_termination_time_seconds. If termination consistently exceeds your terminationGracePeriodSeconds, your pods are being killed during eviction. Tune terminationGracePeriod on the NodePool to match.
- Karpenter does not respect Cluster Autoscaler annotations. If you are migrating, audit pods for cluster-autoscaler.kubernetes.io/safe-to-evict and translate them to karpenter.sh/do-not-disrupt.
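For the PodDisruptionBudget bullet above, a minimal example that keeps consolidation from draining more than one replica of a service at a time (names are illustrative):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: redis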
Quick reference
| Concept | v1 API | Notes |
|---|---|---|
| NodePool | karpenter.sh/v1 | Workload-facing constraints. Set limits. |
| EC2NodeClass | karpenter.k8s.aws/v1 | AWS-side configuration. Pin AMI alias. |
| Disable disruption on a pod | karpenter.sh/do-not-disrupt: true | Annotation, not label, in v1. |
| Consolidation policy | WhenEmptyOrUnderutilized | Replaces WhenEmpty + WhenUnderutilized. |
| Spot interruption queue | settings.interruptionQueue | Required for Spot drain to work. |
| AMI pinning | alias: al2023@v20260415 | Never use @latest in production. |
| Cost ceiling | spec.limits.cpu / memory | Hard cap on cluster size per NodePool. |
Next steps
Once your default NodePool is steady, the natural extensions are: (1) split workloads across multiple NodePools with weights so latency-sensitive services get On-Demand and batch jobs ride pure Spot, (2) wire topology spread constraints so Karpenter rebalances across AZs during consolidation, and (3) export the Karpenter Prometheus metrics to your monitoring stack and alert on karpenter_pods_startup_time_seconds, karpenter_nodeclaims_disrupted_total, and queue depth on the interruption queue.
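As a sketch of the first item, a Spot-only pool that batch jobs opt into via a toleration; the taint key, weight, and limits below are illustrative, not prescriptive:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  weight: 1                        # lower priority than the default pool's weight 10
  template:
    spec:
      taints:
        - key: workload-type       # batch pods must carry a matching toleration
          value: batch
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 500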
If you are running multi-tenant clusters, look into taints plus dedicated NodePools per team, and into the kubelet block, which in v1 moved from the NodePool to the EC2NodeClass (spec.kubelet) and lets you pin maxPods, eviction thresholds, and reserved resources per node class. Once Karpenter is the only autoscaler in your cluster, delete the lingering Cluster Autoscaler deployment so the two are not competing over the same pending pods.
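A fragment showing just that kubelet block, with illustrative values; merge it into a full EC2NodeClass like the one in Step 5:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: team-a                # hypothetical per-team node class
spec:
  kubelet:
    maxPods: 58               # cap pods per node regardless of instance size
    systemReserved:
      cpu: 100m
      memory: 256Mi
    evictionHard:
      memory.available: 5%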