
Karpenter v1: Smart Kubernetes Node Autoscaling Guide

Kodetra Technologies · 10 min read · Advanced

Summary

Production guide to Karpenter v1: NodePools, EC2NodeClass, Spot, and consolidation on EKS.

Cluster Autoscaler served Kubernetes well for almost a decade, but the way it works — add a node by scaling an Auto Scaling Group with a fixed instance type — has been showing its age for years. Karpenter takes a different bet: forget Node Groups, forget pre-baked instance types, and instead let the autoscaler look at pending pods and pick the right EC2 instance in real time. In August 2024, AWS shipped Karpenter v1.0, and by 2026 it has become the default node autoscaler on EKS. Salesforce migrated more than 1,000 EKS clusters to it in production. If you run Kubernetes on AWS and have not moved yet, this guide gets you to a working v1 setup with NodePools, EC2NodeClasses, Spot, and consolidation.

This is a hands-on production guide, not a marketing tour. We will install Karpenter on an existing EKS cluster, configure a NodePool that mixes Spot and On-Demand, watch consolidation kick in, wire up Spot interruption handling through SQS and EventBridge, and walk through the gotchas that bite teams when they roll Karpenter to production. Estimated time: 30–45 minutes if your cluster is already up.

Prerequisites

  • An EKS cluster on Kubernetes 1.29 or newer with an OIDC provider configured.
  • kubectl, helm 3.13+, eksctl 0.180+, and the AWS CLI configured with admin credentials.
  • Permission to create IAM roles, SQS queues, and EventBridge rules in the target account.
  • An existing node group (managed or self-managed) for the Karpenter controller itself — you cannot run Karpenter on the nodes it provisions.
  • Familiarity with Kubernetes scheduling concepts: requests, limits, taints, node selectors.

Why Karpenter v1 changes the autoscaling story

The classic Cluster Autoscaler model is: you pre-define a set of Auto Scaling Groups, each with a single instance type and AZ, and the autoscaler decides how many of each to run. If a pod needs a different shape, you have a fleet of ASGs to manage. Karpenter inverts this. There is no ASG. The Karpenter controller watches unschedulable pods, computes their requirements (CPU, memory, GPU, architecture, AZ, capacity type), and launches a single EC2 instance that fits using the EC2 Fleet API. When the pod is gone, the node is gone too. There is no warm pool, no minimum size for a particular shape, and no scale-to-zero hack — zero is the default.
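To make the inversion concrete: under any NodePool that allows arm64 and Spot, a pod like this hypothetical one is all it takes to get a right-sized arm64 Spot node — no pre-created group for that shape exists anywhere.

```yaml
# Hypothetical pod: its scheduling constraints, not any node group,
# decide what Karpenter launches. Assumes a NodePool permitting
# arm64 + spot (like the one defined in Step 4).
apiVersion: v1
kind: Pod
metadata:
  name: arm-spot-worker          # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
    karpenter.sh/capacity-type: spot
  containers:
    - name: worker
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```

Delete the pod and, with consolidation enabled, the node it triggered disappears too.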

In v1.0 the API became stable: v1 for both NodePool and EC2NodeClass. Two breaking changes from beta matter for upgrades. First, the alpha annotations karpenter.sh/do-not-evict and karpenter.sh/do-not-consolidate were merged into a single karpenter.sh/do-not-disrupt annotation. Second, amiSelectorTerms now requires explicit alias, ID, or tag terms — there is no implicit latest AL2023. If you are on beta, run kubectl get nodepool -o yaml first and fix these before upgrading.
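In practice the migration looks like this — a sketch of a pod (name hypothetical) that opts out of disruption under v1:

```yaml
# v1: one annotation replaces the old do-not-evict / do-not-consolidate pair.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker             # hypothetical
  annotations:
    karpenter.sh/do-not-disrupt: "true"   # annotation, not a label
spec:
  containers:
    - name: worker
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
```

The annotation blocks voluntary disruption (consolidation, drift, expiry) but not Spot interruptions, which are involuntary.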

Step 1 — Provision IAM and the interruption queue

Karpenter needs two IAM roles — the controller role (manages EC2) and the node role (assumed by the instances it launches) — plus an SQS queue that receives Spot interruption and instance state-change events from EventBridge. The official Helm chart does not create these for you, so we use a small CloudFormation template shipped by the project. Set your cluster name once and reuse it as the Karpenter discovery tag everywhere.

export CLUSTER_NAME=prod-eks
export AWS_REGION=us-east-1
export KARPENTER_VERSION=1.7.0
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Download the official CFN template for v1
curl -fsSL "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml" \
  > karpenter-cfn.yaml

aws cloudformation deploy \
  --stack-name "Karpenter-${CLUSTER_NAME}" \
  --template-file karpenter-cfn.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides "ClusterName=${CLUSTER_NAME}"

This creates KarpenterControllerRole-${CLUSTER_NAME}, KarpenterNodeRole-${CLUSTER_NAME}, and an SQS queue named ${CLUSTER_NAME}. The EventBridge rules forward four events to the queue: Spot Instance Interruption Warning, Instance Rebalance Recommendation, Instance State-change Notification, and Scheduled Change. Karpenter consumes these and proactively drains nodes before the two-minute Spot termination window closes.

Step 2 — Authorize the node role on the cluster

The nodes that Karpenter launches need to register with the cluster, so you have to add the node role to the aws-auth ConfigMap (or, on clusters that use EKS access entries, create an access entry of type EC2_LINUX). Skipping this is the single most common reason a Karpenter-launched node never joins the cluster: the instance runs, but the kubelet cannot authenticate, and the NodeClaim is eventually garbage-collected.

eksctl create iamidentitymapping \
  --cluster "${CLUSTER_NAME}" \
  --region  "${AWS_REGION}" \
  --arn     "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}" \
  --username "system:node:{{EC2PrivateDNSName}}" \
  --group   "system:bootstrappers" \
  --group   "system:nodes"

On clusters that use access entries instead of aws-auth, replace the command above with aws eks create-access-entry --type EC2_LINUX for the node role ARN; the EC2_LINUX entry type carries the node permissions, so no separate policy association is required. Either way, verify the mapping — kubectl describe configmap aws-auth -n kube-system, or aws eks list-access-entries — before continuing.

Step 3 — Install the Karpenter controller via Helm

Karpenter ships an OCI Helm chart. We install it into the kube-system namespace because cluster operators typically have stricter PodSecurity rules there, and the controller needs a stable home that is not itself managed by Karpenter. Crucially, you must pin the controller to your existing managed node group — not to nodes Karpenter provisions — with a node selector and tolerations.

helm registry logout public.ecr.aws 2>/dev/null || true

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --set "nodeSelector.karpenter\.sh/controller=true" \
  --wait

kubectl -n kube-system rollout status deployment karpenter

Label the existing managed node group nodes with karpenter.sh/controller=true (kubectl label node <node-name> karpenter.sh/controller=true, or set the label in the node group configuration so replacement nodes inherit it) so the deployment lands on them. If you skip this, Karpenter can end up scheduling its own controller pod on a Karpenter-provisioned node, which creates a chicken-and-egg startup problem after a cold cluster restart.

Step 4 — Define your first NodePool

A NodePool describes how Karpenter is allowed to provision nodes — which architectures, instance categories, capacity types, AZs, and which workloads it serves. Think of it as a set of constraints, not a fixed shape. Karpenter will pick the cheapest instance from the EC2 Fleet API that satisfies the constraints and your pending pod's requirements. Below is a sane production starting point: amd64 + arm64, m/c/r families, generation 6 or newer, no nano/micro/small or metal sizes, Spot preferred over On-Demand, with consolidation enabled.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano", "micro", "small", "metal"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind:  EC2NodeClass
        name:  default
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter:    30s
  limits:
    cpu:    1000
    memory: 1000Gi
  weight: 10

A few things deserve attention. expireAfter: 720h recycles every node every 30 days, which is how Karpenter rolls AMI upgrades into your fleet without you running a chaos experiment. consolidationPolicy: WhenEmptyOrUnderutilized is new in v1 — it is the union of the old WhenEmpty and WhenUnderutilized policies. limits is your safety net: in a runaway pod loop, Karpenter stops launching nodes for this pool once its provisioned capacity reaches 1000 CPUs or 1000 Gi of memory. Note that limits are per NodePool, not cluster-wide. Do not deploy without them.

Step 5 — Define an EC2NodeClass

Where the NodePool describes the contract with workloads, the EC2NodeClass describes the AWS-side configuration: which AMI, subnets, security groups, instance profile, and user data. This separation is the single biggest reason v1 is cleaner than beta — you can swap AMIs across many NodePools by editing one EC2NodeClass.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@v20260415        # pin a specific AMI version!
  role: KarpenterNodeRole-prod-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 1     # 1, not 2 — forces IRSA
    httpTokens: required
  detailedMonitoring: false
  tags:
    Cluster: prod-eks
    ManagedBy: Karpenter

The most important line in the entire guide: alias: al2023@v20260415. Earlier examples used alias: al2023@latest — do not do that in production. AWS published a best-practice memo after a 2025 incident where an AL2023 update changed kernel behavior and rolled production-wide via Karpenter's drift detection in under an hour. Pin a known-good AMI, test newer versions in a staging cluster with a separate EC2NodeClass, and bump the alias on a schedule.
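One way to follow that advice — a sketch, with hypothetical name and alias version — is a second EC2NodeClass in the staging cluster that tracks a newer AMI, promoted to production only after it has soaked:

```yaml
# Hypothetical canary EC2NodeClass: referenced only by a staging NodePool.
# Bump the alias here first; promote to the production class after soak.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: canary                    # hypothetical name
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@v20260501     # hypothetical newer version under test
  role: KarpenterNodeRole-prod-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eks
```

Because NodePools reference the class by name, swapping a pool between default and canary is a one-line change.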

Step 6 — Tag your subnets and security groups

Karpenter discovers subnets and SGs via tag selectors, which means anything not tagged is invisible to it. This is good — it forces you to be explicit — but it surprises people who expect autodiscovery.

# Tag the private subnets where Karpenter is allowed to launch nodes
aws ec2 create-tags \
  --resources subnet-aaaa subnet-bbbb subnet-cccc \
  --tags Key=karpenter.sh/discovery,Value=prod-eks

# Tag the cluster security group
CLUSTER_SG=$(aws eks describe-cluster --name prod-eks \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)
aws ec2 create-tags \
  --resources "${CLUSTER_SG}" \
  --tags Key=karpenter.sh/discovery,Value=prod-eks

Apply the manifests and watch them reconcile.

kubectl apply -f nodepool.yaml
kubectl apply -f ec2nodeclass.yaml
kubectl get nodepool default -o yaml | grep -A5 status:
kubectl get ec2nodeclass default -o yaml | grep -A20 status:

Step 7 — Trigger your first scale-up

Deploy a deliberately oversized inflate workload to force a scale event. Watch Karpenter pick an instance type from EC2 Fleet, label the node, and bind the pods.

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector: { matchLabels: { app: inflate } }
  template:
    metadata: { labels: { app: inflate } }
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests: { cpu: "1", memory: "1.5Gi" }
EOF

kubectl scale deployment inflate --replicas=20

# Watch Karpenter logs and node creation
kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter -f &
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type -w

Within 30–60 seconds you should see nodes attach — for 20 pods requesting a CPU each, typically one or two large Spot instances (something like a c7i.8xlarge) rather than 20 small nodes; Karpenter packs the pods onto whatever it found cheapest. Scale the Deployment back to 0 and within ~30 seconds (your consolidateAfter), the node is gone.

Step 8 — Verify Spot interruption handling

The SQS queue created in Step 1 only matters if events actually flow into it. The cheapest test is to use the AWS Fault Injection Service to simulate a Spot interruption on one of your Karpenter nodes; if FIS is not available, watch the queue while real Spot churn happens.

# Confirm Karpenter is consuming events
aws sqs get-queue-attributes \
  --queue-url "https://sqs.${AWS_REGION}.amazonaws.com/${AWS_ACCOUNT_ID}/${CLUSTER_NAME}" \
  --attribute-names ApproximateNumberOfMessages \
                    ApproximateNumberOfMessagesNotVisible

# Force a Spot interruption with FIS (one node)
aws fis start-experiment --experiment-template-id <your-template-id>

kubectl -n kube-system logs -l app.kubernetes.io/name=karpenter \
  | grep -iE "spot interruption|drain|terminat"

Karpenter's expected behavior: receive the interruption warning, cordon the node, evict pods (respecting PDBs), launch a replacement, then delete the node before the two-minute window. If the queue is filling up but Karpenter is not draining, double-check the controller IAM role has sqs:DeleteMessage and sqs:ReceiveMessage on the queue ARN.
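For reference, the minimal SQS statement on the controller role looks roughly like this (region, account ID, and queue name are hypothetical placeholders; the official CloudFormation template from Step 1 already includes an equivalent):

```json
{
  "Effect": "Allow",
  "Action": [
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage",
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl"
  ],
  "Resource": "arn:aws:sqs:us-east-1:111122223333:prod-eks"
}
```

If only ReceiveMessage is present, Karpenter will see events but never acknowledge them, and the queue depth climbs forever.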


Common pitfalls and how to avoid them

  • Pinning AMIs is non-negotiable. Use amiSelectorTerms.alias with a dated tag, never @latest. AMI drift in Karpenter is fast — minutes, not hours.
  • Set limits on every NodePool. A bad HPA + Karpenter combination can spend thousands of dollars in an hour. Treat the limits block as a billing safety net, not an optimization.
  • Do not run Karpenter on Karpenter-provisioned nodes. Pin the controller to a small managed node group via nodeSelector. After a cluster restart, the controller has to come up before it can provision nodes — if it is hosted on its own provisioned nodes, you have a deadlock.
  • PodDisruptionBudgets matter more than ever. Consolidation is aggressive: a NodePool set to WhenEmptyOrUnderutilized will move pods around constantly. Without PDBs, your stateful workloads will get drained mid-flush.
  • Kube-proxy and CNI must tolerate your startup taints. Karpenter registers nodes with its own karpenter.sh/unregistered taint and removes it after registration, but if you add custom startupTaints in the NodePool, make sure core DaemonSets tolerate them, or nodes will sit in NotReady.
  • Avoid mixing instance generations 5 and 6 in the same NodePool when running io-bound workloads. The EBS attachment limits differ; Karpenter does not factor that into bin packing for non-CPU/memory resources.
  • Watch karpenter_nodes_termination_time_seconds. If termination consistently exceeds your terminationGracePeriodSeconds, your pods are being killed during eviction. Tune terminationGracePeriod on the NodePool to match.
  • Karpenter does not respect Cluster Autoscaler annotations. If you are migrating, audit pods for cluster-autoscaler.kubernetes.io/safe-to-evict and translate to karpenter.sh/do-not-disrupt.
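The PDB point is worth making concrete. A minimal sketch for a stateful workload (name and label hypothetical) that keeps consolidation from draining more than one replica at a time:

```yaml
# Hypothetical PDB: consolidation respects this and evicts at most
# one matching pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb                # hypothetical
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: redis                 # hypothetical label
```

With this in place, a consolidation pass that would violate the budget simply waits instead of draining the node.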

Quick reference

Concept                      | v1 API / value                     | Notes
NodePool                     | karpenter.sh/v1                    | Workload-facing constraints. Set limits.
EC2NodeClass                 | karpenter.k8s.aws/v1               | AWS-side configuration. Pin AMI alias.
Disable disruption on a pod  | karpenter.sh/do-not-disrupt: true  | Annotation, not label, in v1.
Consolidation policy         | WhenEmptyOrUnderutilized           | Replaces WhenEmpty + WhenUnderutilized.
Spot interruption queue      | settings.interruptionQueue         | Required for Spot drain to work.
AMI pinning                  | alias: al2023@v20260415            | Never use @latest in production.
Cost ceiling                 | spec.limits.cpu / memory           | Hard cap on cluster size per NodePool.

Next steps

Once your default NodePool is steady, the natural extensions are: (1) split workloads across multiple NodePools with weights so latency-sensitive services get On-Demand and batch jobs ride pure Spot, (2) wire topology spread constraints so Karpenter rebalances across AZs during consolidation, and (3) export the Karpenter Prometheus metrics to your monitoring stack and alert on karpenter_pods_startup_time_seconds, karpenter_nodeclaims_disrupted_total, and queue depth on the interruption queue.
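Point (1) can be sketched as two pools — names and the batch taint are hypothetical; Karpenter considers higher-weight pools first:

```yaml
# Hypothetical split: On-Demand pool for services (tried first),
# Spot pool for batch jobs that tolerate the workload-type taint.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-services       # hypothetical
spec:
  weight: 50                     # higher weight is evaluated first
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-batch               # hypothetical
spec:
  weight: 10
  template:
    spec:
      taints:
        - key: workload-type     # hypothetical taint batch pods must tolerate
          value: batch
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

Both pools can share the same EC2NodeClass; only the workload-facing constraints differ.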

If you are running multi-tenant clusters, look into taints + dedicated NodePools per team and the v1 spec.template.spec.kubelet block, which lets you pin maxPods, eviction thresholds, and system/kube reserved resources per pool. Once Karpenter is the only autoscaler in your cluster, delete the lingering Cluster Autoscaler deployment and reclaim its node.
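A minimal sketch of that kubelet block, with hypothetical pool name and values:

```yaml
# Hypothetical per-team pool with pinned kubelet settings.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: team-data                # hypothetical
spec:
  template:
    spec:
      kubelet:
        maxPods: 58              # cap pods regardless of ENI capacity
        evictionHard:
          memory.available: "5%"
        systemReserved:
          cpu: 100m
          memory: 256Mi
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```

In beta these settings lived on the EC2NodeClass; moving them onto the NodePool is what makes per-team tuning possible without duplicating AWS-side config.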
