StatefulSets are Kubernetes objects used to consistently deploy stateful application components. Pods created as part of a StatefulSet are given persistent identifiers that they retain even when they’re rescheduled.
A StatefulSet can deploy applications that need to reliably identify specific replicas, roll out updates in a pre-defined order, or stably access storage volumes. They’re applicable to many different use cases but are most commonly used for databases and other types of persistent data store.
In this article you’ll learn what StatefulSets are, how they work, and when you should use them. We’ll also cover their limitations and the situations where other Kubernetes objects are a better choice.
What Are StatefulSets?
Making Pods part of a StatefulSet instructs Kubernetes to schedule and scale them in a guaranteed manner. Each Pod gets allocated a unique identity which any replacement Pods retain.
The Pod name is suffixed with an ordinal index that defines its order during scheduling operations. A StatefulSet called mysql containing three replicas will create Pods named mysql-0, mysql-1, and mysql-2.
Pods use their names as their hostname, so other services that need to reliably access the third replica of the StatefulSet can connect to mysql-2. Even if the specific Pod that runs mysql-2 gets rescheduled later on, its identity will pass to its replacement.
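Each Pod also gets a stable DNS record of the form pod-name.service-name (fully qualified as pod-name.service-name.namespace.svc.cluster.local), provided by the headless service that governs the StatefulSet. As a hypothetical sketch, a client Pod could target a specific replica by hostname; the mysql service name and the default namespace here are assumptions matching this article's later example:

```yaml
# Hypothetical client Pod that always talks to the same replica.
# "mysql-2.mysql" resolves via the headless Service named "mysql";
# the long form is mysql-2.mysql.default.svc.cluster.local.
apiVersion: v1
kind: Pod
metadata:
  name: mysql-client
spec:
  containers:
    - name: client
      image: mysql:8.0
      command: ["mysql", "-h", "mysql-2.mysql", "-e", "SELECT 1"]
```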
StatefulSets also enforce that Pods are removed in reverse order of their creation. If the StatefulSet is scaled down to one replica, mysql-2 is guaranteed to exit first, followed by mysql-1. This behavior doesn’t apply when the entire StatefulSet is deleted, and it can be disabled by setting a StatefulSet’s
podManagementPolicy field to Parallel.
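A minimal fragment showing that setting, assuming the same mysql StatefulSet as the rest of this article:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  # Parallel launches and terminates Pods all at once instead of one
  # by one; the default value is OrderedReady.
  podManagementPolicy: Parallel
  ...
```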
StatefulSet Use Cases
StatefulSets are normally used to run replicated applications where individual Pods have different roles. As an example, you could be deploying a MySQL database with a primary instance and two read-only replicas. A regular ReplicaSet or Deployment would not be appropriate because you couldn’t reliably identify the Pod running the primary replica.
StatefulSets address this by guaranteeing that each Pod in the set maintains its identity. Your other services can reliably connect to
mysql-0 to interact with the primary replica. StatefulSets also enforce that new Pods are only started when the previous Pod is running. This ensures the read-only replicas get created after the primary is up and ready to expose its data.
The purpose of StatefulSets is to accommodate non-interchangeable replicas inside Kubernetes. Whereas Pods in a stateless application are equivalent to each other, stateful workloads require an intentional approach to rollouts, scaling, and termination.
StatefulSets integrate with persistent volumes to support storage that sticks to each replica. Each Pod gets access to its own volume that will be automatically reattached when the replica is rescheduled to another node.
Creating a StatefulSet
Here’s an example YAML manifest that defines a StatefulSet for running MySQL with a primary node and two replicas:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
    - name: mysql
      port: 3306
  clusterIP: None
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
        - name: mysql-init
          image: mysql:8.0
          command:
            - bash
            - "-c"
            - |
              set -ex
              [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
              ordinal=${BASH_REMATCH[1]}
              echo [mysqld] > /mnt/conf/server-id.cnf
              # MySQL doesn't allow "0" as a `server-id` so we have to add 1 to the Pod's index
              echo server-id=$((1 + $ordinal)) >> /mnt/conf/server-id.cnf
              if [[ $ordinal -eq 0 ]]; then
                printf "[mysqld]\nlog-bin" > /mnt/conf/primary.cnf
              else
                printf "[mysqld]\nsuper-read-only" > /mnt/conf/replica.cnf
              fi
          volumeMounts:
            - name: config
              mountPath: /mnt/conf
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ALLOW_EMPTY_PASSWORD
              value: "1"
          ports:
            - name: mysql
              containerPort: 3306
          volumeMounts:
            - name: config
              mountPath: /etc/mysql/conf.d
            - name: data
              mountPath: /var/lib/mysql
              subPath: mysql
          livenessProbe:
            exec:
              command: ["mysqladmin", "ping"]
            initialDelaySeconds: 30
            periodSeconds: 5
            timeoutSeconds: 5
          readinessProbe:
            exec:
              command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 1
      volumes:
        - name: config
          emptyDir: {}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```
This is quite a long manifest, so let’s unpack what happens.
- A headless service is created by setting its clusterIP field to None. This is tied to the StatefulSet and provides the network identities for its Pods.
- A StatefulSet is created to hold the MySQL Pods. The replicas field specifies that three Pods will run. The headless service is referenced by the serviceName field.
- Within the StatefulSet, an init container is created that pre-populates a file inside a config directory mounted using a shared volume. The container runs a Bash script that establishes the ordinal index of the running Pod. When the index is 0, the Pod is the first to be created within the StatefulSet, so it becomes the MySQL primary node. The other Pods are configured as replicas. The appropriate config file gets written into the volume, where it’ll be accessible to the MySQL container later on.
- The MySQL container is created with the config volume mounted to the correct MySQL directory. This ensures the MySQL instance gets configured as either the primary or a replica, depending on whether it’s the first Pod to start in the StatefulSet.
- Liveness and readiness probes are used to detect when the MySQL instance is ready. This prevents successive Pods in the StatefulSet from starting until the previous one is Running and Ready, ensuring MySQL replicas don’t exist before the primary node is up.
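The ordinal parsing performed by the init container can be tried outside the cluster. This sketch hard-codes a Pod name in place of reading the real hostname:

```shell
#!/usr/bin/env bash
# Stand-in for the Pod hostname; inside the cluster this comes from `hostname`.
pod_name="mysql-2"

# Extract the trailing ordinal index, exactly as the init script does.
[[ $pod_name =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}   # captures just the digits, here "2"

# MySQL doesn't allow a server-id of 0, so offset the index by one.
echo "server-id=$((1 + ordinal))"
```

Running this prints server-id=3, matching what the third replica would write into its server-id.cnf file.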
An ordinary Deployment or ReplicaSet could not implement this workflow. Once your Pods have started, you can scale the StatefulSet up or down without risking the destruction of the MySQL primary node. Kubernetes provides a guarantee that the established Pod order will be respected.
```sh
# Create the MySQL StatefulSet
$ kubectl apply -f mysql-statefulset.yaml

# Scale up to 5 Pods - a MySQL primary and 4 MySQL replicas
$ kubectl scale statefulset mysql --replicas=5
```
StatefulSets implement rolling updates when you change their specification. The StatefulSet controller will replace each Pod in sequential reverse order, using the persistently assigned ordinal indexes. mysql-2 will be deleted and replaced first, followed by mysql-1; mysql-1 won’t get updated until the new mysql-2 Pod transitions to the Ready state.
The rolling update mechanism includes support for staged deployments too. Setting the
.spec.updateStrategy.rollingUpdate.partition field in your StatefulSet’s manifest instructs Kubernetes to only update the Pods with an ordinal index greater than or equal to the given partition.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  updateStrategy:
    rollingUpdate:
      partition: 1
  template:
    ...
  volumeClaimTemplates:
    ...
```
In this example, only Pods indexed 1 or higher will be targeted by update operations. The first Pod in the StatefulSet won’t receive a new specification until the partition is lowered or removed.
StatefulSet Limitations
StatefulSets have some limitations you should be aware of before you adopt them. These common gotchas can trip you up when you start deploying stateful applications.
- Deleting a StatefulSet does not guarantee the Pods will be terminated in the order indicated by their identities.
- Deleting a StatefulSet or scaling down its replica count will not delete any associated volumes. This guards against accidental data loss.
- Using rolling updates can leave the StatefulSet stuck in a broken state. This happens when you supply a configuration that never transitions to the Running or Ready state because of a problem with your application. Reverting to a good configuration won’t fix the problem because Kubernetes waits indefinitely for the bad Pod to become Ready. You have to manually resolve the situation by deleting the pending or failed Pods.
StatefulSets also lack a mechanism for resizing the volumes linked to each Pod. You have to manually edit each persistent volume claim, then delete the StatefulSet while orphaning its Pods. Creating a new StatefulSet with the revised specification will allow Kubernetes to adopt the orphaned Pods and resize the volumes.
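A hedged sketch of that workflow with kubectl. The claim name data-mysql-0 is an assumption based on this article's manifest (claims follow the template-name-pod-name pattern), and in-place PVC expansion only works when the StorageClass allows it:

```sh
# Grow one claim in place (repeat for each replica's claim)
kubectl patch pvc data-mysql-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}'

# Delete the StatefulSet object while leaving its Pods running
kubectl delete statefulset mysql --cascade=orphan

# Re-apply a manifest with the enlarged volumeClaimTemplates;
# Kubernetes adopts the orphaned Pods instead of recreating them
kubectl apply -f mysql-statefulset.yaml
```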
When Not To Use a StatefulSet
You should only use a StatefulSet when individual replicas have their own state. A StatefulSet isn’t necessary when all the replicas share the same state, even if it’s persistent.
In these situations you can use a regular ReplicaSet or Deployment to launch your Pods. Any mounted volumes will be shared across all of the Pods which is the expected behavior for stateless systems.
A StatefulSet doesn’t add value unless you need individual persistent storage or sticky replica identifiers. Using a StatefulSet incorrectly can cause confusion by suggesting Pods are stateful when they’re actually running a stateless workload.
Summary
StatefulSets provide persistent identities for replicated Kubernetes Pods. Each Pod is named with an ordinal index that’s allocated sequentially. When the Pod gets rescheduled, its replacement inherits its identity. The StatefulSet also ensures that Pods get terminated in the reverse order they were created in.
StatefulSets allow Kubernetes to accommodate applications that require graceful rolling deployments, stable network identifiers, and reliable access to persistent storage. They’re suitable for any situation where the replicas in a set of Pods have their own state that needs to be preserved.
A StatefulSet doesn’t need to be used if your replicas are stateless, even if they’re storing some persistent data. Deployments and ReplicaSets are more suitable when individual replicas don’t need to be identified or scaled in a consistent order.