Infrastructure Recovery

Kubernetes

Recovering the control plane

  • To recover from broken nodes in the control plane, use the "recover-control-plane.yml" playbook.

  • Back up what you can.

  • Provision new nodes to replace the broken ones.

  • Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups.

  • Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups, as shown in the example inventory after this list.
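
For example, if node-1 is the surviving control plane node and node-4 and node-5 are its freshly provisioned replacements, the groups might look like the following sketch (host names are hypothetical):

children:
  etcd:
    hosts:
      node-1:
      node-4:
      node-5:
  kube_control_plane:
    hosts:
      node-1:
      node-4:
      node-5: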

Examples of what "broken" means in this context:

  • One or more bare metal node(s) suffer from unrecoverable hardware failure.

  • One or more node(s) fail during patching or upgrading.

  • Etcd database corruption.

  • Other node-related failures that leave your control plane degraded or nonfunctional.

  • Note: You need at least one functional control plane node to recover using this method. If all control plane nodes are lost, recovery with this playbook is not possible and you will have to reinstall Kubernetes. Even while the control plane is down, workloads that are already running typically keep serving traffic; however, control plane operations such as scaling, creating new pods, and upgrading deployments will not work until it is restored.

Runbook

  • Move any broken etcd nodes into the "broken_etcd" group, and make sure the "etcd_member_name" variable is set for each of them.

  • Move any broken control plane nodes into the "broken_kube_control_plane" group.

  • Run the playbook with --limit etcd,kube_control_plane, and increase the number of etcd retries by setting -e etcd_retries=10 or a larger value; the number of retries required is difficult to predict (see the example invocation after this list).

  • Once you are done, you should have a fully working control plane again.
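
Putting the runbook together, a sketch might look like the following; the host name, etcd member name, and inventory path are hypothetical, and etcd_retries may need to be raised further:

broken_etcd:
  hosts:
    node-2:
      etcd_member_name: etcd2
broken_kube_control_plane:
  hosts:
    node-2:

With those groups in place, the invocation could be:

ansible-playbook -i inventory/mycluster/hosts.yml recover-control-plane.yml \
  -b --limit etcd,kube_control_plane -e etcd_retries=10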

Recover from the last quorum

  • The playbook attempts to figure out if the etcd quorum is intact. If the quorum is lost, it will attempt to take a snapshot from the first node in the "etcd" group and restore from that.

  • To restore from an alternate snapshot, set the path to that snapshot in the "etcd_snapshot" variable: -e etcd_snapshot=/tmp/etcd_snapshot.
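
For example, a restore from a specific snapshot could be invoked as follows (a sketch; the inventory path and snapshot path are placeholders):

ansible-playbook -i inventory/mycluster/hosts.yml recover-control-plane.yml \
  -b --limit etcd,kube_control_plane -e etcd_snapshot=/tmp/etcd_snapshot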

Adding or replacing a node

Removal of first kube_control_plane and etcd-master

Currently, you cannot remove the first node in your kube_control_plane and etcd-master list. If you still want to remove this node, you have to do the following:

  1. Change the order of your control plane list by moving your first entry to any other position. For example, if you want to remove node-1 from the following inventory:

children:
  kube_control_plane:
    hosts:
      node-1:
      node-2:
      node-3:
  kube_node:
    hosts:
      node-1:
      node-2:
      node-3:
  etcd:
    hosts:
      node-1:
      node-2:
      node-3:
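
After moving node-1 to the end, the reordered groups would look like this (only the kube_control_plane and etcd groups matter for the ordering):

children:
  kube_control_plane:
    hosts:
      node-2:
      node-3:
      node-1:
  etcd:
    hosts:
      node-2:
      node-3:
      node-1: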

  2. Run upgrade-cluster.yml or cluster.yml. After this, you can proceed with the removal.

Adding or replacing a worker node

  1. Add a new node to the inventory.

  2. Run scale.yml. You can use --limit=NODE_NAME to restrict Kubespray to the new node and avoid disturbing other nodes in the cluster. Before using --limit, run the facts.yml playbook without the limit to refresh the fact cache for all nodes (see the example commands after this list).

  3. Remove the old node with remove-node.yml. With the old node still in the inventory, run remove-node.yml, passing -e node=NODE_NAME to limit execution to the node being removed. If the node you want to remove is not online, add reset_nodes=false and allow_ungraceful_removal=true to your extra vars: -e node=NODE_NAME -e reset_nodes=false -e allow_ungraceful_removal=true. Use these flags even when removing other types of nodes, such as control plane or etcd nodes.

  4. Remove the node from the inventory.
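
As a concrete sketch, with a hypothetical inventory path and node names (node-4 joining, node-3 leaving), the sequence could be:

# refresh the fact cache for all nodes before using --limit
ansible-playbook -i inventory/mycluster/hosts.yml playbooks/facts.yml
# add only the new worker, leaving the rest of the cluster untouched
ansible-playbook -i inventory/mycluster/hosts.yml scale.yml -b --limit=node-4
# remove the old worker; add -e reset_nodes=false -e allow_ungraceful_removal=true if it is offline
ansible-playbook -i inventory/mycluster/hosts.yml remove-node.yml -b -e node=node-3

The location of facts.yml varies between releases; in recent Kubespray versions it lives under playbooks/.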

Adding or replacing a control plane node

  1. Append the new host to the inventory and run cluster.yml. You cannot use scale.yml for that.

  2. On all hosts, restart the nginx-proxy pod. This pod is a local proxy for the apiserver. Kubespray will update its static config, but the pod must be restarted to reload it:

docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart
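
On clusters that run containerd rather than Docker, an equivalent effect can likely be achieved with crictl; this is an assumption, so adjust it to your runtime. Since nginx-proxy is a static pod, stopping its container should suffice: the kubelet recreates it with the reloaded config.

# assumption: containerd runtime; kubelet restarts the stopped static pod container
crictl ps --name nginx-proxy -q | xargs crictl stop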

  3. With the old node still in the inventory, run remove-node.yml, passing -e node=NODE_NAME to limit execution to the node being removed. If the node you want to remove is not online, add reset_nodes=false and allow_ungraceful_removal=true to your extra vars.
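
As a sketch, with a hypothetical inventory path and node name, the whole replacement might look like:

# converge the cluster with the new control plane host added to the inventory
ansible-playbook -i inventory/mycluster/hosts.yml cluster.yml -b
# then, after restarting the nginx-proxy pods, remove the old node
# (the two extra flags are only needed if it is offline)
ansible-playbook -i inventory/mycluster/hosts.yml remove-node.yml -b -e node=node-1 \
  -e reset_nodes=false -e allow_ungraceful_removal=true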
