Infrastructure Recovery


Kubernetes

Recovering control plane

  • To recover from broken nodes in the control plane, use the recover-control-plane.yml playbook.

  • Back up what you can.

  • Provision new nodes to replace the broken ones.

  • Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups.

  • Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups.
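
The group ordering described above might look like this in a Kubespray inventory (host names here are hypothetical; adapt them to your cluster):

```yaml
# Surviving nodes listed first, replacement nodes after them.
etcd:
  hosts:
    node-2:   # surviving etcd member
    node-4:   # new replacement node
kube_control_plane:
  hosts:
    node-2:   # surviving control plane node
    node-4:   # new replacement node
```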

Examples of what broken means in this context:

  • One or more bare metal node(s) suffer from unrecoverable hardware failure.

  • One or more node(s) fail during patching or upgrading.

  • Etcd database corruption.

  • Other node-related failures that leave your control plane degraded or nonfunctional.

  • Note: You need at least one functional control plane node to recover using this method. If all control plane nodes are lost, recovery is not possible and you will have to reinstall Kubernetes. Typically, even when the control plane is down, already-running applications continue to function; however, Kubernetes operations such as scaling, creating new pods, and upgrading deployments will not work.

Runbook

  • Move any broken etcd nodes into the "broken_etcd" group, and make sure the "etcd_member_name" variable is set.

  • Move any broken control plane nodes into the "broken_kube_control_plane" group.

  • Run the playbook with --limit etcd,kube_control_plane, and increase the number of etcd retries by setting -e etcd_retries=10 or something even larger. The number of retries required is difficult to predict.

  • Once you are done, you should have a fully working control plane again.
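
Putting the runbook steps together, the invocation might look like the sketch below (the inventory path is an assumption; use your own):

```shell
# Limit the run to the etcd and control plane groups, with extra etcd retries.
ansible-playbook -i inventory/mycluster/hosts.yml recover-control-plane.yml \
  --limit etcd,kube_control_plane -e etcd_retries=10
```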

Recover from the last quorum

  • The playbook attempts to figure out if the etcd quorum is intact. If the quorum is lost, it will attempt to take a snapshot from the first node in the "etcd" group and restore from that.

  • To restore from an alternate snapshot, set the path to that snapshot in the "etcd_snapshot" variable: -e etcd_snapshot=/tmp/etcd_snapshot.
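
As a sketch, restoring from an alternate snapshot could be combined with the same playbook run (paths are illustrative):

```shell
# Restore etcd from a specific snapshot file instead of the auto-taken one.
ansible-playbook -i inventory/mycluster/hosts.yml recover-control-plane.yml \
  --limit etcd,kube_control_plane -e etcd_snapshot=/tmp/etcd_snapshot
```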

Adding or replacing a node

Removal of first kube_control_plane and etcd-master

Currently, you cannot remove the first node in your kube_control_plane and etcd-master list. If you still want to remove this node, you have to do the following:

  1. Modify the order of your control plane list by pushing your first entry to any other position. For example, if you want to remove node-1 from the following inventory:

children:
  kube_control_plane:
    hosts:
      node-1:
      node-2:
      node-3:
  kube_node:
    hosts:
      node-1:
      node-2:
      node-3:
  etcd:
    hosts:
      node-1:
      node-2:
      node-3:

2. Run upgrade-cluster.yml or cluster.yml. After this, you can proceed with the removal.
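
For the example inventory above, the reordered version might look like this (node-1 pushed out of the first position in each group so it can then be removed):

```yaml
children:
  kube_control_plane:
    hosts:
      node-2:
      node-3:
      node-1:
  kube_node:
    hosts:
      node-2:
      node-3:
      node-1:
  etcd:
    hosts:
      node-2:
      node-3:
      node-1:
```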

Adding or replacing a worker node

  1. Add a new node to the inventory.

  2. Run scale.yml. You can use --limit=NODE_NAME to keep Kubespray from disturbing other nodes in the cluster. Before using --limit, run the facts.yml playbook without the limit to refresh the facts cache for all nodes.

  3. Remove the old node with remove-node.yml. With the old node still in the inventory, run remove-node.yml, passing -e node=NODE_NAME to limit execution to the node being removed. If the node you want to remove is not online, add reset_nodes=false and allow_ungraceful_removal=true to your extra vars: -e node=NODE_NAME -e reset_nodes=false -e allow_ungraceful_removal=true. Use these flags even when removing other types of nodes, such as control plane or etcd nodes.

  4. Remove the old node from the inventory.
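
The four steps above might be run as follows (node names and the inventory path are placeholders):

```shell
# Refresh the facts cache for all nodes before using --limit.
ansible-playbook -i inventory/mycluster/hosts.yml facts.yml

# Bring up only the newly added worker node.
ansible-playbook -i inventory/mycluster/hosts.yml scale.yml --limit=node-new

# Drain and remove the old node (still present in the inventory).
ansible-playbook -i inventory/mycluster/hosts.yml remove-node.yml -e node=node-old
```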

Adding or replacing a control plane node

  1. Append the new host to the inventory and run cluster.yml. You cannot use scale.yml for that.

  2. On all hosts, restart the nginx-proxy pod. This pod is a local proxy for the apiserver. Kubespray updates its static config, but the pod needs to be restarted to reload it:

docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart

3. With the old node still in the inventory, run remove-node.yml, passing -e node=NODE_NAME to limit execution to the node being removed. If the node you want to remove is not online, add reset_nodes=false and allow_ungraceful_removal=true to your extra vars.
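
For an offline control plane node, the removal step above might look like this sketch (node name and inventory path are placeholders):

```shell
# Remove an unreachable control plane node ungracefully.
ansible-playbook -i inventory/mycluster/hosts.yml remove-node.yml \
  -e node=node-old -e reset_nodes=false -e allow_ungraceful_removal=true
```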

All content on this page by eGov Foundation is licensed under a Creative Commons Attribution 4.0 International License.
