Troubleshooting Guide

The guide covers some of the most common issues we have encountered while working with our partners. It includes the following:

DIVOC services are up, but certificates are not being generated

Symptoms

  • Vaccination events are successful.

  • Certificates are not being generated.

  • All other DIVOC services are functioning.

  • Verification app is functioning.

  • Download of certificates is functioning.

  • All DIVOC services are running on the Kubernetes cluster.

Diagnosis

The most probable cause of this issue is that the Redis server has stopped responding. This can be confirmed by running the following test:

  • Check the status of Redis

- To open the Redis command-line prompt, run the command “redis-cli”.

- At the prompt, run the command ‘PING’:

1. If the response is ‘PONG’, the server is running.

2. If the response is blank or anything other than ‘PONG’, the server is not running.

- To leave the Redis prompt, type ‘exit’.
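
The same check can be run without an interactive prompt. The following is a minimal sketch, assuming redis-cli and the GNU timeout utility are installed on the machine where it runs. The function names and the REDIS_HOST and REDIS_PORT variables are illustrative and not part of DIVOC.

```shell
# Succeeds only when the reply is exactly PONG.
redis_reply_ok() {
  [ "$1" = "PONG" ]
}

# Sends PING non-interactively and reports the result.
# REDIS_HOST and REDIS_PORT are assumed names; the fallbacks are the Redis defaults.
# timeout keeps the check from hanging when the server is unresponsive.
check_redis() {
  reply=$(timeout 5 redis-cli -h "${REDIS_HOST:-127.0.0.1}" -p "${REDIS_PORT:-6379}" PING 2>/dev/null)
  if redis_reply_ok "$reply"; then
    echo "Redis is running"
  else
    echo "Redis is not responding (reply: '${reply}')"
    return 1
  fi
}

# Usage:
#   check_redis
```

A blank reply, an error message, or a timeout are all treated as "not running", which matches the rule in the steps above.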

Steps to resolve the incident

  • Restart Redis if it is unresponsive or down.

- Restarting the service mainly involves killing the service and starting it again.

- Procedure to kill a service:

1. To kill a service on a Linux machine, you need the process ID of the service. Run the following command and note the process ID (the number in the second column of the output) for the next step: $ ps aux | grep redis.

2. To kill the unresponsive service, run the following command, replacing process_id_of_service with the value noted in the previous step: $ sudo kill -9 process_id_of_service.

- Procedure to start a service:

1. After the service has been killed, start it again by running the command: $ sudo service redis-server start.

2. To check the status of the service, run: $ sudo service redis-server status.

- Note: Check the status of the redis-server as mentioned above.
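
The kill-and-restart procedure above can be sketched as a small script. It assumes the service is named redis-server, as in the commands above, and that the Redis process shows up as redis-server in the COMMAND column of ps aux. The function names are hypothetical.

```shell
# Reads `ps aux` output on stdin and prints the PID (second column) of every
# process whose command (eleventh column) is redis-server.
# Matching on the command column skips the `grep redis` line itself.
redis_pids() {
  awk '$11 ~ /redis-server/ {print $2}'
}

# Kills every redis-server process, starts the service, then shows its status.
restart_redis() {
  for pid in $(ps aux | redis_pids); do
    sudo kill -9 "$pid"
  done
  sudo service redis-server start
  sudo service redis-server status
}

# Usage:
#   restart_redis
```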

Action to prevent a similar issue in future

This problem occurs frequently when the resources allocated to the Redis cluster are too low. To ensure that the problem does not recur, we advise increasing the server configuration to at least 16GB of memory, or more depending on the population that the installation serves. Also consider running Redis on its own server infrastructure instead of sharing resources with other software.

DIVOC services are down but servers are configured and functioning correctly

Symptoms

  • All servers are accessible through SSH.

  • Infrastructure is configured correctly - Kafka, Elasticsearch, Postgres, Redis.

  • Kubernetes worker nodes are running DIVOC services.

  • DIVOC services are not accessible through API endpoints.

Diagnosis

The most probable cause of this incident is that the Kubernetes client certificates have expired. Currently, the Kubernetes certificates that enable communication between the master and worker nodes are valid for 1 year, so they must be renewed every year for continued delivery of the platform. This can be confirmed by running the following tests:

  • Confirm that the master and worker nodes of the cluster are reachable.

  • On running the “kubectl get pods -n divoc” command on the master node, you get an error saying “Client Certificates generated by kubeadm are expired. Can’t reach the cluster.”

  • Execute the “kubeadm certs check-expiration” command on the master node to check the expiration of the certificates.

  • If the certificates have expired, the kube-apiserver, kube-scheduler, and kube-controller-manager services are not able to manage the DIVOC services deployed on the worker nodes.
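
As a complement to kubeadm certs check-expiration, the expiry date can also be read from a certificate file. The sketch below assumes openssl and GNU date are available on the master node. The default path is the usual kubeadm location and may differ on your installation, and the function names are hypothetical.

```shell
# Whole days between two Unix timestamps (truncated toward zero).
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# Prints the number of days before a certificate file expires.
# A negative number means the certificate has already expired.
cert_days_left() {
  cert="${1:-/etc/kubernetes/pki/apiserver.crt}"   # assumed default path
  # openssl prints a line such as: notAfter=Dec 11 10:00:00 2023 GMT
  end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
  days_between "$(date +%s)" "$(date -d "$end" +%s)"
}

# Usage:
#   cert_days_left /etc/kubernetes/pki/apiserver.crt
```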

Steps to resolve the incident

Run the following commands on the master node to resolve the incident:

  • Take a backup of the existing config

- cp ~/.kube/config ~/.kube/dec-11-2022-expired-config

  • Renew the certificates

- kubeadm certs renew all

  • Restart Kubelet

- systemctl restart kubelet

  • Restart kube-apiserver, kube-controller-manager, kube-scheduler, and etcd so that they use the newly generated certificates.

- List all the running containers on the master node: docker ps.

- Stop and remove the kube-apiserver, kube-controller-manager, kube-scheduler, and etcd containers so that the services are restarted:

1. docker stop <containerId>

2. docker rm <containerId>

  • As soon as these services have restarted, they will be able to restart the services (certificate-api) that went down on the worker node.
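
The steps above can be collected into one script. This is a sketch under stated assumptions, not a definitive procedure: the DRY_RUN switch, the function names, and the dated backup file name are illustrative, and the k8s_<component> container name prefix is an assumption that should be verified against the output of docker ps before use.

```shell
# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

renew_k8s_certs() {
  # Back up the existing config under a dated name (assumed naming scheme).
  run cp "$HOME/.kube/config" "$HOME/.kube/expired-config-$(date +%F)"
  run kubeadm certs renew all
  run systemctl restart kubelet
  # Stop and remove the control-plane containers so that they are restarted.
  for component in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "+ docker stop/rm containers matching k8s_${component}"
      continue
    fi
    # Assumed name prefix; check it with `docker ps` first.
    for id in $(docker ps -q --filter "name=k8s_${component}"); do
      docker stop "$id"
      docker rm "$id"
    done
  done
}

# Usage: preview first, then run for real on the master node.
#   DRY_RUN=1 renew_k8s_certs
#   renew_k8s_certs
```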

Action to prevent a similar issue in future

In a managed Kubernetes service, the certificates are renewed automatically upon expiry. In self-hosted or on-premise deployments of a Kubernetes cluster, the certificates have to be renewed manually. It is also possible to set the expiry to more than a year, but this is not recommended. The best practice is to set calendar alerts for the renewal dates.

DIVOC services take a long time to return

Symptoms

  • All DIVOC services are up and running.

  • All infrastructure services are up and running.

  • All servers are accessible over SSH.

  • DIVOC REST services are successful, but take a long time to execute.

Diagnosis

The most probable cause of this issue is that indexes are not present in the Postgres database table. This can be confirmed by the following steps:

  • Connect directly to the Postgres database using psql.

  • Check whether indexes are present on the following columns of the V_VaccinationCertificate table in the database:

- osid

- certificateId

- contact

- mobile

- preEnrollmentCode
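
The check can be scripted by reading index definitions from the pg_indexes system view. The sketch below assumes psql is installed and that the connection settings are supplied through the standard PGHOST, PGUSER, and PGDATABASE environment variables. The function names are hypothetical, and only single-column indexes are recognised.

```shell
# Prints one index definition per line for the certificate table.
list_cert_indexes() {
  psql -At -c "SELECT indexdef FROM pg_indexes WHERE tablename = 'V_VaccinationCertificate';"
}

# Reads index definitions on stdin and prints every expected column
# that has no single-column index.
missing_index_columns() {
  defs=$(cat)
  for col in osid certificateId contact mobile preEnrollmentCode; do
    # Postgres prints lower-case columns bare, e.g. (mobile), and
    # mixed-case columns quoted, e.g. ("certificateId").
    printf '%s\n' "$defs" | grep -Eq "\\(\"?${col}\"?\\)" || echo "$col"
  done
}

# Usage:
#   list_cert_indexes | missing_index_columns
```

An empty result means every listed column already has an index.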

Steps to resolve the incident

After connecting to the Postgres registry database, run the following SQL commands to add indexes on the columns:

  • CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_preEnrollmentCode_sqlgIdx" ON "public"."V_VaccinationCertificate" ("preEnrollmentCode");

  • CREATE UNIQUE INDEX CONCURRENTLY "public_V_VaccinationCertificate_certificateId_sqlgIdx" ON "public"."V_VaccinationCertificate" ("certificateId");

  • CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_contact_sqlgIdx" ON "public"."V_VaccinationCertificate" ("contact");

  • CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_mobile_sqlgIdx" ON "public"."V_VaccinationCertificate" ("mobile");

  • CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_osid_sqlgIdx" ON "public"."V_VaccinationCertificate" ("osid");

Action to prevent a similar issue in future

This is a one-time activity that needs to be done as soon as the database tables/registry are created. This dependency exists because Sunbird-RC does not have the capability to add indexes on schema creation.

DIVOC services are not returning success HTTP status codes

Sometimes DIVOC services return non-2XX status codes. This section is split into sub-sections, one for each non-2XX status code received.

Status Code: 401

You will get a 401 status code when the authentication/authorisation Bearer token being used has expired. To generate a new token:

  • Open Postman.

  • Create a POST request to /auth/realms/divoc/protocol/openid-connect/token endpoint.

  • Add the following parameters:

    - client_id as admin-api

    - grant_type as client_credentials

    - client_secret as <Value provided to you during installation>

  • Once the request is sent, you will receive the token (the access_token field) as part of the response payload.

  • Modify the ADMIN_API_SECRET parameter within the divoc-config.yaml file.

  • Restart all the services using: kubectl rollout restart deployments -n <namespace of divoc installation>
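
The token request can also be made from the command line instead of Postman. The sketch below uses the standard OAuth 2.0 form parameter names (grant_type, client_id, client_secret) and assumes curl is installed. DIVOC_HOST, CLIENT_SECRET, the https scheme, and the function names are assumptions for illustration.

```shell
# Extracts the access_token value from a JSON token response on stdin.
# A sed-based sketch; jq is a sturdier choice where it is available.
extract_access_token() {
  sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Requests a token with the client-credentials grant and prints it.
# DIVOC_HOST and CLIENT_SECRET are assumed variable names.
fetch_token() {
  curl -s -X POST "https://${DIVOC_HOST}/auth/realms/divoc/protocol/openid-connect/token" \
    -d "grant_type=client_credentials" \
    -d "client_id=admin-api" \
    --data-urlencode "client_secret=${CLIENT_SECRET}" | extract_access_token
}

# Usage:
#   DIVOC_HOST=divoc.example.org CLIENT_SECRET=... fetch_token
```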

Status Code: 405

  • Check if the Content-Type in the header section is set as ‘application/json’

  • If not, set the Content-Type as ‘application/json’

Status Code: 602

  • Check if the payload is missing any parameter value like ‘preEnrollmentCode’, ‘recipient.name’, etc.

  • If yes, add the missing parameter and check.

Status Code: 400

  • Check if the format of each value in the payload and the JSON structure match what is expected. For example, check the format of date values and whether the dose count is sent as a number or a string.

  • If not, correct the value type in the payload.

Status Code: 504

  • Check if the DIVOC system is reachable from the source system or if IP/domain of the DIVOC system is mapped correctly.

  • If not, correct the IP/Domain name or check the network

Status Code: 502

  • Check if all the DIVOC services required for the generation of certificates are up and running.

  • Steps to be followed to check if required services are running:

- Log in to the DIVOC orchestration server.

- Run this command: kubectl get pods -n <divoc namespace>

  • If any of the pods are down and do not have an active running container, do the following:

- Restart the pod with this command: kubectl rollout restart deployment <name of the deployment which is down> -n <divoc namespace>

- Run this command again: kubectl get pods -n <divoc namespace>

- Validate if all the deployments are up again.

- Check if you are able to generate the certificate.

  • If you are still not able to generate the certificate, check the logs of the deployments one by one using this command: kubectl logs -f deployment/<deployment_name> -n <divoc_namespace>
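
To spot the pods that need attention in a long listing, the output of kubectl get pods can be filtered. This is a minimal sketch; the function name and the DIVOC_NAMESPACE variable are hypothetical.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the name,
# READY count and STATUS of each pod that is down or not fully ready.
pods_not_running() {
  awk '{
    split($2, ready, "/")
    down = ($3 != "Running" && $3 != "Completed")
    not_ready = ($3 == "Running" && ready[1] != ready[2])
    if (down || not_ready) print $1, $2, $3
  }'
}

# Usage (DIVOC_NAMESPACE is an assumed variable name):
#   kubectl get pods -n "$DIVOC_NAMESPACE" --no-headers | pods_not_running
```

An empty result means every pod is either Running with all of its containers ready or has Completed.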

Common Infrastructure Maintenance Issues

Pods restart frequently

  • If you run kubectl get pods -n <divoc-namespace> and see that the number of pod restarts is high, there can be multiple reasons for the restarts:

- CPU limit is exceeded by pods: In this case, modify the deployment by increasing the requests and limits on the CPU.

- Memory limit is exceeded by pods: In this case, modify the deployment by increasing the requests and limits on memory.

- Memory issue in the machine on which Kubernetes (worker node) is installed: To address this, one can increase the number of worker nodes or increase the memory of worker nodes and then recreate pods if necessary.

- Code issue: Sometimes there can be an issue with the code or the configuration might be missing. In such cases, one needs to fix the bug.
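
The first two cases above can be sketched as follows: one helper that lists pods with many restarts, and one that raises the requests and limits of a deployment with kubectl set resources. The function names, the default threshold, and the resource values are placeholders, not recommendations.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the pods
# whose RESTARTS column (fourth field) is above the given threshold.
pods_restarting() {
  awk -v max="${1:-5}" '$4 + 0 > max {print $1, $4 + 0}'
}

# Raises the CPU and memory requests and limits of one deployment.
# Arguments: deployment name, namespace. The values below are placeholders.
raise_resources() {
  kubectl set resources deployment "$1" -n "$2" \
    --requests=cpu=500m,memory=512Mi \
    --limits=cpu=1,memory=1Gi
}

# Usage:
#   kubectl get pods -n <divoc-namespace> --no-headers | pods_restarting 10
#   raise_resources <deployment-name> <divoc-namespace>
```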

How to apply OS updates or patches on DIVOC infrastructure

Typically, DIVOC infrastructure is built over a cluster of Kubernetes, Kafka, and Postgres. As part of the operations, one of the key tasks of the infrastructure team is to apply security patches and updates to the OS.

Note: Never log in to a machine that is part of a cluster and apply a patch to it directly. Patching in place is appropriate only for standalone servers, not for cluster-based servers. This problem does not occur in cloud-managed infrastructure; it is an issue only on self-managed or on-premise infrastructure.

In this section, we will discuss how to apply patches to the cluster without bringing down the application.

The general guidelines when dealing with the cluster are the following:

  • Disconnect the server from the cluster.

  • Apply patches to the server.

  • Rejoin the server back to the cluster.
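
For a Kubernetes worker node, the three guidelines above map onto kubectl drain and kubectl uncordon. The sketch below covers only that case; Kafka and Postgres nodes need their own procedures. It assumes a Debian or Ubuntu host reachable over SSH under its node name, and the DRY_RUN switch and function names are illustrative.

```shell
# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Patches one Kubernetes worker node while the rest of the cluster keeps serving.
patch_worker_node() {
  node="$1"
  # Disconnect: evict the pods and mark the node unschedulable.
  # Older kubectl versions use --delete-local-data instead of --delete-emptydir-data.
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Apply patches (assumes apt-get, i.e. a Debian or Ubuntu host).
  run ssh "$node" "sudo apt-get update && sudo apt-get -y upgrade"
  # Rejoin: allow pods to be scheduled on the node again.
  run kubectl uncordon "$node"
}

# Usage: preview first, then run for real, one node at a time.
#   DRY_RUN=1 patch_worker_node <node-name>
#   patch_worker_node <node-name>
```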
