Common Infrastructure Issues and their Recovery Instructions

Levels of Support

L1: It is the initial level of support provided by the user help desk. They help to screen the issues and typically handle queries like "how to," FAQs, user creation, password resets, etc.

L2: It deals with support tickets that can be resolved by doing basic configuration in the application or suggesting workarounds. Other activities typically include environment management, e.g. server monitoring and server management. For L2 support, we expect a team with infrastructure management skills.

L3: It deals with tickets typically requiring minor country-specific code changes (certificate templates, logo, UI, and not core platform code), analysis of changes in new/patch versions, data queries, and environment issues that cannot be resolved by L2 staff. For L3 support, we expect a team with software engineering skills.

L4: It deals with tickets related to product enhancements or product defects. These would typically be worked on by the DIVOC team, which will either release a hotfix or patch release, bundle the change into the next release, or defer/deprioritise it.

Troubleshooting Guide for L2

1. If you are getting the response as 401:

  • Possible causes: Token has expired.

- Check that the access token is valid and correct for the API call.

  • Action to be taken:

- Open Postman.

- Create a POST request to /auth/realms/divoc/protocol/openid-connect/token endpoint.

- Add the following parameters:

client_id as admin-api

grant_type as client_credentials

client_secret as <client secret of the admin-api client>

- Once the request is sent, you will receive the auth_token as part of the payload.

- Modify the ADMIN_API_SECRET parameter within divoc-config.yaml file.

- Restart all the services using: kubectl rollout restart deployments -n <namespace of divoc installation>
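The Postman steps above can also be sketched as a curl call from a shell. This is only a sketch and not part of the original procedure: DIVOC_HOST and the function name request_admin_token are placeholders, and it assumes that the token endpoint accepts the three parameters as form-encoded fields.

```shell
# Base URL of the DIVOC installation (hypothetical placeholder; set your own).
DIVOC_HOST="${DIVOC_HOST:-https://divoc.example.org}"

# Request a token for the admin-api client and print the JSON response.
# $1 = client secret of the admin-api client.
request_admin_token() {
  curl -s -X POST \
    "${DIVOC_HOST}/auth/realms/divoc/protocol/openid-connect/token" \
    -d "client_id=admin-api" \
    -d "grant_type=client_credentials" \
    --data-urlencode "client_secret=$1"
}
```

Calling request_admin_token with the client secret prints the response payload, from which the token can be copied.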

2. If you are getting the response as 405:

  • Action to be taken:

- Check if the Content-Type in the header section is set as ‘application/json’

- If not, set the Content-Type as ‘application/json’
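When the call is made from a script rather than Postman, the header can be set explicitly. The sketch below is illustrative only: the function name post_json is hypothetical, the URL and payload file come from the caller, and the Bearer authorization header is an assumption about how the token is passed.

```shell
# POST a JSON file with an explicit Content-Type header and print the
# response body followed by the HTTP status code.
# $1 = full URL of the API being called
# $2 = path to the JSON payload file
# $3 = access token (assumed to be sent as a Bearer token)
post_json() {
  curl -s -X POST "$1" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $3" \
    --data-binary @"$2" \
    -w '\nHTTP %{http_code}\n'
}
```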

3. If you are getting the response as 602:

  • Action to be taken:

- Check if the payload is missing any parameter value like ‘preEnrollmentCode’, ‘recipient.name’, etc.

- If yes, add the missing parameter and check.
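A quick way to spot an absent parameter is to search the payload file for each expected key. The helper below is a rough sketch with a hypothetical name (missing_keys). It is a plain text search, so it only detects keys that are absent altogether; it does not detect empty values or keys at the wrong nesting level.

```shell
# Print each expected key that does not appear in the payload file.
# $1 = path to the JSON payload; remaining arguments = key names to look for.
missing_keys() {
  payload_file="$1"
  shift
  for key in "$@"; do
    # Look for the key name in double quotes, as it appears in JSON.
    grep -q "\"$key\"" "$payload_file" || echo "$key"
  done
}
```

For example, missing_keys payload.json preEnrollmentCode recipient name prints nothing when all three keys are present.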

4. If you are getting the response as 400:

  • Action to be taken:

- Check if the format of value in payload or json structure is as per the expected structure. For example - format of date value, dose count is number or string, etc.

- If not, correct the value type in the payload.

5. If you are getting the response as 504:

  • Action to be taken:

- Check if the DIVOC system is reachable from the source system, or if the IP/domain of the DIVOC system is mapped correctly.

- If not, correct the IP/domain name or check the network.
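Reachability can be checked from the source system with curl. This is a minimal sketch: the function names are hypothetical and the URL is whatever address the source system uses for DIVOC. curl reports the status code 000 when no connection could be made at all, which points to a network or IP/domain mapping problem rather than an application error.

```shell
# Print the HTTP status code returned by a URL, or 000 if no connection
# could be made within the timeout.
# $1 = URL of the DIVOC system as seen from the source system.
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Succeed if any HTTP response came back, whatever its status code.
is_reachable() {
  code="$(http_status "$1")"
  [ "$code" != "000" ]
}
```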

6. If you are getting the response as 502 (Bad Gateway):

  • Action to be taken:

- Check if all the DIVOC services required for the generation of certificates are up and running.

- Steps to be followed to check if required services are running:

Log in to the DIVOC server.

Go to the deployment folder.

Run this command: kubectl get pods -n <divoc namespace>

- If any of the pods are down and do not have an active running container:

Restart the pod with this command: kubectl rollout restart deployment <name of the deployment which is down> -n <divoc namespace>

Run this command again: kubectl get pods -n <divoc namespace>

Validate if all the deployments are up again.

Check if you are able to generate the certificate.

- If you are still not able to generate the certificate, then check the logs of deployments one by one using this command: kubectl logs -f deployment/<deployment_name> -n <divoc_namespace>

- If you find any errors in the logs or if the logs are not clear to you, share the logs with the L3 team for resolution of the issue.
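With many pods, the unhealthy ones are easier to spot with a filter. The sketch below uses hypothetical function names and assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# pods that are not Running, or that are Running with containers not ready.
# Pods of finished jobs (STATUS "Completed") are skipped.
unhealthy_pods() {
  awk '$3 == "Completed" { next }
       { split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) print $1 }'
}

# List unhealthy pods in a namespace. $1 = DIVOC namespace.
list_unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers | unhealthy_pods
}
```

An empty result from list_unhealthy_pods suggests that all pods are up, in which case the deployment logs are the next place to look.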

7. If the gateway service is down:

  • Action to be taken:

- Try restarting the gateway service: kubectl rollout restart deployment gateway -n <divoc_namespace>

- If the service does not start, look at the deployment logs and pass on the information to the L3 team: kubectl logs -f deployment/gateway -n <divoc_namespace>

8. If you are trying to generate/update a certificate, check if the vaccination API service is down:

  • Action to be taken:

- Try restarting the vaccination API service: kubectl rollout restart deployment vaccination-api -n <divoc_namespace>

- If the service does not start, look at the deployment logs and pass on the information to the L3 team: kubectl logs -f deployment/vaccination-api -n <divoc_namespace>

9. If the certificate signer service is down:

  • Action to be taken:

- Try restarting the certificate signer service: kubectl rollout restart deployment certificate-signer -n <divoc_namespace>

- If the service does not start, look at the deployment logs and pass on the information to the L3 team: kubectl logs -f deployment/certificate-signer -n <divoc_namespace>

10. If the registry services are down:

  • Action to be taken:

- Try restarting the registry service: kubectl rollout restart deployment registry -n <divoc_namespace>

- Try connecting to the database directly using the following command: psql -h <DB_ADDRESS> -U

a. If you are able to connect to the database, look at the deployment logs and pass on the information to the L3 team: kubectl logs -f deployment/registry -n <divoc_namespace>

b. If you are unable to connect to the database, restart the database and try connecting again. If the problem persists, reach out to the L3 team.

11. If you are trying to fetch the certificate, check if the certificate API services are down:

  • Action to be taken:

- Try restarting the certificate API service: kubectl rollout restart deployment certificate-api -n <divoc_namespace>

- If the service does not start, look at the deployment logs and pass on the information to the L3 team: kubectl logs -f deployment/certificate-api -n <divoc_namespace>
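Items 7 to 11 share the same pattern: restart the deployment, then collect logs if it does not come up. A possible shell helper for that pattern is sketched below. The function name, the 120-second timeout, and the 200-line log tail are arbitrary choices, not values from this guide.

```shell
# Restart one deployment, wait for the rollout to finish, and print recent
# logs if it does not finish in time.
# $1 = deployment name (for example gateway), $2 = DIVOC namespace.
restart_and_check() {
  kubectl rollout restart deployment "$1" -n "$2" || return 1
  if kubectl rollout status "deployment/$1" -n "$2" --timeout=120s; then
    echo "$1 is up"
  else
    echo "$1 did not come up; logs to share with the L3 team:" >&2
    kubectl logs "deployment/$1" -n "$2" --tail=200
    return 1
  fi
}
```

For example, restart_and_check certificate-api followed by the namespace covers both steps of item 11.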

12. If the SMS/notification services are down:

  • Action to be taken:

- Generate a new SMS Auth Key from the SMS provider.

- Update SMS_AUTH_KEY property in divoc-config.yaml.

- Restart notification service: kubectl rollout restart deployment notification-service -n <divoc_namespace>

13. If the services are taking a long time to return:

  • Possible causes: Indexes not present in database.

  • Action to be taken:

- Check if indexes are present on the following columns of the VaccinationCertificate table (V_VaccinationCertificate) in the database:

a. OSID

b. certificateId

c. Contact

d. Mobile

e. preEnrollmentCode

- If they are not present, run the following commands:

a. CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_preEnrollmentCode_sqlgIdx" ON "public"."V_VaccinationCertificate" ("preEnrollmentCode");

b. CREATE UNIQUE INDEX CONCURRENTLY "public_V_VaccinationCertificate_certificateId_sqlgIdx" ON "public"."V_VaccinationCertificate" ("certificateId");

c. CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_contact_sqlgIdx" ON "public"."V_VaccinationCertificate" ("contact");

d. CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_mobile_sqlgIdx" ON "public"."V_VaccinationCertificate" ("mobile");

e. CREATE INDEX CONCURRENTLY "public_V_VaccinationCertificate_osid_sqlgIdx" ON "public"."V_VaccinationCertificate" ("osid");
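To see which indexes already exist before creating any, the PostgreSQL pg_indexes view can be queried. The sketch below wraps that query in psql. The function names are hypothetical, and the connection details (host, user, database name) are supplied by the caller. The second helper matches on the index names used in the commands above, so an index created under a different name would be reported as missing.

```shell
# List the names of the indexes on the V_VaccinationCertificate table.
# $1 = database host, $2 = database user, $3 = database name.
list_certificate_indexes() {
  psql -h "$1" -U "$2" -d "$3" -At -c \
    "SELECT indexname FROM pg_indexes
     WHERE schemaname = 'public'
       AND tablename = 'V_VaccinationCertificate';"
}

# Read index names on stdin and print the expected ones that are absent.
missing_indexes() {
  existing="$(cat)"
  for col in osid certificateId contact mobile preEnrollmentCode; do
    name="public_V_VaccinationCertificate_${col}_sqlgIdx"
    printf '%s\n' "$existing" | grep -qxF "$name" || echo "$name"
  done
}
```

Piping list_certificate_indexes into missing_indexes prints only the index names that still need to be created.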

14. If signed certificates are not being created when vaccination events occur:

  • Possible causes: Redis server is down.

  • Possible actions:

- Check if you are able to connect to redis server using redis-cli: redis-cli -h <IP ADDR of server>

- If you are not able to connect, then restart the server.

a. SSH into the redis server.

b. List the redis-server process: sudo service redis-server status.

c. Fetch the process-id of redis-server.

d. Kill the redis-server process (sudo kill -9 <process-id>).

e. Restart the redis-server process (sudo systemctl restart redis).

f. Confirm that we are now able to connect to redis-server using “redis-cli” command.
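The connectivity check in the first and last steps can be scripted with the Redis PING command, which a healthy server answers with PONG. This is a minimal sketch: the function name is hypothetical, and it assumes that the server needs no password and listens on the default port.

```shell
# Succeed if the Redis server answers PING with PONG.
# $1 = IP address or host name of the Redis server.
redis_is_up() {
  [ "$(redis-cli -h "$1" ping 2>/dev/null)" = "PONG" ]
}
```

For example: redis_is_up followed by the server address, then || echo "redis is not reachable".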

Infrastructure Issues

  1. Increase the limit on the number of times a certificate can be updated:

Update the “divoc-config.yaml” file with a new value (greater than the default value of 100) for the “CERTIFICATE_UPDATE_LIMIT” property and apply it. Then restart the vaccination API service: kubectl rollout restart deployment vaccination-api -n <divoc-namespace>

2. Pod is restarting frequently (the number of pod restarts shown by kubectl get pods -n <divoc-namespace> is high):

There can be multiple reasons why a pod restarts:

  • CPU limit is exceeded by pods: Modify the deployment by increasing the requests and limits on CPU.

  • Memory limit is exceeded by pods: Modify the deployment by increasing the requests and limits on memory.

  • Memory issue in the machine on which Kubernetes (worker node) is installed. We can increase the number of worker nodes or increase the memory of the worker nodes and then recreate pods if necessary.

  • Code issue: Sometimes there can be an issue with the code, or the config might be missing. In such cases, we need to fix the bug.
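The causes above can be narrowed down from the command line. The helpers below are a sketch only: the function names are hypothetical, and the CPU and memory sizes are placeholders to be replaced with values suited to the deployment.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name and
# restart count of pods whose RESTARTS column (fourth field) exceeds a limit.
# $1 = restart limit.
high_restart_pods() {
  awk -v limit="$1" '$4 + 0 > limit + 0 { print $1, $4 }'
}

# Print why the containers of a pod were last terminated, for example
# OOMKilled when the memory limit was exceeded.
# $1 = pod name, $2 = DIVOC namespace.
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Raise CPU and memory requests and limits on a deployment.
# $1 = deployment name, $2 = DIVOC namespace; the sizes are placeholders.
raise_resources() {
  kubectl set resources deployment "$1" -n "$2" \
    --requests=cpu=500m,memory=512Mi \
    --limits=cpu=1,memory=1Gi
}
```

A termination reason of OOMKilled points to the memory limit. If no reason is reported, the pod logs are the next place to look for a code or config issue.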

3. Kubernetes cluster is not reachable from Kubeadm master node as SSL certs have expired:

If you encounter the following error:

#> kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T21:07:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
The connection to the server 135.122.6.50:6443 was refused - did you specify the right host or port?

Recovery steps are as follows:

  1. Check if certs have expired: kubeadm alpha certs check-expiration --config=/root/kubernetes/kubeadm-config.yaml

  2. Renew Certs:

  • cd /etc/kubernetes/pki/

  • mv {apiserver.crt,apiserver-etcd-client.key,apiserver-kubelet-client.crt,front-proxy-ca.crt,front-proxy-client.crt,front-proxy-client.key,front-proxy-ca.key,apiserver-kubelet-client.key,apiserver.key,apiserver-etcd-client.crt} ~/

  • kubeadm init phase certs all --apiserver-advertise-address <Specify Master node LAN IP addr>

  • cd /etc/kubernetes/

  • mv {admin.conf,controller-manager.conf,kubelet.conf,scheduler.conf} ~/

  • kubeadm init phase kubeconfig all

3. Reboot server: reboot

4. After the reboot, ensure that docker and all Kube* daemons are up: docker ps | grep kube-apiserver

5. Replace the kubectl config file with the newly created one (this step is mandatory) to resolve the “kubectl localhost:8080 connection refused” issue:

  • mkdir -p $HOME/.kube

  • sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

  • sudo chown $(id -u):$(id -g) $HOME/.kube/config

6. Issue kubectl commands to confirm that the cluster is reachable again.
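As a cross-check for step 1 of the recovery steps, the expiry date of each certificate file can also be read directly with openssl. This is a sketch with a hypothetical function name; it only inspects the .crt files in the given directory and does not renew anything.

```shell
# Print the expiry date of each certificate in a directory and flag the ones
# that have already expired.
# $1 = certificate directory (this guide uses /etc/kubernetes/pki).
report_cert_expiry() {
  for cert in "$1"/*.crt; do
    [ -e "$cert" ] || continue
    end_date="$(openssl x509 -enddate -noout -in "$cert")"
    # -checkend 0 exits with 0 only if the certificate is still valid now.
    if openssl x509 -checkend 0 -noout -in "$cert" >/dev/null; then
      echo "$cert: valid, ${end_date}"
    else
      echo "$cert: EXPIRED, ${end_date}"
    fi
  done
}
```

Running it before and after the renewal shows whether the notAfter dates have moved forward.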

Resources:

Pre-reads:

Modifying vaccine certificate and template; Branding changes such as UI changes; Adding a new role; Changing the content to the verification page:

Wiki documentation & discussion forum:
