Kubernetes Monitoring

Prometheus & Grafana

  1. Visualize Cluster Information in Dashboards:

    • Use Grafana to build dashboards that display essential cluster metrics such as CPU and memory usage, node health, and pod status.
  2. Pull Custom Application Logs via Sidecar:

    • Run a sidecar container alongside the application container in your pods to read custom application logs and make them available for monitoring.
  3. Create Dashboards as Code for Easy Editing:

    • Define and manage your Grafana dashboards as code, in the spirit of Infrastructure as Code (IaC). This keeps the monitoring setup easy to edit, version-controlled, and reproducible.
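To make the "dashboards as code" idea concrete, here is a minimal sketch that writes a dashboard definition to a file you can commit to version control. The function name, file name, uid, panel titles, and PromQL queries are all assumptions for illustration. The two metrics are the standard cAdvisor container metrics, but whether they exist in your Prometheus depends on your scrape configuration. A real Grafana export carries more fields than shown here.

```shell
#!/bin/sh
# write_dashboard is a hypothetical helper: it writes a small Grafana
# dashboard definition (two time-series panels) to the given path.
# Every name and query inside the JSON is illustrative, not prescriptive.
write_dashboard() {
  out="$1"
  cat > "$out" <<'EOF'
{
  "uid": "cluster-overview",
  "title": "Cluster Overview",
  "refresh": "30s",
  "time": { "from": "now-1h", "to": "now" },
  "panels": [
    {
      "id": 1,
      "type": "timeseries",
      "title": "CPU usage (cores)",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        { "refId": "A", "expr": "sum(rate(container_cpu_usage_seconds_total[5m]))" }
      ]
    },
    {
      "id": 2,
      "type": "timeseries",
      "title": "Memory working set (bytes)",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        { "refId": "A", "expr": "sum(container_memory_working_set_bytes)" }
      ]
    }
  ]
}
EOF
}
```

Calling write_dashboard cluster-overview.json produces a file that can be reviewed in a pull request like any other change. How the file reaches Grafana (manual import, file provisioning, or a ConfigMap) depends on how your Grafana is deployed.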

Embrace Prometheus and Grafana to gain valuable insights into your Kubernetes cluster's health and performance. #KubernetesMonitoring #Prometheus #Grafana

Incident Response Scenario: Leaderboard Service Outage

Scenario: You are a Container Engineer responsible for maintaining an e-commerce platform's critical leaderboard service. This service displays the top-selling products on the platform. Suddenly, you receive an urgent alert notifying you that the leaderboard service is down, and it's impacting the user experience. Your task is to quickly respond to this incident, diagnose the issue, and restore the service to its normal working state.

Role: Container Engineer
Platform: E-commerce
Service: Leaderboard

Real-time Actions:

  1. Immediate Alert Acknowledgment: You acknowledge the alert right away and notify your team that you are investigating the incident.

  2. Monitoring and Logging Tools: Use monitoring tools such as Prometheus and Grafana to check cluster health and resource usage in real time.

  3. Kubernetes Cluster Status Check: Use kubectl to confirm the Kubernetes cluster status. Inspect the nodes and control plane components to rule out a cluster-wide issue.

  4. Leaderboard Service Logs: Check the service logs for error messages with kubectl logs and review recent events.

  5. Pod Inspection: Run kubectl get pods to list all pods, including the leaderboard pod. Look for any pod whose status is not Running, such as "Failed".

  6. Troubleshooting the Pod: Use kubectl describe pod <pod-name> to uncover details about the failure, including resource and volume-mount issues.

  7. Resource Check: Review resource requests and limits in the pod configuration to rule out resource starvation.

  8. Rolling Restarts: If issues are found, trigger a rolling restart of the Deployment to create fresh pods.

  9. Health Checks: Ensure the liveness and readiness probes in the Deployment are correctly configured.

  10. Integration and Network Issues: Investigate integration and network problems within the cluster.

  11. Database Connectivity: Verify that the leaderboard service can connect to the database, which it needs to fetch sales data.

  12. Docker Image: Confirm that the Docker image referenced in the Deployment configuration exists and is the correct one.

  13. Service Checks: Confirm that the Kubernetes Service routes traffic correctly and is reachable.

  14. Backup and Rollback Plan: Keep a rollback plan ready in case the issue persists. Consider implementing a fallback that serves a default leaderboard.

  15. Documentation and Communication: Document all actions and updates, keeping the team and stakeholders informed.

  16. Resolution and Verification: After addressing the root cause, verify that the leaderboard service is operational and meets performance expectations.

  17. Post-Incident Analysis: Conduct a post-incident analysis to understand the cause, document lessons learned, and implement preventive measures.
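Steps 8, 13, and 14 above (rolling restart, Service check, rollback) can be sketched as a single shell helper. This is only an illustration: restart_and_verify is a hypothetical function name, the 120-second timeout is an arbitrary choice, and it assumes the leaderboard runs as a Deployment fronted by a Service.

```shell
#!/bin/sh
# restart_and_verify is a hypothetical helper.
#   $1: Deployment name (assumed, e.g. leaderboard)
#   $2: Service name    (assumed, e.g. leaderboard)
restart_and_verify() {
  deploy="$1"
  svc="$2"

  # Stamps the pod template with a restart annotation, so pods are
  # replaced according to the Deployment's update strategy.
  kubectl rollout restart "deployment/$deploy" || return 1

  # Wait for the rollout to finish; if it does not, revert to the
  # previous revision and report failure.
  if ! kubectl rollout status "deployment/$deploy" --timeout=120s; then
    echo "rollout did not complete, rolling back" >&2
    kubectl rollout undo "deployment/$deploy"
    return 1
  fi

  # An empty ENDPOINTS column means the Service matches no ready pods.
  kubectl get endpoints "$svc"
}
```

You would call it as restart_and_verify leaderboard leaderboard. Note that rolling back only helps when the previous revision was healthy, so treat the automatic undo as a convenience rather than a fix.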

In this real-time scenario, swift response, a systematic troubleshooting approach, and effective communication are vital to minimize downtime and maintain a positive user experience on the e-commerce platform. #IncidentResponse #Kubernetes #Ecommerce #ContainerEngineer

Monitoring Priorities

1. Node Health:

  • Monitor node health to confirm that every node in the cluster is Ready and running smoothly.

2. Cluster CPU/Memory Capacity:

  • Keep an eye on cluster-wide CPU and memory capacity to prevent resource bottlenecks.

3. Pod Health Checks:

  • Implement health checks for pods to detect issues and confirm the pods are in a healthy state.

4. Networking:

  • Monitor network traffic and connectivity so that communication between pods and services stays reliable.

5. Application Logs:

  • Collect and analyze application logs for insights into app behavior and potential issues.
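As a small illustration of the node health priority, the sketch below lists nodes that are not in a plain Ready state. The function name not_ready_nodes is hypothetical, and the parsing assumes the default column layout of kubectl get nodes (NAME first, STATUS second).

```shell
#!/bin/sh
# not_ready_nodes is a hypothetical helper.
# Prints the name of every node whose STATUS column is not exactly "Ready".
# Cordoned nodes report "Ready,SchedulingDisabled" and are listed as well.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}
```

An empty result means every node reports Ready. For the CPU and memory priority, kubectl top nodes gives a quick view, assuming a metrics source such as metrics-server is installed in the cluster.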

Objectives:

1. Identify Pod Configuration Error:

  • Identify the error within the pod's configuration causing the app malfunction.

2. Update Pod Configuration:

  • Revise the pod's configuration to bring the app back to its expected, functioning state.

Monitoring the cluster and fixing configuration issues are both key to maintaining a healthy, operational Kubernetes environment.

Observe the worker nodes.

Observe the running workloads.

Get more information about the leaderboard pod.

Check the logs for the 'query-app' container.

Someone made a typo in the container's command ('ech' instead of 'echo'), which caused the error.
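The diagnosis steps above map onto standard kubectl commands. The sketch below is one plausible way to run them in order. The function name diagnose_pod is hypothetical, while the pod name leaderboard and the container name query-app come from this walkthrough.

```shell
#!/bin/sh
# diagnose_pod is a hypothetical helper that runs read-only checks.
#   $1: pod name       (leaderboard in this walkthrough)
#   $2: container name (query-app in this walkthrough)
diagnose_pod() {
  pod="$1"
  container="$2"
  kubectl get nodes                     # observe the worker nodes
  kubectl get pods                      # observe the running workloads
  kubectl describe pod "$pod"           # container state and recent events
  kubectl logs "$pod" -c "$container"   # output of the failing container
}
```

Run it as diagnose_pod leaderboard query-app. None of these commands change cluster state, so they are safe to repeat while you narrow down the cause.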

2. Update Pod Configuration:

  • Revise the pod's configuration to bring the app back to its expected, functioning state.

Export the leaderboard pod's configuration to a leaderboard.yaml file:

kubectl get pod leaderboard -o yaml > leaderboard.yaml

Open the file:

vim leaderboard.yaml

Edit the command to be echo instead of ech.

Save and exit the file by pressing Escape followed by :wq.

Attempt to update the pod:

kubectl apply -f leaderboard.yaml

The command field of a running pod can't be updated, so you'll see an error instead.

Delete the pod:

kubectl delete pod leaderboard

Confirm it's gone:

kubectl get pods

Re-create the pod:

kubectl apply -f leaderboard.yaml

Confirm it exists:

kubectl get pods

Check the logs for the updated query-app container:

kubectl logs leaderboard -c query-app

Get the pod description again:

kubectl describe pod leaderboard
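If you prefer to script the fix instead of editing in vim, the steps above can be sketched as two small helpers. Both function names are hypothetical. The sed pattern assumes the typo appears as a standalone ech token in the exported manifest, so review the edited file before re-creating the pod.

```shell
#!/bin/sh
# fix_ech_typo is a hypothetical helper.
# Rewrites the standalone token "ech" to "echo" in the given file.
# It writes to a temp file because in-place editing (sed -i) has
# different syntax in GNU and BSD sed.
fix_ech_typo() {
  file="$1"
  sed -E 's/(^|[^[:alnum:]_])ech([^[:alnum:]_]|$)/\1echo\2/g' "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# recreate_pod is a hypothetical helper.
# The command of a running pod cannot be changed in place, so the fix
# is applied by deleting the pod and creating it again from the file.
recreate_pod() {
  pod="$1"
  file="$2"
  kubectl delete pod "$pod" &&
    kubectl apply -f "$file" &&
    kubectl get pod "$pod"
}
```

A typical sequence would be fix_ech_typo leaderboard.yaml followed by recreate_pod leaderboard leaderboard.yaml, then the same log and describe checks shown above.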

#KubernetesMonitoring #PodConfiguration #AppMaintenance
