Kubernetes Monitoring

Prometheus & Grafana
Visualize Cluster Information in Dashboards:
- Use Grafana to create dashboards that display essential cluster metrics, such as CPU and memory usage, node health, and pod status.
Pull Custom Application Logs via Sidecar:
- Implement a sidecar container in your pods to extract custom application logs and make them available for monitoring.
Create Dashboards as Code for Easy Editing:
- Opt for Infrastructure as Code (IaC) to define and manage your Grafana dashboards. This ensures easy editing, version control, and reproducibility of your monitoring setup.
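As a sketch of the sidecar approach above: the app writes its log to a shared volume and a sidecar tails it to stdout, where the cluster's log pipeline can collect it. The container names, image tags, and log path here are illustrative assumptions, not taken from any real platform.

```yaml
# Hypothetical pod with a logging sidecar. The app writes to a shared
# emptyDir volume; the sidecar tails the file to stdout so standard
# cluster log collection can pick it up.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar   # illustrative name
spec:
  containers:
    - name: app
      image: my-app:1.0        # placeholder image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-sidecar
      image: busybox:1.36
      command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs
      emptyDir: {}
```

An emptyDir volume is enough here because the logs only need to outlive the app container, not the pod.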
Embrace Prometheus and Grafana to gain valuable insights into your Kubernetes cluster's health and performance. #KubernetesMonitoring #Prometheus #Grafana
Incident Response Scenario: Leaderboard Service Outage
Scenario: You are a Container Engineer responsible for maintaining an e-commerce platform's critical leaderboard service. This service displays the top-selling products on the platform. Suddenly, you receive an urgent alert notifying you that the leaderboard service is down, and it's impacting the user experience. Your task is to quickly respond to this incident, diagnose the issue, and restore the service to its normal working state.
Role: Container Engineer | Platform: E-commerce | Service: Leaderboard
Real-time Actions:
- Immediate Alert Acknowledgment: You swiftly acknowledge the alert, notifying your team that you're diving into the incident.
- Monitoring and Logging Tools: Access real-time data using monitoring tools like Prometheus and Grafana to gauge cluster health and resource use.
- Kubernetes Cluster Status Check: Use kubectl to confirm the Kubernetes cluster status. Ensure it's not a global cluster issue; inspect nodes and control plane components.
- Leaderboard Service Logs: Check service logs for error messages with kubectl logs and evaluate recent events.
- Pod Inspection: Run kubectl get pods to list all pods, including the leaderboard pod. Spot the "Failed" status.
- Troubleshooting the Pod: Use kubectl describe pod <pod-name> to uncover details about the failure, including resource and mounting issues.
- Resource Check: Review resource requests and limits in the pod configuration to avoid resource starvation.
- Rolling Restarts: If issues are found, trigger a rolling restart by updating the Deployment to create fresh pods.
- Health Checks: Ensure liveness and readiness probes in the deployment are correctly configured.
- Integration and Network Issues: Investigate integration and network problems within the cluster.
- Database Connectivity: Verify the leaderboard service's ability to connect to the database, essential for fetching sales data.
- Docker Image: Confirm availability and correctness of the Docker image in the deployment configuration.
- Service Checks: Confirm the Kubernetes service correctly routes traffic and is reachable.
- Backup and Rollback Plan: Maintain a rollback plan in case of prolonged issues. Consider implementing a backup mechanism for a default leaderboard.
- Documentation and Communication: Document all actions and updates, keeping the team and stakeholders informed.
- Resolution and Verification: After addressing the root cause, verify that the leaderboard service is operational and meets performance expectations.
- Post-Incident Analysis: Conduct a post-incident analysis to understand the cause, document lessons learned, and implement preventive measures.
In this real-time scenario, swift response, a systematic troubleshooting approach, and effective communication are vital to minimize downtime and maintain a positive user experience on the e-commerce platform. #IncidentResponse #Kubernetes #Ecommerce #ContainerEngineer
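The resource-check and health-check steps from the action list can be sketched as a deployment fragment. The image name, ports, endpoint paths, and thresholds below are assumptions for illustration, not the platform's actual configuration.

```yaml
# Illustrative leaderboard Deployment: resource requests/limits guard
# against starvation, and liveness/readiness probes let Kubernetes
# restart unhealthy pods and keep unready ones out of service rotation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: leaderboard
spec:
  replicas: 2
  selector:
    matchLabels:
      app: leaderboard
  template:
    metadata:
      labels:
        app: leaderboard
    spec:
      containers:
        - name: leaderboard
          image: leaderboard:1.0     # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /healthz         # assumed health endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /ready           # assumed readiness endpoint
              port: 8080
            periodSeconds: 5
```

Updating this Deployment (for example, changing the image tag) is also what triggers the rolling restart mentioned above.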
Monitoring Priorities
1. Node Health:
- Monitor node health to ensure each node in the cluster is running smoothly.
2. Cluster CPU/Memory Capacity:
- Keep an eye on cluster-wide CPU and memory capacity to prevent resource bottlenecks.
3. Pod Health Checks:
- Implement health checks for pods to detect issues and ensure they're in a healthy state.
4. Networking:
- Monitor network traffic and connectivity to guarantee seamless communication between pods and services.
5. Application Logs:
- Collect and analyze application logs for insights into app behavior and potential issues.
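The first two priorities can be encoded as Prometheus alerting rules. This is a sketch: the group name, scrape job label, thresholds, and wait durations are assumptions you would tune for your cluster.

```yaml
# Illustrative Prometheus alerting rules for node health and
# cluster-wide memory capacity.
groups:
  - name: cluster-health                        # assumed group name
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0      # assumes a node-exporter scrape job
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"
      - alert: HighClusterMemory
        expr: >
          sum(container_memory_working_set_bytes)
          / sum(machine_memory_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cluster memory usage above 90%"
```

The `for:` clauses keep transient blips (a single missed scrape, a short memory spike) from paging anyone.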
Objectives:
1. Identify Pod Configuration Error:
- Identify the error within the pod's configuration causing the app malfunction.
2. Update Pod Configuration:
- Revise the pod's configuration to bring the app back to its expected, functioning state.
Incorporating monitoring and addressing configuration issues are key elements of maintaining a healthy and operational Kubernetes environment.
Observe the worker nodes:

kubectl get nodes

Observe the running workloads:

kubectl get pods

Get more information about the leaderboard pod:

kubectl describe pod leaderboard

Check the logs for the 'query-app' container:

kubectl logs leaderboard -c query-app

Someone had a typo in the command, 'ech' instead of 'echo', which caused the error.
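The exact pod spec isn't shown in the lab output, so the fragment below is a hypothetical reconstruction of what the broken query-app container might look like, with the 'ech' typo in place:

```yaml
# Hypothetical query-app container spec. 'ech' is not a valid command,
# so the shell loop fails and the container reports an error.
- name: query-app
  image: busybox:1.36        # placeholder image
  command:
    - sh
    - -c
    - while true; do ech "top products"; sleep 5; done   # 'ech' should be 'echo'
```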

2. Update Pod Configuration:
- Revise the pod's configuration to bring the app back to its expected, functioning state.
Export the leaderboard pod configuration into a leaderboard.yaml file:

kubectl get pod leaderboard -o yaml > leaderboard.yaml

Open the file:

vim leaderboard.yaml

Edit the command to be echo instead of ech. Save and exit the file by pressing Escape, then typing :wq and pressing Enter.

Attempt to update the pod:

kubectl apply -f leaderboard.yaml

We can't update the command field of a running pod, so you'll see an error instead.

Delete the pod:

kubectl delete pod leaderboard

Confirm it's gone:

kubectl get pods

Re-create the pod:

kubectl apply -f leaderboard.yaml

Confirm it exists:

kubectl get pods

Check the logs for the updated query-app container:

kubectl logs leaderboard -c query-app

Get the pod description again:

kubectl describe pod leaderboard


#KubernetesMonitoring #PodConfiguration #AppMaintenance





