# O&M Kubernetes Project
## Table of Contents
- Introduction
- Architecture
- Project Structure
- Prerequisites
- Installation
- Usage
- Components in Detail
- Configurations
- Troubleshooting
- Contribution
- License
## Introduction
O&M Kubernetes is a complete solution for implementing a modern observability and monitoring stack in Kubernetes environments. The project automates the deployment and management of an integrated suite of tools for monitoring, logging, tracing, and alerting, providing comprehensive visibility into infrastructure and applications.
The solution is designed around the three pillars of observability:
- Metrics: Collection and visualization of metrics with Prometheus and Grafana
- Logs: Aggregation and analysis of logs with Loki and Promtail
- Traces: Distributed tracing with Tempo
Additionally, the stack includes monitoring of external endpoints via Blackbox Exporter and advanced alert management through Alertmanager, with direct integration to webhooks (such as Discord).
## Architecture

### Components

The observability and monitoring stack consists of the following main components:
- OpenTelemetry Collector: Collects, processes, and exports telemetry data
- Prometheus: Time-series monitoring and alerting system
- Alertmanager: Alert and notification management
- Loki: Log aggregation system inspired by Prometheus
- Grafana: Visualization and analytics platform
- Promtail: Agent that sends logs to Loki
- Tempo: Distributed tracing system
- Blackbox Exporter: Monitoring of external endpoints via HTTP, HTTPS, DNS, TCP, and ICMP
### Data Flow

```
                ┌─────────────┐
                │ Applications│
                └──────┬──────┘
                       │
                       ▼
           ┌───────────────────────┐
           │     OpenTelemetry     │
           │       Collector       │
           └───┬───────┬───────┬───┘
               │       │       │
      ┌────────┘       │       └──────────┐
      │                │                  │
      ▼                ▼                  ▼
┌────────────┐  ┌─────────────┐     ┌─────────────┐
│ Prometheus │  │    Loki     │     │    Tempo    │
│ (Metrics)  │  │   (Logs)    │     │  (Traces)   │
└──────┬─────┘  └──────┬──────┘     └──────┬──────┘
       │               │                   │
       └────────┬──────┴─────────┬─────────┘
                │                │
                ▼                ▼
         ┌─────────────┐  ┌─────────────┐
         │   Grafana   │  │ Alertmanager│
         │  (Visual)   │  │  (Alerts)   │
         └─────────────┘  └─────────────┘
```
## Project Structure

```
observability-monitoring-kubernetes/
├── k8s/
│   ├── configmaps.yaml    # Configurations for all components
│   ├── deployments.yaml   # Kubernetes deployments for each service
│   ├── namespace.yaml     # Dedicated namespace definition
│   └── services.yaml      # Kubernetes service definitions
├── script.sh              # Stack management script
├── LICENSE                # License file (GNU GPL v3)
└── README.md              # This documentation
```
## Prerequisites

- **Kubernetes Cluster**: A functional Kubernetes cluster (Minikube, Kind, EKS, GKE, AKS, etc.)
- **kubectl**: The Kubernetes command-line tool (v1.20+)
  - Installation: https://kubernetes.io/docs/tasks/tools/
  - A correctly configured `kubeconfig` pointing to the desired cluster
- **Permissions**: Access to create and modify resources in the cluster (namespaces, deployments, services, configmaps)
- **Recommended Resources**:
  - At least 4 GB of available RAM
  - At least 2 vCPUs
  - At least 10 GB of disk space
## Installation

1. Clone the repository:

   ```shell
   gh repo clone gabrielldn/observability-monitoring-kubernetes
   cd observability-monitoring-kubernetes
   ```

2. Verify that kubectl is correctly configured:

   ```shell
   kubectl cluster-info
   ```

3. Grant execution permission to the script:

   ```shell
   chmod +x script.sh
   ```
## Usage

The `script.sh` script is the central point for managing the entire observability and monitoring stack.

### Deploy the Stack

To deploy the entire observability and monitoring stack:

```shell
./script.sh deploy
```

This command will:

1. Create the `observability` namespace
2. Apply all ConfigMaps with configurations
3. Deploy all components (Deployments)
4. Configure the Services for communication between components
### Check Status

To check the status of all components:

```shell
./script.sh status
```

This command will show:

- Status of all pods in the namespace
- Status of all services in the namespace
### View Logs

To view logs, there are several options:

```shell
# View logs of all pods
./script.sh logs

# View logs of a specific component
./script.sh logs grafana

# View logs in real-time (follow)
./script.sh logs loki -f

# View logs of a component in real-time
./script.sh logs prometheus -f
```

### Update Stack

To update the stack after configuration changes:

```shell
./script.sh update
```

### Remove Stack

To completely remove the stack from the cluster:

```shell
./script.sh destroy
```

This command removes all resources in the following order:

1. Services
2. Deployments
3. ConfigMaps
4. Namespace
## Components in Detail
### OpenTelemetry Collector

**Function**: Collects, processes, and exports telemetry data (metrics, logs, and traces).

**Features**:

- Supports gRPC (port 4317) and HTTP (port 4318) protocols
- Configured to send:
  - Metrics to Prometheus
  - Traces to Tempo
  - Logs to Loki
- Processors configured for data enrichment

**Access**: Internally via `otelcollector:4317` or `otelcollector:4318`

**Configuration**: See `configmaps.yaml`, section `otel-collector-config`
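To make the pipeline wiring concrete, a minimal Collector configuration along these lines could look as follows. This is an illustrative sketch, not the deployed file: the exporter names and endpoints (the `prometheus` metrics port `8889`, the contrib `loki` exporter) are assumptions, and the actual configuration lives in `configmaps.yaml`.

```yaml
# Illustrative OpenTelemetry Collector configuration (sketch, not the deployed file).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}                # batches telemetry before export

exporters:
  prometheus:              # exposes collected metrics for Prometheus to scrape
    endpoint: 0.0.0.0:8889
  otlp/tempo:              # pushes traces to Tempo over OTLP gRPC
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:                    # pushes logs to Loki (contrib exporter)
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```

Each pipeline reuses the same OTLP receiver, so instrumented applications only need a single endpoint for all three signals.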
### Prometheus

**Function**: Time-series monitoring and alerting system.

**Features**:

- Scrape intervals configured to 15 seconds
- Collects metrics from all stack components
- Integrated with Blackbox Exporter for external monitoring
- Alert rules configured

**Access**: Internally via `prometheus:9090`

**Configuration**: See `configmaps.yaml`, section `prometheus-config`
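A 15-second scrape setup of this kind typically looks like the fragment below. It is a sketch only: the job names, rule-file path, and Collector metrics port are assumptions; the deployed file is the `prometheus-config` section of `configmaps.yaml`.

```yaml
# Illustrative prometheus.yml fragment (job names and paths are assumptions).
global:
  scrape_interval: 15s        # matches the interval described above
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: prometheus      # self-monitoring
    static_configs:
      - targets: ['localhost:9090']
  - job_name: otel-collector  # metrics exposed by the Collector's prometheus exporter
    static_configs:
      - targets: ['otelcollector:8889']
```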
### Alertmanager

**Function**: Manages alerts generated by Prometheus, including silencing, inhibition, and grouping.

**Features**:

- Configured to send alerts to a Discord webhook
- Alert grouping by `alert` and `job`
- Sends alert resolution notifications
- Repeat interval configured to 30 minutes

**Access**: Internally via `alertmanager:9093`

**Configuration**: See `configmaps.yaml`, section `alertmanager-config`
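A 30-minute repeat interval with grouping and a webhook receiver is usually expressed like this. It is a sketch: the webhook URL is a placeholder (the real one is set in `alertmanager-config`), and the grouping labels here use `alertname`, the standard Prometheus label name.

```yaml
# Illustrative alertmanager.yml fragment (webhook URL is a placeholder).
route:
  receiver: discord-webhook
  group_by: ['alertname', 'job']
  repeat_interval: 30m

receivers:
  - name: discord-webhook
    webhook_configs:
      - url: https://discord.com/api/webhooks/<id>/<token>
        send_resolved: true    # also notify when alerts resolve
```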
### Loki

**Function**: Log aggregation and query system.

**Features**:

- Simplified local storage
- Configurable log retention
- Integrated with Grafana for visualization
- Receives logs from Promtail and the OpenTelemetry Collector

**Access**: Internally via `loki:3100`

**Configuration**: See `configmaps.yaml`, section `loki-config`
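Configurable retention, for example, is usually tuned with a fragment along these lines. The values are illustrative and the exact keys depend on the Loki version; the deployed settings are in `loki-config`.

```yaml
# Illustrative Loki retention fragment (values are examples only).
limits_config:
  retention_period: 168h     # e.g. keep logs for 7 days

compactor:
  working_directory: /loki/compactor
  retention_enabled: true    # the compactor enforces the retention period
```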
### Grafana

**Function**: Visualization and analytics platform for metrics, logs, and traces.

**Features**:

- Pre-configured with datasources for Prometheus, Loki, and Tempo
- Default credentials: `admin`/`admin`
- Default theme set to "light"
- Correlation between metrics, logs, and traces

**Access**: Internally via `grafana:3000`

**Configuration**: See `configmaps.yaml`, section `grafana-datasource`
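The pre-configured datasources are typically provisioned with a file like the one below, using the service names from this stack. This is a sketch; the actual file is the `grafana-datasource` section of `configmaps.yaml`.

```yaml
# Illustrative Grafana datasource provisioning (sketch, not the deployed file).
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```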
### Promtail

**Function**: Agent that collects logs and sends them to Loki.

**Features**:

- Automatic discovery of Docker containers with the label `logging=promtail`
- Support for multi-line and JSON formats
- Addition of labels based on container metadata

**Access**: Internally via `promtail:9080`

**Configuration**: See `configmaps.yaml`, section `promtail-config`
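Discovery of containers labeled `logging=promtail` is commonly done with Promtail's Docker service discovery, roughly as below. This is a sketch: the socket path and relabeling rules are assumptions, and the deployed file is in `promtail-config`.

```yaml
# Illustrative promtail.yml fragment (socket path and labels are assumptions).
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        filters:
          - name: label
            values: ["logging=promtail"]   # only containers with this label
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container            # label logs with the container name
```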
### Tempo

**Function**: Backend for storing and querying distributed tracing data.

**Features**:

- Supports OTLP, Jaeger, and other tracing formats
- Integration with Prometheus for derived metrics
- Integration with Grafana for visualization
- Integration with Loki to correlate traces with logs

**Access**: Internally via `tempo:3200`

**Configuration**: See `configmaps.yaml`, section `tempo-config`
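Accepting OTLP and Jaeger traffic is configured in Tempo's distributor receivers, roughly like this. It is a sketch: the storage path and protocol choices are assumptions, and the deployed file is in `tempo-config`.

```yaml
# Illustrative tempo.yml fragment (paths and protocols are assumptions).
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
        http:
    jaeger:
      protocols:
        thrift_http:

storage:
  trace:
    backend: local           # simple local storage, as in the rest of the stack
    local:
      path: /var/tempo/traces
```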
### Blackbox Exporter

**Function**: Monitoring of external endpoints via HTTP, HTTPS, DNS, TCP, and ICMP.

**Features**:

- Support for HTTP, TCP, and ICMP probes
- Monitoring of external site status
- Used by Prometheus for availability checks

**Access**: Internally via `blackbox-exporter:9115`

**Configuration**: See `configmaps.yaml`, section `blackbox-config`
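The probe modules are defined in a `blackbox.yml`-style file, typically along these lines. The module names here are conventional examples, not the confirmed ones; the deployed modules are in `blackbox-config`.

```yaml
# Illustrative Blackbox Exporter modules (names are conventional, not confirmed).
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
  tcp_connect:
    prober: tcp
  icmp:
    prober: icmp
```

Prometheus then scrapes the exporter's `/probe` endpoint at `blackbox-exporter:9115` with a `module` parameter, relabeling each external URL into the `target` query parameter.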
## Configurations

### Customization

To customize the stack:

1. **Adjust ConfigMaps**: Modify the configuration files in `k8s/configmaps.yaml`
2. **Adjust Resources**: Change resource limits in `k8s/deployments.yaml`
3. **Modify Endpoints**: Adjust the endpoints monitored by Blackbox in `k8s/configmaps.yaml`
4. **After changes**: Run `./script.sh update` to apply the modifications

### Alerts

The alert system is configured with:

- **Alert Rules**: Defined in `alert-rules.yml` within the Prometheus ConfigMap
- **Notifications**: Configured for Discord in `alertmanager.yml`
- **Customization**:
  - Modify alert rules in `configmaps.yaml`, section `prometheus-config`
  - Adjust webhooks in `configmaps.yaml`, section `alertmanager-config`
### Integrations
The stack comes pre-configured for integration with:
- Discord: For alert notifications
- Instrumented Applications: Via OpenTelemetry Collector
- Kubernetes: Monitoring of cluster resources
To add new integrations:
- Add new receivers in the OpenTelemetry Collector
- Configure new alertmanagers in Prometheus
- Add new datasources in Grafana
## Troubleshooting

Common issues and solutions:

1. Pods in CrashLoopBackOff state:

   ```shell
   # Check the logs of the problematic pod
   kubectl logs -n observability <pod-name>

   # Check pod events
   kubectl describe pod -n observability <pod-name>
   ```

2. Configuration issues:

   ```shell
   # Check if ConfigMaps were created correctly
   kubectl get configmaps -n observability

   # Inspect a specific ConfigMap
   kubectl get configmap -n observability <configmap-name> -o yaml
   ```

3. Inaccessible services:

   ```shell
   # Check if endpoints are correct
   kubectl get endpoints -n observability
   ```

4. Check connections between components:

   ```shell
   # Use kubectl exec to test connections between pods
   kubectl exec -it -n observability <pod-name> -- wget -O- <service>:<port>
   ```
## Contribution

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a branch for your feature (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -m 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Open a Pull Request
## License
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
Developed with ❤️ to simplify the implementation of observability and monitoring in Kubernetes environments.