WellD-challenge/README.md
domenico edb7a69bde commit
2025-10-24 20:55:45 +02:00

# WellD challenge report
## What I did
- **Analyzed the existing Spring Boot/Thymeleaf order management application** for missing metrics.
- **Added and exposed additional custom metrics** using Micrometer and Spring Boot Actuator, making them available at `/actuator/prometheus` for Prometheus scraping.
- **Implemented the following custom metrics:**
  - `orders_deleted_total`: Total orders deleted.
  - `orders_created_per_product_total` (with `product` tag): Orders created per product.
  - `order_quantity_average`: Distribution summary for order quantities (enables average, max, count, and sum through series such as `order_quantity_average_sum` and `order_quantity_average_count`).
  - `log_events_total` (with `level` tag): Counts of INFO and ERROR log events.
- **Ensured JVM and HTTP metrics are available** (latency, memory, CPU, threads, GC, etc.) via Actuator.
- **Created a Grafana dashboard** (see `grafana/monitoring/`) with panels for all key metrics and business KPIs.
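
The custom metrics listed above could be registered with Micrometer roughly as follows. This is a minimal sketch, not the code from the repository: the class name `OrderMetricsSketch`, its method names, and the dotted meter names are assumptions chosen so that Micrometer's Prometheus naming convention yields the metric names listed above. In the Spring Boot application the `MeterRegistry` would normally be injected; `SimpleMeterRegistry` is used here only so the sketch runs on its own with `micrometer-core` on the classpath.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Hypothetical helper; names are illustrative, not taken from the repository.
class OrderMetricsSketch {
    private final MeterRegistry registry;
    private final Counter ordersDeleted;
    private final DistributionSummary orderQuantity;

    OrderMetricsSketch(MeterRegistry registry) {
        this.registry = registry;
        // Scraped by Prometheus as orders_deleted_total.
        this.ordersDeleted = Counter.builder("orders.deleted")
                .description("Total orders deleted")
                .register(registry);
        // Scraped as order_quantity_average_count, _sum and _max.
        this.orderQuantity = DistributionSummary.builder("order.quantity.average")
                .description("Distribution of order quantities")
                .register(registry);
    }

    void recordOrderCreated(String product, int quantity) {
        // Scraped as orders_created_per_product_total{product="..."}.
        // register() returns the existing counter when name and tags match.
        Counter.builder("orders.created.per.product")
                .tag("product", product)
                .register(registry)
                .increment();
        orderQuantity.record(quantity);
    }

    void recordOrderDeleted() {
        ordersDeleted.increment();
    }

    void recordLogEvent(String level) {
        // Scraped as log_events_total{level="..."}.
        registry.counter("log.events", "level", level).increment();
    }

    double averageQuantity() {
        // Same arithmetic as order_quantity_average_sum / order_quantity_average_count.
        return orderQuantity.mean();
    }

    public static void main(String[] args) {
        OrderMetricsSketch metrics = new OrderMetricsSketch(new SimpleMeterRegistry());
        metrics.recordOrderCreated("widget", 2);
        metrics.recordOrderCreated("widget", 4);
        metrics.recordOrderDeleted();
        metrics.recordLogEvent("ERROR");
        System.out.println("Average quantity: " + metrics.averageQuantity());
    }
}
```

Tagging the per-product counter at record time keeps one time series per product. That works well while the set of products is small, but it would need a cap if product names were unbounded.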
---
## How to use the dashboard
1. Build the source code with `mvn clean package`
2. Start the stack: `docker compose up -d`
3. Access Prometheus at http://localhost:9090
4. Access Grafana at http://localhost:3000 (default login: `admin:admin`)
5. Import the JSON dashboard provided in this repository under `grafana/monitoring/`
6. Interact with the application at http://localhost:8080/web/orders; the metrics on the dashboard update in near real time, as Prometheus scrapes them.
---
## Which panels were created and their purpose
- **Total Orders Created:**
Visualizes the cumulative number of orders created (`orders_created_total`).
- **Total Orders Deleted:**
Shows the number of orders deleted (`orders_deleted_total`).
- **Orders per Product:**
Bar chart/table using `orders_created_per_product_total{product="..."}` to compare order volume across products.
- **Average Quantity per Order:**
Displays the average order quantity using `order_quantity_average_sum / order_quantity_average_count`.
- **HTTP Request Latency (p95, p99):**
Shows high-percentile request durations per endpoint using:
`histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))`
and
`histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))`
- **JVM Memory & CPU Usage:**
Monitors resource usage with `jvm_memory_used_bytes`, `process_cpu_usage`, etc.
- **Log Event Counters:**
Visualizes counts of INFO and ERROR logs (`log_events_total{level="INFO"}` and `log_events_total{level="ERROR"}`).
- **Other Stability Metrics:**
Panels for thread count, GC pause time, and queue size.
---
## Proposed alerts and thresholds
1. **High HTTP latency**, threshold of `p95 > 1s for 5m`, to detect slow endpoints
2. **High ERROR log rate**, threshold of `>5 errors/min`, which may indicate bugs or failures
3. **High JVM memory usage**, threshold of `>80% for 5m`, to catch memory pressure before it leads to OOM errors
4. **High CPU usage**, threshold of `>80% for 5m`, to detect resource exhaustion
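
The thresholds above could be expressed as Prometheus alerting rules along these lines. This is a sketch only: the group name, alert names, `severity` labels, and annotations are hypothetical, and reading the 80% memory threshold as heap usage (`area="heap"`) is an assumption rather than something stated in this report.

```yaml
groups:
  - name: order-app-alerts   # hypothetical group name
    rules:
      - alert: HighHttpLatencyP95
        # p95 request duration per endpoint above 1 second
        expr: histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s on {{ $labels.uri }}"

      - alert: HighErrorLogRate
        # per-second rate scaled to errors per minute
        expr: sum(rate(log_events_total{level="ERROR"}[5m])) * 60 > 5
        labels:
          severity: warning

      - alert: HighJvmMemoryUsage
        # assumption: heap only; pools that report no maximum (-1) slightly skew the ratio
        expr: sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) > 0.8
        for: 5m
        labels:
          severity: warning

      - alert: HighCpuUsage
        # process_cpu_usage is reported as a fraction between 0 and 1
        expr: process_cpu_usage > 0.8
        for: 5m
        labels:
          severity: warning
```

The `for: 5m` clause keeps an alert pending until the condition has held for five minutes, which matches the "for 5m" wording in the thresholds.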
---
## Custom metrics and their purpose
- **`orders_deleted_total`:** Tracks deletions for auditing and anomaly detection.
- **`orders_created_per_product_total` (with optional `product` tag):** Enables product-level business insights and anomaly detection.
- **`order_quantity_average`:** Monitors average order size, useful for business KPIs and detecting outliers.
- **`log_events_total` (with `level` tag):** Provides visibility into application health and error rates.