# WellD challenge report

## What I did

- **Analyzed the existing Spring Boot/Thymeleaf order management application** to identify missing metrics.
- **Added and exposed additional custom metrics** using Micrometer and Spring Boot Actuator, making them available at `/actuator/prometheus` for Prometheus scraping.
- **Implemented the following custom metrics:**
  - `orders_deleted_total`: Total orders deleted.
  - `orders_created_per_product_total` (with `product` tag): Orders created per product.
  - `order_quantity_average`: Distribution summary for order quantities (enables average, max, count, and sum through series such as `order_quantity_average_sum` and `order_quantity_average_count`).
  - `log_events_total` (with `level` tag): Counts of INFO and ERROR log events.
- **Ensured JVM and HTTP metrics are available** (latency, memory, CPU, threads, GC, etc.) via Actuator.
- **Created a Grafana dashboard** (see `grafana/monitoring/`) with panels for all key metrics and business KPIs.

---

## How to use the dashboard

1. Build the source code with `mvn clean package`

2. Start the stack: `docker compose up -d`

3. Access Prometheus at http://localhost:9090

4. Access Grafana at http://localhost:3000 (default login: `admin:admin`)

5. Import the JSON dashboard provided in this repository under `grafana/monitoring/`

6. Interact with the application at http://localhost:8080/web/orders and the metrics will update in real time on the dashboard.

---

## Which panels were created and their purpose

- **Total Orders Created:**
  Visualizes the cumulative number of orders created (`orders_created_total`).
- **Total Orders Deleted:**
  Shows the number of orders deleted (`orders_deleted_total`).
- **Orders per Product:**
  Bar chart/table using `orders_created_per_product_total{product="..."}` for business insight.
- **Average Quantity per Order:**
  Displays the average order quantity using `order_quantity_average_sum / order_quantity_average_count`.
- **HTTP Request Latency (p95, p99):**
  Shows high-percentile request durations per endpoint using:
  `histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))`
  and
  `histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))`
- **JVM Memory & CPU Usage:**
  Monitors resource usage with `jvm_memory_used_bytes`, `process_cpu_usage`, etc.
- **Log Event Counters:**
  Visualizes counts of INFO and ERROR logs (`log_events_total{level="INFO"}` and `log_events_total{level="ERROR"}`).
- **Other Stability Metrics:**
  Panels for thread count, GC pause time, and queue size.

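The log and stability panels are described without their queries. As a rough sketch, they could be driven by queries along the following lines. The names `jvm_threads_live_threads`, `jvm_gc_pause_seconds_sum`, and `jvm_gc_pause_seconds_count` are the default Micrometer JVM series and are an assumption here, since this report does not confirm which series the dashboard uses.

```promql
# Log events per minute, split by level (counter defined in this report).
sum by (level) (rate(log_events_total[5m])) * 60

# Live JVM threads (assumed default Micrometer metric name).
jvm_threads_live_threads

# Mean GC pause duration in seconds over the last 5 minutes
# (assumed default Micrometer timer series).
sum(rate(jvm_gc_pause_seconds_sum[5m])) / sum(rate(jvm_gc_pause_seconds_count[5m]))
```

Plotting `rate()` instead of the raw counter keeps the log panel readable across application restarts, because `rate()` compensates for counter resets.
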
## Proposed alerts and thresholds

1. **High HTTP Latency**, threshold of `p95 > 1s for 5m`, to detect slow endpoints

2. **High ERROR log rate**, threshold of `>5 errors/min`, might indicate bugs or failures

3. **High JVM memory usage**, threshold of `>80% for 5m`, might help to prevent OOM errors

4. **High CPU usage**, threshold of `>80% for 5m`, might help to detect resource exhaustion

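The thresholds listed in this section could be expressed in PromQL roughly as follows. This is a sketch, not the rules shipped with the repository. It assumes that histogram buckets are enabled for `http_server_requests_seconds`, that the standard Micrometer series `jvm_memory_max_bytes` and its `area` label are exported, that "JVM memory" means heap, and that each `for 5m` condition is set as the `for` duration of the alerting rule rather than inside the expression.

```promql
# 1. High HTTP latency: p95 above 1s per endpoint.
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri)) > 1

# 2. High ERROR log rate: more than 5 ERROR events per minute.
sum(rate(log_events_total{level="ERROR"}[5m])) * 60 > 5

# 3. High JVM memory usage: approximate heap utilisation above 80%
#    (assumes jvm_memory_max_bytes and the area label are present).
sum(jvm_memory_used_bytes{area="heap"}) / sum(jvm_memory_max_bytes{area="heap"}) > 0.8

# 4. High CPU usage: process CPU above 80% (process_cpu_usage is a 0-1 ratio).
process_cpu_usage > 0.8
```
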
---

## Custom metrics and their purpose

- **orders_deleted_total:** Tracks deletions for auditing and anomaly detection.
- **orders_created_per_product_total (with optional `product` tag):** Enables product-level business insights and anomaly detection.
- **order_quantity_average:** Monitors average order size, useful for business KPIs and detecting outliers.
- **log_events_total (with `level` tag):** Provides visibility into application health and error rates.

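To illustrate the anomaly-detection use cases mentioned for these counters, they can be combined in PromQL. The queries below are only a sketch; the one-hour window is an arbitrary choice and is not taken from the dashboard.

```promql
# Orders deleted relative to orders created over the last hour;
# a sudden rise may point to accidental or abusive deletions.
sum(increase(orders_deleted_total[1h])) / sum(increase(orders_created_total[1h]))

# Orders created per product over the last hour, to spot products whose
# volume deviates from the usual pattern.
sum by (product) (increase(orders_created_per_product_total[1h]))
```

When no orders were created in the window, the first query divides by zero and evaluates to `NaN` or `+Inf`, so a panel or alert built on it should account for that case.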