# WellD challenge report

## What I did

- **Analyzed the existing Spring Boot/Thymeleaf order management application** for missing metrics.
- **Added and exposed additional custom metrics** using Micrometer and Spring Boot Actuator, making them available at `/actuator/prometheus` for Prometheus scraping.
- **Implemented the following custom metrics:**
  - `orders_deleted_total`: Total orders deleted.
  - `orders_created_per_product_total` (with `product` tag): Orders created per product.
  - `order_quantity_average`: Distribution summary for order quantities (enables average, max, count, and sum through series such as `order_quantity_average_sum` and `order_quantity_average_count`).
  - `log_events_total` (with `level` tag): Counts of INFO and ERROR log events.
- **Ensured JVM and HTTP metrics are available** (latency, memory, CPU, threads, GC, etc.) via Actuator.
- **Created a Grafana dashboard** (see `grafana/monitoring/`) with panels for all key metrics and business KPIs.

---

## How to use the dashboard

1. Build the source code with `mvn clean package`.
2. Start the stack: `docker compose up -d`.
3. Access Prometheus at http://localhost:9090.
4. Access Grafana at http://localhost:3000 (default login: `admin:admin`).
5. Import the JSON dashboard provided in this repository under `grafana/monitoring/`.
6. Interact with the application at http://localhost:8080/web/orders; the metrics will update in real time on the dashboard.

---

## Which panels were created and their purpose

- **Total Orders Created:** Visualizes the cumulative number of orders created (`orders_created_total`).
- **Total Orders Deleted:** Shows the number of orders deleted (`orders_deleted_total`).
- **Orders per Product:** Bar chart/table using `orders_created_per_product_total{product="..."}` for business insight.
- **Average Quantity per Order:** Displays the average order quantity using `order_quantity_average_sum / order_quantity_average_count`.
- **HTTP Request Latency (p95, p99):** Shows high-percentile request durations per endpoint using `histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))` and `histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))`.
- **JVM Memory & CPU Usage:** Monitors resource usage with `jvm_memory_used_bytes`, `process_cpu_usage`, etc.
- **Log Event Counters:** Visualizes counts of INFO and ERROR logs (`log_events_total{level="INFO"}` and `log_events_total{level="ERROR"}`).
- **Other Stability Metrics:** Panels for thread count, GC pause time, and queue size.

## Proposed alerts and thresholds

1. **High HTTP Latency**, threshold of `p95 > 1s for 5m`, to detect slow endpoints.
2. **High ERROR log rate**, threshold of `>5 errors/min`, might indicate bugs or failures.
3. **High JVM memory usage**, threshold of `>80% for 5m`, might help to prevent OOM errors.
4. **High CPU usage**, threshold of `>80% for 5m`, might help to detect resource exhaustion.

---

## Custom metrics and their purpose

- **orders_deleted_total:** Tracks deletions for auditing and anomaly detection.
- **orders_created_per_product_total (with optional `product` tag):** Enables product-level business insights and anomaly detection.
- **order_quantity_average:** Monitors average order size, useful for business KPIs and detecting outliers.
- **log_events_total (with `level` tag):** Provides visibility into application health and error rates.
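
To illustrate why the `order_quantity_average` panel divides `order_quantity_average_sum` by `order_quantity_average_count`, here is a minimal plain-Java sketch of a summary that keeps a running count, sum, and max. It is a stand-in written for this report only, not Micrometer's `DistributionSummary` and not the application's actual code; the class name `QuantitySummary` and its methods are hypothetical.

```java
import java.util.concurrent.atomic.DoubleAccumulator;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a distribution summary (assumption: not Micrometer's implementation).
final class QuantitySummary {
    private final LongAdder count = new LongAdder();     // plays the role of the *_count series
    private final DoubleAdder sum = new DoubleAdder();   // plays the role of the *_sum series
    private final DoubleAccumulator max = new DoubleAccumulator(Math::max, 0.0);

    // Record the quantity of one created order.
    void record(double quantity) {
        count.increment();
        sum.add(quantity);
        max.accumulate(quantity);
    }

    long count() { return count.sum(); }

    double sum() { return sum.sum(); }

    double max() { return max.get(); }

    // Same arithmetic as the dashboard panel: sum / count, guarded against an empty summary.
    double mean() {
        long n = count();
        return n == 0 ? 0.0 : sum() / n;
    }

    public static void main(String[] args) {
        QuantitySummary summary = new QuantitySummary();
        for (double quantity : new double[] {2, 4, 6}) {
            summary.record(quantity);
        }
        System.out.println("count=" + summary.count()
                + " sum=" + summary.sum()
                + " max=" + summary.max()
                + " mean=" + summary.mean());
    }
}
```

Because count and sum are cumulative, dividing the raw series gives the average over the whole process lifetime. If a windowed average is wanted instead, dividing `rate(order_quantity_average_sum[5m])` by `rate(order_quantity_average_count[5m])` is the usual PromQL form.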