WellD challenge report

What I did

Analyzed the existing Spring Boot/Thymeleaf order management application for missing metrics.
Added and exposed additional custom metrics using Micrometer and Spring Boot Actuator, making them available at /actuator/prometheus for Prometheus scraping.
Implemented the following custom metrics:
- orders_deleted_total: Total orders deleted.
- orders_created_per_product_total (with product tag): Orders created per product.
- order_quantity_average: Distribution summary for order quantities (exposes order_quantity_average_count, order_quantity_average_sum, and order_quantity_average_max, from which the average order quantity can be derived).
- log_events_total (with level tag): Counts of INFO and ERROR log events.
Ensured JVM and HTTP metrics are available (latency, memory, CPU, threads, GC, etc.) via Actuator.
Created a Grafana dashboard (see grafana/monitoring/) with panels for all key metrics and business KPIs.
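
To make the role of the _sum and _count series of order_quantity_average concrete, here is a minimal plain-Java stand-in for a distribution summary. It is a sketch under assumptions, not Micrometer's implementation and not the code in this repository: the class name OrderQuantitySummarySketch and its methods are hypothetical, and it only illustrates how count, sum, and max relate to the average.

```java
import java.util.concurrent.atomic.DoubleAccumulator;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a distribution summary; NOT Micrometer's class.
final class OrderQuantitySummarySketch {
    private final LongAdder count = new LongAdder();
    private final DoubleAdder sum = new DoubleAdder();
    // Tracks the largest recorded value; 0.0 until something is recorded.
    private final DoubleAccumulator max = new DoubleAccumulator(Math::max, 0.0);

    // Record the quantity of one order.
    void record(double quantity) {
        count.increment();
        sum.add(quantity);
        max.accumulate(quantity);
    }

    long count() { return count.sum(); }   // analogous to the _count series
    double sum() { return sum.sum(); }     // analogous to the _sum series
    double max() { return max.get(); }     // analogous to the _max series

    // The average is not stored; it is derived as sum / count.
    double mean() {
        long n = count();
        return n == 0 ? 0.0 : sum() / n;
    }

    public static void main(String[] args) {
        OrderQuantitySummarySketch summary = new OrderQuantitySummarySketch();
        summary.record(2);
        summary.record(4);
        summary.record(9);
        System.out.println("count=" + summary.count()
                + " sum=" + summary.sum()
                + " max=" + summary.max()
                + " mean=" + summary.mean());
    }
}
```

Because only the running count and sum are kept, the average has to be computed at query time, which is why a dashboard divides the _sum series by the _count series.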

How to run

Build the source code with mvn clean package
Start the stack: docker compose up -d
Access Prometheus at http://localhost:9090
Access Grafana at http://localhost:3000 (default login: admin:admin)
Import the JSON dashboard provided in this repository under grafana/monitoring/
Interact with the application at http://localhost:8080/web/orders. The metrics on the dashboard update in near real time.
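
Once the stack is up, one quick way to check that the custom metrics are exposed is to read the scrape endpoint and keep only the lines of interest. The following is a minimal sketch, assuming Java 11 or later and that the application serves /actuator/prometheus on port 8080; the class MetricLineFilterSketch, its methods, and the sample values are hypothetical and not part of the repository.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative helper, not part of the application.
final class MetricLineFilterSketch {

    // Hypothetical sample of exposition-format output; the values and the
    // product name are made up for illustration.
    static final String SAMPLE = String.join("\n",
            "# TYPE orders_deleted_total counter",
            "orders_deleted_total 2.0",
            "# TYPE orders_created_per_product_total counter",
            "orders_created_per_product_total{product=\"widget\"} 5.0",
            "# TYPE log_events_total counter",
            "log_events_total{level=\"ERROR\"} 1.0");

    // Fetch the raw text of a metrics endpoint.
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Keep sample lines (not "#" comment lines) whose name starts with the prefix.
    static List<String> samplesFor(String body, String metricPrefix) {
        return body.lines()
                .filter(line -> !line.startsWith("#"))
                .filter(line -> line.startsWith(metricPrefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        // With no argument the embedded sample is used, so this runs offline.
        String body = args.length > 0 ? fetch(args[0]) : SAMPLE;
        List<String> names = List.of(
                "orders_deleted_total",
                "orders_created_per_product_total",
                "log_events_total");
        for (String name : names) {
            samplesFor(body, name).forEach(System.out::println);
        }
    }
}
```

To run it against the live application, pass http://localhost:8080/actuator/prometheus as the first argument.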

Dashboard panels

Total Orders Created: Visualizes the cumulative number of orders created (orders_created_total).
Total Orders Deleted: Shows the number of orders deleted (orders_deleted_total).
Orders per Product: Bar chart/table using orders_created_per_product_total{product="..."} for business insight.
Average Quantity per Order: Displays the average order quantity using order_quantity_average_sum / order_quantity_average_count.
HTTP Request Latency (p95, p99): Shows high-percentile request durations per endpoint using:
  histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
  histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
JVM Memory & CPU Usage: Monitors resource usage with jvm_memory_used_bytes, process_cpu_usage, etc.
Log Event Counters: Visualizes counts of INFO and ERROR logs (log_events_total{level="INFO"} and log_events_total{level="ERROR"}).
Other Stability Metrics: Panels for thread count, GC pause time, and queue size.
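
The p95 and p99 latency values are estimates derived from cumulative histogram buckets, not exact percentiles. As a rough illustration of that estimation, here is a minimal plain-Java sketch of linear interpolation inside the bucket that contains the requested rank. It is a simplified approximation of what histogram_quantile does, not Prometheus code; the class name, the bucket bounds, and the counts are hypothetical.

```java
// Simplified illustration of quantile estimation from cumulative buckets.
final class HistogramQuantileSketch {

    // upperBounds: ascending bucket upper bounds, last one is +Infinity.
    // cumulativeCounts: observations at or below each bound (same order).
    // q: requested quantile, expected to be within (0, 1].
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || n != cumulativeCounts.length) {
            throw new IllegalArgumentException("need at least two matching buckets");
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;
        }
        // Rank falls in the +Inf bucket: report the highest finite bound.
        if (b == n - 1) {
            return upperBounds[n - 2];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - below;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly within the bucket.
        return lower + (upperBounds[b] - lower) * ((rank - below) / inBucket);
    }

    public static void main(String[] args) {
        // Hypothetical request-duration buckets, in seconds.
        double[] bounds = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 98, 100};
        System.out.println("p95 ~ " + estimate(0.95, bounds, counts));
        System.out.println("p99 ~ " + estimate(0.99, bounds, counts));
    }
}
```

In this example the p99 rank lands in the +Inf bucket, so the estimate is capped at the highest finite bound. This is one reason bucket boundaries matter when reading high-percentile panels.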

Alert thresholds

High HTTP latency: p95 > 1s for 5m, to detect slow endpoints.
High ERROR log rate: > 5 errors/min, which may indicate bugs or failures.
High JVM memory usage: > 80% for 5m, to help prevent OOM errors.
High CPU usage: > 80% for 5m, to help detect resource exhaustion.
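
The "for 5m" qualifier means a threshold has to be exceeded continuously for five minutes before an alert fires, so a single spike is ignored. A minimal plain-Java sketch of that hold-duration logic follows. It is only an illustration under assumptions: the class SustainedThresholdSketch is hypothetical and is not how Prometheus or Grafana implement alert evaluation.

```java
import java.time.Duration;
import java.time.Instant;

// Fires only after a value has stayed above a threshold for a hold duration.
final class SustainedThresholdSketch {
    private final double threshold;
    private final Duration holdFor;
    private Instant breachedSince; // null while the value is at or below the threshold

    SustainedThresholdSketch(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    // Feed one evaluation; returns true when the alert should be firing.
    boolean observe(Instant now, double value) {
        if (value <= threshold) {
            breachedSince = null; // any dip resets the timer
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now; // first evaluation above the threshold
        }
        return Duration.between(breachedSince, now).compareTo(holdFor) >= 0;
    }

    public static void main(String[] args) {
        // Example: p95 latency threshold of 1s that must hold for 5 minutes.
        SustainedThresholdSketch alert =
                new SustainedThresholdSketch(1.0, Duration.ofMinutes(5));
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        for (int minute = 0; minute <= 6; minute++) {
            boolean firing = alert.observe(start.plus(Duration.ofMinutes(minute)), 1.2);
            System.out.println("minute " + minute + " firing=" + firing);
        }
    }
}
```

The hold duration trades detection speed for fewer false alarms: a longer window suppresses short spikes but delays notification of a real problem.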

Why these metrics

orders_deleted_total: Tracks deletions for auditing and anomaly detection.
orders_created_per_product_total (with product tag): Enables product-level business insights and anomaly detection.
order_quantity_average: Monitors average order size, useful for business KPIs and detecting outliers.
log_events_total (with level tag): Provides visibility into application health and error rates.