domenico edb7a69bde commit
2025-10-24 20:55:45 +02:00

WellD challenge report

What I did

  • Analyzed the existing Spring Boot/Thymeleaf order management application for missing metrics.
  • Added and exposed additional custom metrics using Micrometer and Spring Boot Actuator, making them available at /actuator/prometheus for Prometheus scraping.
  • Implemented the following custom metrics:
    • orders_deleted_total: Total orders deleted.
    • orders_created_per_product_total (with product tag): Orders created per product.
    • order_quantity_average: Distribution summary for order quantities (enables average, max, count, and sum through series such as order_quantity_average_sum and order_quantity_average_count).
    • log_events_total (with level tag): Counts of INFO and ERROR log events.
  • Ensured JVM and HTTP metrics are available (latency, memory, CPU, threads, GC, etc.) via Actuator.
  • Created a Grafana dashboard (see grafana/monitoring/) with panels for all key metrics and business KPIs.
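
To illustrate why a single distribution summary is enough to derive average, max, count, and sum, here is a minimal sketch in plain Java. QuantitySummary is a hypothetical stand-in written for this report; it is not Micrometer's DistributionSummary and not the code in this repository. It only mirrors the count/sum/max bookkeeping from which the average is computed. One known difference: this sketch keeps a lifetime max, while Micrometer reports a max that decays over a time window.

```java
import java.util.concurrent.atomic.DoubleAccumulator;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for a distribution summary (illustration only). */
public class QuantitySummary {
    private final LongAdder count = new LongAdder();
    private final DoubleAdder sum = new DoubleAdder();
    // Lifetime maximum; starts at 0 because order quantities are assumed non-negative.
    private final DoubleAccumulator max = new DoubleAccumulator(Math::max, 0.0);

    /** Records the quantity of one order. */
    public void record(double quantity) {
        count.increment();
        sum.add(quantity);
        max.accumulate(quantity);
    }

    public long count() { return count.sum(); }

    public double sum() { return sum.sum(); }

    public double max() { return max.get(); }

    /** Average quantity per order, i.e. sum / count; 0 when nothing was recorded. */
    public double mean() {
        long n = count();
        return n == 0 ? 0.0 : sum() / n;
    }

    public static void main(String[] args) {
        QuantitySummary summary = new QuantitySummary();
        summary.record(2);
        summary.record(5);
        summary.record(8);
        System.out.println("count=" + summary.count()
                + " sum=" + summary.sum()
                + " max=" + summary.max()
                + " mean=" + summary.mean());
    }
}
```

The same division is what the dashboard performs on the exported series: the sum series divided by the count series gives the average quantity per order.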

How to use the dashboard

  1. Build the source code with mvn clean package

  2. Start the stack: docker compose up -d

  3. Access Prometheus at http://localhost:9090

  4. Access Grafana at http://localhost:3000 (default login: admin:admin)

  5. Import the JSON dashboard provided in this repository under grafana/monitoring/

  6. Interact with the application at http://localhost:8080/web/orders. The dashboard updates as Prometheus scrapes the new values.


Which panels were created and their purpose

  • Total Orders Created: Visualizes the cumulative number of orders created (orders_created_total).
  • Total Orders Deleted: Shows the number of orders deleted (orders_deleted_total).
  • Orders per Product: Bar chart/table using orders_created_per_product_total{product="..."} for business insight.
  • Average Quantity per Order: Displays the average order quantity using order_quantity_average_sum / order_quantity_average_count.
  • HTTP Request Latency (p95, p99): Shows high-percentile request durations per endpoint using:
    • histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
    • histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))
  • JVM Memory & CPU Usage: Monitors resource usage with jvm_memory_used_bytes, process_cpu_usage, etc.
  • Log Event Counters: Visualizes counts of INFO and ERROR logs (log_events_total{level="INFO"} and log_events_total{level="ERROR"}).
  • Other Stability Metrics: Panels for thread count, GC pause time, and queue size.
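
The latency panels rely on histogram_quantile, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below approximates that calculation in plain Java to show how a p95 value comes out of the le buckets. It assumes a classic histogram with positive bucket bounds and a final +Inf bucket; the class name, method name, and sample numbers are made up for illustration and are not taken from Prometheus or from this repository.

```java
/** Illustrative approximation of how a quantile is estimated from cumulative buckets. */
public class HistogramQuantile {

    /**
     * @param q           target quantile in (0, 1], e.g. 0.95
     * @param upperBounds ascending "le" bounds; the last one must be +Infinity
     * @param cumulative  cumulative counts (or rates) per bucket, same order
     */
    public static double quantile(double q, double[] upperBounds, double[] cumulative) {
        int n = upperBounds.length;
        if (q <= 0.0 || q > 1.0) {
            throw new IllegalArgumentException("q must be in (0, 1]");
        }
        if (n < 2 || cumulative.length != n
                || upperBounds[n - 1] != Double.POSITIVE_INFINITY) {
            throw new IllegalArgumentException("need >= 2 buckets ending with +Inf");
        }
        double total = cumulative[n - 1];
        if (total <= 0.0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulative[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // Rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulative[i - 1];
        double inBucket = cumulative[i] - below;
        // Linear interpolation between the bucket's lower and upper bound.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 98, 100}; // made-up cumulative counts
        System.out.println("p95 = " + quantile(0.95, le, counts));
        System.out.println("p99 = " + quantile(0.99, le, counts));
    }
}
```

With these sample buckets, the p95 rank lands in the (0.5, 1.0] bucket and is interpolated to about 0.81 s, while the p99 rank lands in the +Inf bucket and is reported as the highest finite bound, 1.0 s. This is why the percentile panels are only as precise as the configured bucket boundaries.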

Proposed alerts and thresholds

  1. High HTTP latency: p95 > 1s for 5m. Detects slow endpoints.

  2. High ERROR log rate: more than 5 errors/min. May indicate bugs or failures.

  3. High JVM memory usage: above 80% for 5m. Gives early warning before OOM errors.

  4. High CPU usage: above 80% for 5m. Helps detect resource exhaustion.
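
The "for 5m" part of these thresholds means the condition has to hold continuously for five minutes before the alert fires, so a single spike does not page anyone. The following plain Java sketch shows that behaviour under simplified assumptions: ThresholdAlert is a hypothetical class written for this report, samples are fed in by hand, and it does not reproduce the exact evaluation logic of Prometheus or Grafana alerting.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical threshold alert that fires only after a sustained breach. */
public class ThresholdAlert {
    private final double threshold;
    private final Duration holdFor;
    private Instant breachedSince; // null while the value is at or below the threshold

    public ThresholdAlert(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /**
     * Feeds one sample and returns true when the alert is firing,
     * i.e. the value has stayed above the threshold for at least holdFor.
     */
    public boolean observe(Instant at, double value) {
        if (value <= threshold) {
            breachedSince = null; // condition cleared, restart the clock
            return false;
        }
        if (breachedSince == null) {
            breachedSince = at; // first sample of a new breach
        }
        return Duration.between(breachedSince, at).compareTo(holdFor) >= 0;
    }

    public static void main(String[] args) {
        // p95 latency alert: > 1s for 5m (sample values are made up).
        ThresholdAlert latency = new ThresholdAlert(1.0, Duration.ofMinutes(5));
        Instant t0 = Instant.parse("2025-10-24T18:00:00Z");
        System.out.println(latency.observe(t0, 1.2));
        System.out.println(latency.observe(t0.plus(Duration.ofMinutes(2)), 1.5));
        System.out.println(latency.observe(t0.plus(Duration.ofMinutes(5)), 1.1));
        System.out.println(latency.observe(t0.plus(Duration.ofMinutes(6)), 0.8));
    }
}
```

In this run the alert stays quiet for the first two samples, fires once the breach has lasted five minutes, and resets as soon as the value drops back under the threshold.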


Custom metrics and their purpose

  • orders_deleted_total: Tracks deletions for auditing and anomaly detection.
  • orders_created_per_product_total (with optional product tag): Enables product-level business insights and anomaly detection.
  • order_quantity_average: Monitors average order size, useful for business KPIs and detecting outliers.
  • log_events_total (with level tag): Provides visibility into application health and error rates.