WellD challenge report
What I did
- Analyzed the existing Spring Boot/Thymeleaf order management application for missing metrics.
- Added and exposed additional custom metrics using Micrometer and Spring Boot Actuator, making them available at `/actuator/prometheus` for Prometheus scraping.
- Implemented the following custom metrics:
  - `orders_deleted_total`: Total orders deleted.
  - `orders_created_per_product_total` (with `product` tag): Orders created per product.
  - `order_quantity_average`: Distribution summary for order quantities (enables average, max, count, and sum through series such as `order_quantity_average_sum` and `order_quantity_average_count`).
  - `log_events_total` (with `level` tag): Counts of INFO and ERROR log events.
- Ensured JVM and HTTP metrics are available (latency, memory, CPU, threads, GC, etc.) via Actuator.
- Created a Grafana dashboard (see `grafana/monitoring/`) with panels for all key metrics and business KPIs.
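To make the counter and distribution-summary bookkeeping concrete, here is a minimal plain-Java sketch of what these metrics record. It is a model only, not the application's code: in the application the values are held by Micrometer meters, while the class name `OrderMetricsModel`, its methods, and the product names `widget` and `gadget` are hypothetical and exist purely for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model (no Micrometer dependency) of the bookkeeping behind the custom metrics. */
final class OrderMetricsModel {
    // Models orders_created_per_product_total{product="..."}: one counter per tag value.
    private final Map<String, Long> createdPerProduct = new TreeMap<>();
    // Models orders_deleted_total.
    private long deleted;
    // Models order_quantity_average: a distribution summary keeps a count and a sum.
    private long quantityCount;
    private double quantitySum;
    // Simplified here as a lifetime maximum.
    private double quantityMax;

    void orderCreated(String product, int quantity) {
        createdPerProduct.merge(product, 1L, Long::sum);
        quantityCount++;
        quantitySum += quantity;
        quantityMax = Math.max(quantityMax, quantity);
    }

    void orderDeleted() {
        deleted++;
    }

    long createdFor(String product) {
        return createdPerProduct.getOrDefault(product, 0L);
    }

    long deletedTotal() {
        return deleted;
    }

    double quantityMax() {
        return quantityMax;
    }

    /** Same arithmetic as order_quantity_average_sum / order_quantity_average_count. */
    double averageQuantity() {
        return quantityCount == 0 ? Double.NaN : quantitySum / quantityCount;
    }

    public static void main(String[] args) {
        OrderMetricsModel m = new OrderMetricsModel();
        m.orderCreated("widget", 2); // hypothetical product names
        m.orderCreated("widget", 4);
        m.orderCreated("gadget", 6);
        m.orderDeleted();
        System.out.println("widget orders: " + m.createdFor("widget"));
        System.out.println("average quantity: " + m.averageQuantity()); // 12 / 3
    }
}
```

The model shows why the summary exposes a sum and a count rather than a precomputed average: the division can then be done at query time, over whatever time window the dashboard needs.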
How to use the dashboard
- Build the source code with `mvn clean package`
- Start the stack: `docker compose up -d`
- Access Prometheus at http://localhost:9090
- Access Grafana at http://localhost:3000 (default login: `admin:admin`)
- Import the JSON dashboard provided in this repository under `grafana/monitoring/`
- Interact with the application at http://localhost:8080/web/orders; the dashboard updates in near real time as Prometheus scrapes the new values.
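If a panel stays empty after these steps, it can help to check that the custom metrics actually appear in the scrape output. The following is a rough sketch of such a check, not part of the repository: the class `ScrapeCheck` and its methods are hypothetical, and the URL assumes the Actuator endpoint is served on the same port as the application (8080).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ScrapeCheck {
    /** Extracts metric names from Prometheus text-format lines, skipping comments and blanks. */
    static Set<String> metricNames(String body) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : body.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // "# HELP" and "# TYPE" lines carry no samples
            }
            int end = 0;
            // The name ends at the label block ('{') or at the space before the value.
            while (end < line.length() && line.charAt(end) != '{' && line.charAt(end) != ' ') {
                end++;
            }
            names.add(line.substring(0, end));
        }
        return names;
    }

    /** Returns the expected metric names that do not appear in the scrape body. */
    static List<String> missing(String body, List<String> expected) {
        Set<String> present = metricNames(body);
        List<String> result = new ArrayList<>();
        for (String name : expected) {
            if (!present.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: Actuator shares the application port.
        URI uri = URI.create("http://localhost:8080/actuator/prometheus");
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        String body = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
        List<String> expected = List.of(
                "orders_deleted_total",
                "orders_created_per_product_total",
                "order_quantity_average_count",
                "order_quantity_average_sum",
                "log_events_total");
        System.out.println("Missing metrics: " + missing(body, expected));
    }
}
```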
Which panels were created and their purpose
- Total Orders Created: Visualizes the cumulative number of orders created (`orders_created_total`).
- Total Orders Deleted: Shows the number of orders deleted (`orders_deleted_total`).
- Orders per Product: Bar chart/table using `orders_created_per_product_total{product="..."}` for business insight.
- Average Quantity per Order: Displays the average order quantity using `order_quantity_average_sum / order_quantity_average_count`.
- HTTP Request Latency (p95, p99): Shows high-percentile request durations per endpoint using `histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))` and `histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le, uri))`.
- JVM Memory & CPU Usage: Monitors resource usage with `jvm_memory_used_bytes`, `process_cpu_usage`, etc.
- Log Event Counters: Visualizes counts of INFO and ERROR logs (`log_events_total{level="INFO"}` and `log_events_total{level="ERROR"}`).
- Other Stability Metrics: Panels for thread count, GC pause time, and queue size.
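The latency panels rely on `histogram_quantile`, which estimates a percentile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that idea in plain Java under simplifying assumptions: values are non-negative, the last bucket is `+Inf`, and the bucket bounds and counts in `main` are made-up illustration data, not measurements from this application. In the dashboard the inputs are per-second bucket rates rather than raw counts, but the interpolation works the same way.

```java
final class HistogramQuantile {
    /**
     * Estimates the q-quantile from cumulative buckets.
     *
     * @param q                target quantile, 0 < q <= 1
     * @param upperBounds      ascending bucket upper bounds ("le"), last one +Infinity
     * @param cumulativeCounts observations with value <= the matching upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || cumulativeCounts.length != n || q <= 0.0 || q > 1.0) {
            throw new IllegalArgumentException(
                    "need 0 < q <= 1 and at least one finite bucket plus +Inf");
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - below;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        // Illustrative data: 100 requests, bounds in seconds.
        double[] le = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 98, 100};
        System.out.println("p95 estimate: " + quantile(0.95, le, counts));
        System.out.println("p99 estimate: " + quantile(0.99, le, counts));
    }
}
```

With this data the p95 rank lands in the 0.5s to 1.0s bucket and is interpolated within it, while the p99 rank lands in the `+Inf` bucket and is capped at the highest finite bound. This is why the choice of bucket boundaries limits how precise the percentile panels can be.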
Proposed alerts and thresholds
- High HTTP Latency, threshold of `p95 > 1s for 5m`, to detect slow endpoints
- High ERROR log rate, threshold of `>5 errors/min`, might indicate bugs or failures
- High JVM memory usage, threshold of `>80% for 5m`, might help to prevent OOM errors
- High CPU usage, threshold of `>80% for 5m`, might help to detect resource exhaustion
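Three of these thresholds include a hold duration (`for 5m`): the condition has to stay true for the whole period before the alert fires, which filters out short spikes. The following plain-Java sketch models that behaviour. It is an illustration only; the class `ThresholdAlert` is hypothetical, and the report does not specify which tool (Grafana or Prometheus) would evaluate the alerts.

```java
import java.time.Duration;
import java.time.Instant;

final class ThresholdAlert {
    private final double threshold;
    private final Duration holdFor;
    // Start of the current breach; null while the value is at or below the threshold.
    private Instant breachedSince;

    ThresholdAlert(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Feeds one evaluation and returns true once the breach has lasted at least holdFor. */
    boolean evaluate(Instant now, double value) {
        if (value <= threshold) {
            breachedSince = null; // any recovery resets the hold timer
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now;
        }
        return Duration.between(breachedSince, now).compareTo(holdFor) >= 0;
    }

    public static void main(String[] args) {
        // Models "p95 > 1s for 5m"; timestamps and latencies are illustrative.
        ThresholdAlert latency = new ThresholdAlert(1.0, Duration.ofMinutes(5));
        Instant t0 = Instant.EPOCH;
        System.out.println(latency.evaluate(t0, 1.5));
        System.out.println(latency.evaluate(t0.plus(Duration.ofMinutes(3)), 1.2));
        System.out.println(latency.evaluate(t0.plus(Duration.ofMinutes(5)), 1.1));
    }
}
```

The same structure covers the memory and CPU alerts by using a threshold of 0.8 on the corresponding usage ratio.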
Custom metrics and their purpose
- orders_deleted_total: Tracks deletions for auditing and anomaly detection.
- orders_created_per_product_total (with optional `product` tag): Enables product-level business insights and anomaly detection.
- order_quantity_average: Monitors average order size, useful for business KPIs and detecting outliers.
- log_events_total (with `level` tag): Provides visibility into application health and error rates.