18.84. DD 84: Simple observability#
18.84.1. Summary#
We want a simple way to check whether our various services are working properly.
18.84.2. Motivation#
We want to have observability, and the obvious maximalist solution is Prometheus and Grafana. The problem is that this stack is so complex and cumbersome to configure that we never find the time to set it up properly. By aiming for a perfect solution, we end up with none at all.
I propose a simple solution based on health endpoints that should give us most of what we need more quickly.
18.84.3. Requirements#
Easy for service maintainers to implement
Easy to configure
Easy to maintain
Effective at detecting downtime or degraded states
18.84.4. Proposed Solution#
Each service should have a health endpoint that returns its current health status:
interface HealthStatus {
  // Whether the service is running fine or in a degraded way
  status: "ok" | "degraded";
  // Additional information about the service components
  component: { [key: string]: string };
}
For libeufin-bank:
{
  "status": "degraded",
  "component": {
    "database": "ok",
    "tan-sms": "ok",
    "tan-email": "failure"
  }
}
For libeufin-nexus:
{
  "status": "degraded",
  "component": {
    "database": "ok",
    "ebics-submit": "ok",
    "ebics-fetch": "failure"
  }
}
For taler-exchange:
{
  "status": "degraded",
  "component": {
    "database": "ok",
    "wirewatch": "failure"
  }
}
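The examples above suggest that a service reports "degraded" as soon as any one component fails. A minimal sketch of that aggregation follows, assuming the field name `component` from the JSON examples and component states limited to "ok" and "failure". The helper name `aggregateHealth` is hypothetical and not part of any existing service.

```typescript
// Component states as they appear in the JSON examples above (assumption:
// no other values are used).
type ComponentState = "ok" | "failure";

interface HealthStatus {
  status: "ok" | "degraded";
  component: { [key: string]: string };
}

// Hypothetical helper: the service is "ok" only when every component is "ok",
// otherwise it is reported as "degraded".
function aggregateHealth(component: { [key: string]: ComponentState }): HealthStatus {
  const allOk = Object.keys(component).every((name) => component[name] === "ok");
  return { status: allOk ? "ok" : "degraded", component };
}

// Reproduces the libeufin-bank example.
const bankHealth = aggregateHealth({
  database: "ok",
  "tan-sms": "ok",
  "tan-email": "failure",
});
console.log(JSON.stringify(bankHealth, null, 2));
```

How each service determines the state of an individual component (for example, whether "database" is "ok") is left to its maintainers and is not covered by this sketch.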
Next, Uptime Kuma can be configured to poll this endpoint and trigger an alert when the status is degraded, even if the API is up. The JSON body can be included in the alert, which makes debugging easier because it gives us a clue as to what is failing.
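The decision a monitor has to make on each poll can be sketched as follows. This only illustrates the logic and is not Uptime Kuma's implementation or configuration. The function name `evaluateHealthResponse` and the `CheckResult` type are hypothetical.

```typescript
interface CheckResult {
  alert: boolean;
  message: string;
}

// Hypothetical check: decide whether one poll of the health endpoint should
// raise an alert, and what text the alert should carry.
function evaluateHealthResponse(httpStatus: number, body: string): CheckResult {
  // The API itself is down or erroring.
  if (httpStatus < 200 || httpStatus >= 300) {
    return { alert: true, message: `health endpoint returned HTTP ${httpStatus}` };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch (_err) {
    return { alert: true, message: "health endpoint returned invalid JSON" };
  }
  const status =
    typeof parsed === "object" && parsed !== null
      ? (parsed as { status?: unknown }).status
      : undefined;
  if (status === "ok") {
    return { alert: false, message: "ok" };
  }
  // The API is up but the service is degraded: include the JSON body so the
  // alert shows which component is failing.
  return { alert: true, message: `service not ok: ${body}` };
}

const exchangeBody =
  '{"status":"degraded","component":{"database":"ok","wirewatch":"failure"}}';
const exchangeCheck = evaluateHealthResponse(200, exchangeBody);
console.log(exchangeCheck.alert, exchangeCheck.message);
```

Treating any status other than "ok" as an alert is a deliberate choice in this sketch, so that an unexpected or missing status value is not silently ignored.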
18.84.5. Test Plan#
Add health endpoint to libeufin-bank and libeufin-nexus
Deploy on demo
Test status update and alerts
18.84.6. Alternatives#
18.84.6.1. Prometheus format with Uptime Kuma#
Instead of using JSON, we could use the Prometheus text exposition format. This would make upgrading to a better observability system easier.
# HELP app_status 1=ok, 0.5=degraded, 0=down
# TYPE app_status gauge
app_status 0.5
# HELP component_database_status 1=ok, 0=failure
# TYPE component_database_status gauge
component_database_status 1.0
# HELP component_tan_sms_status 1=ok, 0=failure
# TYPE component_tan_sms_status gauge
component_tan_sms_status 1.0
# HELP component_tan_email_status 1=ok, 0=failure
# TYPE component_tan_email_status gauge
component_tan_email_status 0.0
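The mapping from the JSON health status to metrics of this kind could be sketched as follows. The function name `toPrometheusText` and the `component_<name>_status` naming scheme (taken from the HELP lines above) are assumptions. Hyphens are replaced with underscores because Prometheus metric names cannot contain them.

```typescript
interface HealthStatus {
  status: "ok" | "degraded";
  component: { [key: string]: string };
}

// Hypothetical converter from the JSON health status to the Prometheus text
// format, using one gauge for the service and one gauge per component.
function toPrometheusText(health: HealthStatus): string {
  const lines: string[] = [
    "# HELP app_status 1=ok, 0.5=degraded, 0=down",
    "# TYPE app_status gauge",
    `app_status ${health.status === "ok" ? 1 : 0.5}`,
  ];
  for (const name of Object.keys(health.component)) {
    // "tan-sms" becomes "component_tan_sms_status".
    const metric = `component_${name.replace(/[^a-zA-Z0-9_]/g, "_")}_status`;
    const value = health.component[name] === "ok" ? 1 : 0;
    lines.push(`# HELP ${metric} 1=ok, 0=failure`);
    lines.push(`# TYPE ${metric} gauge`);
    lines.push(`${metric} ${value}`);
  }
  return lines.join("\n") + "\n";
}

const metricsText = toPrometheusText({
  status: "degraded",
  component: { database: "ok", "tan-sms": "ok", "tan-email": "failure" },
});
console.log(metricsText);
```

A running service can never report `app_status 0` itself. The "down" value would have to be inferred by the monitor when the endpoint cannot be reached.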
We could also use the existing Taler Observability API.
18.84.6.2. Prometheus & Grafana alternative#
We could try another alternative, such as VictoriaMetrics, that is more performant though not simpler.
18.84.7. Drawbacks#
This does not cover the monitoring of system resources and services, so it is not sufficient as a complete observability system. However, it is easier to implement for now.