18.84. DD 84: Simple observability#
18.84.1. Summary#
We want a simple way to check if our various services are working properly and to be notified when they are not.
18.84.2. Motivation#
We had some difficulties managing the degraded operation of libeufin-nexus. The systemd service was running correctly, but the process was failing in a loop. There is currently no way to detect this and receive alerts.
We have the same problem with other services. For example, when the exchange is not configured correctly, wirewatch may fail to retrieve the transaction history from libeufin-nexus while the API continues to respond correctly. For now, we have to check this manually.
Right now we only have uptime checks with Uptime Kuma, and we need a better observability system that can detect these degraded states and also provide some context on the failure for faster remediation.
This is where observability comes in: it answers both "is it running?" and "why is it not?". The standard maximalist solution is to set up Grafana and Prometheus, but it has problems:
- It is heavy and can actually take up more resources than all the services it is supposed to monitor
- It is complicated to set up, maintain, and configure
Because of these problems, we don’t have a solution at the moment, as we are delaying this configuration.
I think we can have a simpler and lighter solution that answers "is it running well?" and "why is it not?" in a minimalistic way.
18.84.3. Requirements#
- Easy to implement by service maintainers
- Easy to configure
- Easy to maintain
- Effective at detecting downtime or degraded states
18.84.4. Proposed Solution#
18.84.4.1. Health endpoint#
All services have at least one REST API. This API should expose a health endpoint reporting the service's global status, meaning the httpd process but also all the other components it uses: either ok if everything is fine, or degraded if the service is running but not functioning properly. We also add a way to attach more context, in a less structured form, to help with remediation.
interface HealthStatus {
  // Whether the service is running fine or in a degraded way
  status: "ok" | "degraded";
  // Additional context about the service components and processes
  context: { [key: string]: string };
}
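For illustration, the overall status could be derived from the per-component context. The sketch below assumes the convention visible in the examples that follow: a component value of "failure" marks the whole service as degraded. The helper name is hypothetical, not part of the specification.

```typescript
// Hypothetical helper: a service is "degraded" as soon as any
// component in its context reports "failure"; otherwise it is "ok".
// Timestamp-valued entries (e.g. "*-latest") do not affect the status.
function deriveStatus(context: { [key: string]: string }): "ok" | "degraded" {
  return Object.values(context).some((v) => v === "failure")
    ? "degraded"
    : "ok";
}
```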
For libeufin-bank:
{
  "status": "degraded",
  "context": {
    "database": "ok",
    "tan-sms": "ok",
    "tan-email": "failure"
  }
}
For libeufin-nexus:
{
  "status": "degraded",
  "context": {
    "database": "ok",
    "ebics-submit": "ok",
    "ebics-submit-latest": "2012-04-23T18:25:43.511Z",
    "ebics-submit-latest-success": "2012-04-23T18:25:43.511Z",
    "ebics-fetch": "failure",
    "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
    "ebics-fetch-latest-success": "2012-03-23T18:25:43.511Z"
  }
}
For taler-exchange:
{
  "status": "degraded",
  "context": {
    "database": "ok",
    "wirewatch": "failure",
    "wirewatch-latest": "2012-04-23T18:25:43.511Z"
  }
}
Once the API is in place, Uptime Kuma can be configured to poll this endpoint and trigger an alert when the status is degraded. The JSON body can be included in the alert, which makes remediation easier because it gives hints as to what is not working well inside the service.
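The decision an external monitor (Uptime Kuma, or any cron-driven script) would make against this endpoint can be sketched as follows. The function name and shape are illustrative, not actual Uptime Kuma internals:

```typescript
// Hypothetical monitor logic: given the parsed body of the health
// endpoint, return the alert message (the full JSON body, which
// carries remediation hints) or null when the service is healthy.
function alertFor(body: { status: string; [k: string]: unknown }): string | null {
  return body.status === "ok" ? null : JSON.stringify(body, null, 2);
}
```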
18.84.4.2. Logs#
Currently, we also rely on logs to detect failures. To do this, we need to ingest the logs into a monitoring system in order to analyze them and generate alerts. We need to decide whether we want to continue doing this or whether we can choose to use the logs only for remediation and therefore leave them where they are.
Whenever we want to analyze logs to detect a failure condition, we need to see if it is possible for the system to expose it in its health endpoint.
18.84.5. Test Plan#
- Add the health endpoint to libeufin-bank and libeufin-nexus first
- Deploy on demo
- Test status updates and alerts
18.84.6. Alternatives#
18.84.6.1. Prometheus format with Uptime Kuma#
Instead of using JSON we could use Prometheus metrics textual format. This would make upgrading to a better observability system easier in the future.
# HELP app_status 1=ok, 0.5=degraded, 0=down
# TYPE app_status gauge
app_status 0.5
# HELP component_database_status 1=ok, 0=failure
# TYPE component_database_status gauge
component_database_status 1.0
# HELP component_tan_sms_status 1=ok, 0=failure
# TYPE component_tan_sms_status gauge
component_tan_sms_status 1.0
# HELP component_tan_email_status 1=ok, 0=failure
# TYPE component_tan_email_status gauge
component_tan_email_status 0.0
We could then also use the existing Taler Observability API but only provide simple metrics for now.
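A HealthStatus-style object could be rendered in this Prometheus text exposition format with a small converter. This sketch only handles "ok"/"failure" component values (timestamp entries would need separate metrics), and the metric names follow the example above:

```typescript
// Hypothetical converter from a health context to the Prometheus
// text exposition format.  Only "ok"/"failure" values are handled.
function toPrometheus(
  status: "ok" | "degraded",
  context: { [key: string]: string }
): string {
  const lines: string[] = [
    "# HELP app_status 1=ok, 0.5=degraded, 0=down",
    "# TYPE app_status gauge",
    `app_status ${status === "ok" ? "1.0" : "0.5"}`,
  ];
  for (const [name, value] of Object.entries(context)) {
    // Prometheus metric names may not contain '-', so map it to '_'.
    const metric = `component_${name.replace(/-/g, "_")}_status`;
    lines.push(`# HELP ${metric} 1=ok, 0=failure`);
    lines.push(`# TYPE ${metric} gauge`);
    lines.push(`${metric} ${value === "ok" ? "1.0" : "0.0"}`);
  }
  return lines.join("\n");
}
```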
18.84.6.2. Prometheus & Grafana alternative#
We could try other alternatives such as Victoria Metrics, which are more performant than Prometheus, though not simpler.
18.84.7. Drawbacks#
This does not resolve the issue of system resources and systemd services. It’s therefore not sufficient for a complete observability system. However, it is easier to implement for now.
We could hack our own health endpoint that exposes the systemd services, the Postgres database, and the system status in a similar manner.