DD 59: Statistics
#################

Summary
=======

This design document describes how we track various statistics in the
exchange and merchant, typically for tax reporting or to detect anomalies
to be investigated by anti-money laundering (AML) officers. The key idea
is to use SQL triggers to keep the statistics up-to-date at all times,
combined with a bit of garbage collection to expire ancient statistics.
Deployment-specific statistics can easily be added this way by simply
injecting the appropriate SQL code into the backend, without having to
modify the core exchange or merchant logic.

Motivation
==========

Exchange operators are required to monitor for suspicious transactions as
part of their AML efforts. Merchants need to collect certain data for
their business, especially for tax purposes, but also conceivably to
analyze sales. The specific data to be tracked varies by operator (and
legislation), so we need to be quite flexible in terms of which statistics
are kept, especially to minimize the performance impact.

Requirements
============

- Statistics must always be up-to-date (in real-time) and not only be
  updated in batches.
- Some statistics are amounts, others are simple numerical (integer)
  values.
- Some statistics need to be kept over a sliding interval that moves over
  time, while others need to be mapped to fixed buckets such as a day,
  month, quarter or year.
- Which statistics are tracked may depend on the operational context,
  especially for the exchange; it must thus be easy to add (or remove)
  statistics at any time.
- While tracking statistics inherently costs performance, the runtime
  (CPU and storage) overhead should be minimized; in particular for
  sliding intervals, the events that "slide out" of the interval may be
  coarsened and do not necessarily require accuracy down to the second.
- When adding new statistics, it may be desirable to compute them
  retroactively over historic data (if available).
- The SPAs displaying statistics should not have to make an excessive
  number of REST API calls, so generally multiple values should be
  returned by a single endpoint.
- The merchant is multi-currency capable, and thus amount-valued
  statistics in the merchant backend must be kept per currency.

Proposed Solution
=================

At a high level, we use SQL TRIGGERs to update statistics on-the-fly
whenever an INSERT or UPDATE changes values in the database that affect
the statistics we are tracking. The statistics themselves are also stored
in the database. Additionally, we store meta-data in the database that
determines the granularity of the statistics to keep. This way, the SQL
TRIGGERs can be generic.

Tables
------

The schema consists of several tables:

- **statistic_bucket_meta**: meta-data for statistics we track in
  buckets, including a slug used as the main key to identify the
  statistic, the data type (amount or number), and a pair consisting of
  the range of the bucket (year, quarter, month, etc.) and how many ages
  (generations) of that bucket to keep. Basically, now minus age times
  range tells us how far into the past statistics are kept.
- **statistic_interval_meta**: meta-data for statistics we keep by time
  interval, again with a slug and data type, but then pairs consisting of
  a number of seconds for the range and a precision. A range of 60s
  implies that we count the events of the last 60s in the interval.
  Usually multiple ranges are given, like 60s, 120s and 180s, to keep
  events for the last one, two and three minutes. Note that in the
  database the data kept for the 120s range excludes the data from the
  first 60s; the correct total for any interval is computed on-the-fly by
  adding up the amounts for the smaller intervals. A precision of 5s
  means that event timestamps are normalized (rounded down) to multiples
  of 5s, and thus an event at 64s may still be counted for the 60s
  interval. Note that both ranges and precisions must be monotonically
  increasing in their respective arrays.
- **statistic_bucket_counter**: contains the current numeric (non-amount)
  value for a particular bucket (given by starting point and range) and
  object; objects in the exchange are (normalized) payto hashes
  identifying bank accounts, while objects in the merchant backend are
  the instances of the merchant.
- **statistic_bucket_amount**: same as the counter bucket, except for an
  amount; in the case of the merchant, the *currency* is an additional
  dimension, as amounts are tracked per currency.
- **statistic_counter_event**: represents the sum of all (counter) events
  at a particular time *slot*, again given per object; the *slot* is the
  time of the event rounded down to the *precision* of the interval the
  event falls into.
- **statistic_amount_event**: same as the counter event, except for
  amounts.
- **statistic_interval_counter**: cumulative value of the counter for the
  given range; additionally includes the unique ID of the oldest event
  that is included in the counter (as the range is given as a relative
  time, we need to track which events are included).
- **statistic_interval_amount**: same as the interval counter, except for
  amounts.
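To make the meta-data concrete, the following sketch shows how a
deployment might register an amount-valued statistic, both in monthly
buckets and over sliding intervals. The exact column names and types
(``stype``, ``bucket_range``, ``ages``, ``ranges``, ``precisions``) are
illustrative assumptions, not the final schema:

.. code-block:: sql

   -- Track the deposit volume per month, keeping 12 generations
   -- (roughly one year of history).
   INSERT INTO statistic_bucket_meta
     (slug, stype, bucket_range, ages)
   VALUES
     ('deposits-volume', 'amount', 'month', 12);

   -- Also track the same slug over sliding intervals covering the
   -- last 1, 2 and 3 minutes; events are rounded to ever coarser
   -- precisions (5s, 10s, 30s) as they age, before being discarded
   -- entirely.  Both arrays must be monotonically increasing.
   INSERT INTO statistic_interval_meta
     (slug, stype, ranges, precisions)
   VALUES
     ('deposits-volume', 'amount',
      ARRAY[60, 120, 180],
      ARRAY[5, 10, 30]);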
Stored procedures
-----------------

We additionally provide a few stored procedures to update the statistics.
These are:

* **bump_number_bucket_stat** (slug, object, timestamp, delta): increases
  the *slug* bucket counter of *object* at *timestamp* by *delta*; does
  nothing if *slug* is not in the bucket meta-data table or if
  *timestamp* is past the range of buckets we are currently tracking;
* **bump_amount_bucket_stat** (slug, object, timestamp, amount): same as
  the above, just for statistics of type amount;
* **bump_number_interval_stat** (slug, object, timestamp, delta):
  increases the *slug* interval counter of *object* at *timestamp* by
  *delta*;
* **bump_amount_interval_stat** (slug, object, timestamp, amount): same
  as the above, just for statistics of type amount;
* **bump_number_stat** (slug, object, timestamp, delta): increases both
  bucket and interval counters for *slug* of *object* at *timestamp* by
  *delta*;
* **bump_amount_stat** (slug, object, timestamp, amount): same as the
  above, just for statistics of type amount.

We furthermore provide a few stored procedures to access the interval
statistics (while making sure they are current). These are:

* **statistic_interval_number_get** (slug, object): returns all non-zero
  counters and time intervals that we are tracking for the given *slug*
  and *object*; as a side effect, the intervals are brought up to date
  and events are possibly discarded or coarsened (if the numeric value of
  the precision for larger intervals is larger and thus allows for more
  rounding);
* **statistic_interval_amount_get** (slug, object): same as the above,
  just for statistics of type amount.

Finally, there are some helpers for cleaning up:

* **statistic_bucket_gc** (): removes buckets and events past the range
  for which we track statistics; also coarsens events to the precision of
  the range into which they fall.
* **exchange_drop_customization** (schema): special function for the
  exchange which allows dropping customizations by *schema*.

Schema
------

For the exchange, all triggers (and associated stored procedures) should
be kept in one or more payment-service-provider-specific SCHEMAs.
Furthermore, the name of the schema MUST be provided in the *schema*
column of the meta-data table entries. This way, all triggers, stored
procedures and statistics can be removed simply by DROPing the SCHEMA and
removing the associated entries from the meta-data tables (the foreign
key constraints then CASCADE and clean up the statistics themselves). The
SCHEMA name must also be used as the prefix for the SQL files that inject
the triggers. The usual database versioning should also be used, except
this time using the SCHEMA name instead of "exchange-". "exchange" is not
allowed as a name for a customization SCHEMA.

**taler-exchange-dbinit** is extended with command-line arguments to load
the latest version of a customization schema or to call
**exchange_drop_customization** to remove one.
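Putting the pieces together, here is a minimal sketch of what such a
customization could look like: a trigger in a hypothetical ``acme``
schema that bumps the statistic registered above whenever a deposit is
recorded. The exchange table and column names used here
(``exchange.deposits``, ``wire_target_h_payto``, ``execution_date``,
``amount``) are illustrative assumptions, and we assume the bump helpers
are implemented as SQL functions (use ``CALL`` instead of ``PERFORM`` if
they end up as actual procedures):

.. code-block:: sql

   CREATE SCHEMA acme;

   CREATE FUNCTION acme.on_deposit()
   RETURNS trigger
   LANGUAGE plpgsql
   AS $$
   BEGIN
     -- Update both bucket and interval statistics for the bank
     -- account identified by the (normalized) payto hash.
     PERFORM bump_amount_stat(
       'deposits-volume',        -- slug from the meta-data tables
       NEW.wire_target_h_payto,  -- object: the affected bank account
       NEW.execution_date,       -- timestamp of the event
       NEW.amount);              -- amount to add
     RETURN NEW;
   END $$;

   CREATE TRIGGER acme_deposit_stats
     AFTER INSERT ON exchange.deposits
     FOR EACH ROW
     EXECUTE FUNCTION acme.on_deposit();

Dropping the ``acme`` SCHEMA and its meta-data entries (via
**exchange_drop_customization**) then removes the trigger, the trigger
function and, through the CASCADEing foreign keys, the collected
statistics themselves.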
Definition of Done
==================

- key statistics for merchant and TOPS-deployment implemented
- REST API for merchant specified and implemented
- REST API for AML officer specified and implemented
- SPAs visualize key statistics

Alternatives
============

- Batch processing to compute the statistics, with the REST API only
  returning the value computed by the last batch run (the computational
  cost is only paid if the statistic is desired, but may then be high;
  there is also the potential for outdated data being shown).
- Computing the statistics in the C code; this may have more data easily
  available, but has the major disadvantage of making it harder to add or
  remove statistics and makes the transaction logic more complex; it is
  also easier to miss triggering events.

Drawbacks
=========

- ongoing baseline cost for maintaining the statistics even if nobody
  looks at them (but only if the respective statistic is enabled)

Discussion / Q&A
================

(This should be filled in with results from discussions on mailing lists /
personal communication.)