DD 59: Statistics
#################

Summary
=======

This design document describes how we track various statistics in the
exchange and merchant, typically for tax reporting or to detect anomalies
to be investigated by anti-money laundering (AML) officers. The key idea
is to use SQL triggers to keep the statistics up-to-date at all times,
combined with a bit of garbage collection to expire ancient statistics.
Deployment-specific statistics can easily be added this way by simply
injecting the appropriate SQL code into the backend, without having to
modify the core exchange or merchant logic.

Motivation
==========

Exchange operators are required to monitor for suspicious transactions as
part of their AML efforts. Merchants need to collect certain data for
their business, especially for tax purposes, but also conceivably to
analyze sales. The specific data to be tracked varies by operator (and
legislation), so we need to be quite flexible in terms of which statistics
are kept, especially to minimize the performance impact.

Requirements
============

- Statistics must always be up-to-date (in real-time) and not only be
  updated in batches.
- Some statistics are amounts, others are simple numerical (integer)
  values.
- Some statistics need to be kept over a sliding interval that moves over
  time, while others need to be mapped to fixed buckets such as a day,
  month, quarter or year.
- Which statistics are tracked may depend on the operational context,
  especially for the exchange; it must thus be easy to add (or remove)
  statistics at any time.
- While tracking statistics inherently costs performance, the runtime
  (CPU and storage) overhead should be minimized; in particular for
  sliding intervals, the events that "slide out" of the interval may be
  coarsened and do not necessarily require accuracy down to the second.
- When adding new statistics, it may be desirable to compute them
  retroactively over historic data (if available).
- The SPAs displaying statistics should not have to make an excessive
  number of REST API calls, so generally multiple values should be
  returned by a single endpoint.
- The merchant is multi-currency capable, and thus amount-valued
  statistics in the merchant backend must be kept per currency.

Proposed Solution
=================

At a high level, we use SQL TRIGGERs to update statistics on-the-fly
whenever an INSERT or UPDATE changes values in the database that affect
the statistics we are tracking. The statistics themselves are also stored
in the database. Additionally, we store meta-data in the database that
determines the granularity of the statistics to keep. This way, the SQL
TRIGGERs can be generic.

Tables
------

The schema consists of several tables:

- **statistic_bucket_meta**: meta-data for statistics we track in
  buckets, including a slug used as the main key to identify the
  statistic, the data type (amount or number), and a pair consisting of
  the range of the bucket (year, quarter, month, etc.) and how many ages
  (generations) of that bucket to keep. Basically, now minus age times
  range tells us how far into the past statistics are kept.
- **statistic_interval_meta**: meta-data for statistics we keep by time
  interval, again with a slug and data type, but then pairs consisting of
  a number of seconds for the range and a precision. A range of 60s
  implies that we count the events of the last 60s in the interval.
  Usually multiple ranges are given, like 60s, 120s and 180s, to keep
  events for the last one, two and three minutes. Note that in the
  database the data kept for the 120s range excludes the data from the
  first 60s; the correct total for any interval is computed on-the-fly by
  adding up the amounts for the smaller intervals. A precision of 5s
  means that event timestamps are normalized (rounded down) to multiples
  of 5s, and thus an event at 64s may still be counted for the 60s
  interval. Note that both ranges and precisions must be monotonically
  increasing in their respective arrays.
- **statistic_bucket_counter**: contains the current numeric (non-amount)
  value for a particular bucket (given by starting point and range) and
  object; objects in the exchange are (normalized) payto hashes
  identifying bank accounts, while objects in the merchant backend are
  the instances of the merchant.
- **statistic_bucket_amount**: same as the counter bucket, except for an
  amount; in the case of the merchant, the *currency* is an additional
  dimension, as amounts are tracked per currency.
- **statistic_counter_event**: represents the sum of all (counter) events
  at a particular time *slot*, again given per object; the *slot* is the
  time of the event rounded down to the *precision* of the interval the
  event falls into.
- **statistic_amount_event**: same as the counter event, except for
  amounts.
- **statistic_interval_counter**: cumulative value of the counter for the
  given range; additionally includes the unique ID of the oldest event
  that is included in the counter (as the range is given as a relative
  time, we need to track which events are included).
- **statistic_interval_amount**: same as the interval counter, except for
  amounts.
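To make the meta-data concrete, the following sketch shows how a
deployment might register an amount-valued statistic, both in monthly
buckets and over sliding intervals. The exact column names and types
(``stype``, ``bucket_range``, ``ages``, ``ranges``, ``precisions``) are
illustrative assumptions, not the final schema:

.. code-block:: sql

   -- Track the deposit volume per month, keeping 12 generations
   -- (roughly one year of history).
   INSERT INTO statistic_bucket_meta
     (slug, stype, bucket_range, ages)
   VALUES
     ('deposits-volume', 'amount', 'month', 12);

   -- Also track the same slug over sliding intervals covering the
   -- last 1, 2 and 3 minutes; events are rounded to ever coarser
   -- precisions (5s, 10s, 30s) as they age, before being discarded
   -- entirely.  Both arrays must be monotonically increasing.
   INSERT INTO statistic_interval_meta
     (slug, stype, ranges, precisions)
   VALUES
     ('deposits-volume', 'amount',
      ARRAY[60, 120, 180],
      ARRAY[5, 10, 30]);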
Stored procedures
-----------------

We additionally provide a few stored procedures to update the statistics.
These are:

* **bump_number_bucket_stat** (slug, object, timestamp, delta): increases
  the *slug* bucket counter of *object* at *timestamp* by *delta*; does
  nothing if *slug* is not in the bucket meta-data table or if
  *timestamp* is past the range of buckets we are currently tracking;
* **bump_amount_bucket_stat** (slug, object, timestamp, amount): same as
  the above, just for statistics of type amount;
* **bump_number_interval_stat** (slug, object, timestamp, delta):
  increases the *slug* interval counter of *object* at *timestamp* by
  *delta*;
* **bump_amount_interval_stat** (slug, object, timestamp, amount): same
  as the above, just for statistics of type amount;
* **bump_number_stat** (slug, object, timestamp, delta): increases both
  bucket and interval counters for *slug* of *object* at *timestamp* by
  *delta*;
* **bump_amount_stat** (slug, object, timestamp, amount): same as the
  above, just for statistics of type amount.

We furthermore provide a few stored procedures to access the interval
statistics (while making sure they are current). These are:

* **statistic_interval_number_get** (slug, object): returns all non-zero
  counters and time intervals that we are tracking for the given *slug*
  and *object*; as a side effect, the intervals are brought up to date
  and events are possibly discarded or coarsened (if the numeric value of
  the precision for larger intervals is larger and thus allows for more
  rounding);
* **statistic_interval_amount_get** (slug, object): same as the above,
  just for statistics of type amount.

Finally, there are some helpers for cleaning up:

* **statistic_bucket_gc** (): removes buckets and events past the range
  for which we track statistics; also coarsens events to the precision of
  the range into which they fall.
* **exchange_drop_customization** (schema): special function for the
  exchange which allows dropping customizations by *schema*.

Schema
------

For the exchange, all triggers (and associated stored procedures) should
be kept in one or more payment-service-provider-specific SCHEMAs.
Furthermore, the name of the schema MUST be provided in the *schema*
column of the meta-data table entries. This way, all triggers, stored
procedures and statistics can be removed simply by DROPing the SCHEMA and
removing the associated entries from the meta-data tables (the foreign
key constraints then CASCADE and clean up the statistics themselves). The
SCHEMA name must also be used as the prefix for the SQL files that inject
the triggers. The usual database versioning should also be used, except
this time using the SCHEMA name instead of "exchange-". "exchange" is not
allowed as a name for a customization SCHEMA.

**taler-exchange-dbinit** is extended with command-line arguments to load
the latest version of a customization schema or to call
**exchange_drop_customization** to remove one.
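Putting the pieces together, here is a minimal sketch of what such a
customization could look like: a trigger in a hypothetical ``acme``
schema that bumps the statistic registered above whenever a deposit is
recorded. The exchange table and column names used here
(``exchange.deposits``, ``wire_target_h_payto``, ``execution_date``,
``amount``) are illustrative assumptions, and we assume the bump helpers
are implemented as SQL functions (use ``CALL`` instead of ``PERFORM`` if
they end up as actual procedures):

.. code-block:: sql

   CREATE SCHEMA acme;

   CREATE FUNCTION acme.on_deposit()
   RETURNS trigger
   LANGUAGE plpgsql
   AS $$
   BEGIN
     -- Update both bucket and interval statistics for the bank
     -- account identified by the (normalized) payto hash.
     PERFORM bump_amount_stat(
       'deposits-volume',        -- slug from the meta-data tables
       NEW.wire_target_h_payto,  -- object: the affected bank account
       NEW.execution_date,       -- timestamp of the event
       NEW.amount);              -- amount to add
     RETURN NEW;
   END $$;

   CREATE TRIGGER acme_deposit_stats
     AFTER INSERT ON exchange.deposits
     FOR EACH ROW
     EXECUTE FUNCTION acme.on_deposit();

Dropping the ``acme`` SCHEMA and its meta-data entries (via
**exchange_drop_customization**) then removes the trigger, the trigger
function and, through the CASCADEing foreign keys, the collected
statistics themselves.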
Definition of Done
==================

- key statistics for merchant and TOPS-deployment implemented
- REST API for merchant specified and implemented
- REST API for AML officer specified and implemented
- SPAs visualize key statistics

Alternatives
============

- Batch processing to compute the statistics, with the REST API only
  returning the value computed by the last batch run (the computational
  cost is only paid if the statistic is desired, but may then be high;
  there is also the potential for outdated data being shown).
- Computing the statistics in the C code; this may have more data easily
  available, but has the major disadvantage of making it harder to add or
  remove statistics and makes the transaction logic more complex; it is
  also easier to miss triggering events.

Drawbacks
=========

- ongoing baseline cost for maintaining the statistics even if nobody
  looks at them (but only if the respective statistic is enabled)

Discussion / Q&A
================

(This should be filled in with results from discussions on mailing lists /
personal communication.)