19.76. DD 76: Paivana - Fighting AI Bots with GNU Taler

19.76.1. Summary

This design document describes the architecture of an AI Web firewall using GNU Taler, as well as new features that are required for the implementation.

19.76.2. Motivation

AI bots are causing enormous amounts of traffic by scraping sites like git forges. They respect neither robots.txt nor 5xx HTTP responses. Solutions like Anubis and IP-based blocking no longer work at this point.

19.76.3. Requirements

  • Must withstand high traffic from bots: requests made before a payment has happened must be very cheap, both in terms of response generation and database interaction. This includes good support for caching.

  • Should work not just for our paivana-httpd but also for Turnstile-style paywalls that need to work with purely static paywall pages without PHP sessions.

19.76.4. Proposed Solution

19.76.4.1. Architecture

  • paivana-httpd is a reverse proxy that sits between ingress HTTP(S) traffic and the protected upstream service.

  • paivana-httpd is configured with a particular merchant backend.

  • A payment template must be set up in the merchant backend (called {template_id} from here on).

Steps:

  • The browser visits {website} (for example, https://git.taler.net), where {domain} is the domain name of {website}.

  • paivana-httpd works as a reverse proxy for {website}. Whenever it is called for a non-whitelisted URL, it checks for the presence of a Paivana cookie that is valid for this client IP address and {website} at this time. The Paivana cookie is computed as:

    cur_time || '-' || crock32(SHA512(website || client_ip || paivana_server_secret || cur_time)).

    where cur_time in the prefix is the current time in seconds (to keep it short), while in the hash it is, as usual, a binary GNUnet timestamp in network byte order. crock32 is GNUnet’s Crockford-inspired base32 encoding.

    • If such a cookie is set and valid, the request is reverse-proxied to upstream. Stop.

    • Otherwise, an HTTP 302 Redirect to /.well-known/paivana/templates/$ID#SITE is returned. Here, $ID is the template ID and $SITE is the website currently being visited. This way, the template page can be fully static and cached, and the JavaScript logic on that page can learn which website to pay for (and after payment go back there).

  • When the browser requests /.well-known/paivana/templates/$ID, a static cacheable paywall page is returned, including a machine-readable Paivana HTTP header with the taler://pay-template/ URL minus the client-computed {paivana_id} and fulfillment URL (see below).

  • The browser (rendering the paywall page) generates a random paivana ID via JS using the current time (cur_time) in seconds since the Epoch and the current URL ({website}) plus some freshly generated entropy ({nonce}):

    paivana_id := cur_time || '-' || b64url(SHA256(nonce || website || cur_time)).

    Here b64url is the RFC 7515 base64 URL encoder, used to keep the result short (same reason for the use of SHA-256). The same computation could also easily be done by a non-JS client that processes the Paivana HTTP header (or a GNU Taler wallet running as a Web extension).

  • Based on this paivana ID, a taler://pay-template/{merchant_backend}/{template_id}?session_id={paivana_id}&fulfillment_url={website} URI is generated and rendered as a QR code and link, prompting the user to pay for access to the {website} using GNU Taler.

  • The JavaScript in the paywall page running in the browser (or the non-JS client) long-polls on a new https://{merchant_backend}/sessions/{paivana_id} endpoint that returns when an order with the given session ID has been paid for (regardless of the order ID, which is not known to the browser).

  • A wallet now needs to instantiate the pay template, passing the session_id and the fulfillment_url as additional inputs to the order creation (the session ID here will work just like the existing use of session_ids in session-bound payments). Similarly, the {website} works as the fulfillment URL as usual.

  • The wallet then must pay for the resulting order by talking to the Merchant backend.

  • When the long-poller returns and the payment has succeeded, the browser (still rendering the paywall page) also learns the order ID.

  • The JavaScript of the paywall page (or the non-JS client processing the Paivana HTTP header) then POSTs the order ID, nonce, cur_time and website to {domain}/.well-known/paivana.

  • paivana-httpd computes the paivana ID and checks if the given order ID was indeed paid recently for the computed paivana ID. If so, it generates an HTTP response that sets the Paivana cookie and redirects to the fulfillment URL (which is the original {website}).

  • The browser reloads the page with the correct Paivana cookie (see first step).
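
The cookie check from the first steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paivana-httpd implementation: the function names, the max_age validity window, the UTF-8 encoding of website and client_ip, and the encoding of the hashed timestamp as a 64-bit big-endian microsecond value are all assumptions that this document does not fix. The crock32 helper is a Crockford-style base32 encoder (alphabet without I, L, O and U); it should be checked against GNUnet's encoder before relying on byte-for-byte compatibility.

```python
import hashlib
import hmac
import struct

# Crockford-style alphabet: digits plus letters without I, L, O, U.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def crock32(data: bytes) -> str:
    """Encode bytes as base32, most significant bits first, no padding chars."""
    bits = 0
    nbits = 0
    out = []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 5:
            out.append(CROCKFORD_ALPHABET[(bits >> (nbits - 5)) & 31])
            nbits -= 5
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Left-align the remaining bits, filling with zero bits.
        out.append(CROCKFORD_ALPHABET[(bits << (5 - nbits)) & 31])
    return "".join(out)


def make_cookie(website: str, client_ip: str, secret: bytes, cur_time: int) -> str:
    """cur_time || '-' || crock32(SHA512(website || client_ip || secret || cur_time))."""
    # Assumption: the hashed timestamp is microseconds as a 64-bit big-endian integer.
    ts_nbo = struct.pack(">Q", cur_time * 1_000_000)
    digest = hashlib.sha512(
        website.encode() + client_ip.encode() + secret + ts_nbo
    ).digest()
    return f"{cur_time}-{crock32(digest)}"


def check_cookie(cookie: str, website: str, client_ip: str,
                 secret: bytes, now: int, max_age: int) -> bool:
    """Return True if the cookie is authentic and younger than max_age seconds."""
    ts_text, sep, _ = cookie.partition("-")
    if not sep or not (ts_text.isascii() and ts_text.isdigit()):
        return False
    cur_time = int(ts_text)
    # Hypothetical validity rule: issued in the past, not older than max_age.
    if not (cur_time <= now < cur_time + max_age):
        return False
    expected = make_cookie(website, client_ip, secret, cur_time)
    return hmac.compare_digest(expected.encode(), cookie.encode())
```

Because validation needs only the server secret and the request's own data, a check of this shape requires no database lookup for requests made before a payment. The constant-time comparison avoids leaking how much of a forged cookie was correct.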

19.76.4.2. Problems:

  • A smart attacker might still create a lot of orders via the pay-template.

    • Solution A: Don’t care, unlikely to happen in the first place.

    • Solution B: Rate-limit template instantiation on a per-IP basis.
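
Solution B could, for instance, be realized with a per-IP token bucket. The following Python sketch is one possible approach, not a decided design: the class name, the rate and burst parameters, and the in-memory dictionary are all hypothetical, and a real deployment would additionally need to evict idle entries.

```python
import time
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float   # tokens currently available
    updated: float  # clock value at the last refill


class PerIpRateLimiter:
    """Token bucket per client IP: `rate` tokens per second, at most `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets: dict[str, Bucket] = {}

    def allow(self, client_ip: str) -> bool:
        """Return True if this IP may instantiate a template right now."""
        now = self.clock()
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            # A previously unseen IP starts with a full bucket.
            bucket = self.buckets[client_ip] = Bucket(float(self.burst), now)
        # Refill according to the elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

With rate=1.0 and burst=2, a single IP could create two orders immediately and then one more per second, while other IPs are unaffected.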

19.76.4.3. Implementation:

  • Merchant backend needs a way to look up order IDs under a session_id (DONE: e027e729..b476f8ae)

  • Merchant backend needs a way to instantiate templates with a given session_id and fulfillment_url. This also requires extending the allowed responses for templates in general.

  • Paivana component needs to be implemented

  • Wallet-core needs support for a session_id and fulfillment_url in pay templates.
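
To illustrate what the template and wallet-core items have to carry, the following Python sketch derives a paivana ID as described in the architecture steps, builds the taler://pay-template/ URI with session_id and fulfillment_url, and parses those two parameters back out. It is a sketch under assumptions: the function names are hypothetical, cur_time is assumed to be hashed as decimal ASCII text, website is assumed to be UTF-8, and the query values are assumed to be percent-encoded.

```python
import base64
import hashlib
from urllib.parse import parse_qs, urlencode, urlsplit


def b64url(data: bytes) -> str:
    """RFC 7515 base64url: URL-safe alphabet, trailing '=' padding removed."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_paivana_id(nonce: bytes, website: str, cur_time: int) -> str:
    """cur_time || '-' || b64url(SHA256(nonce || website || cur_time))."""
    # Assumption: cur_time enters the hash as decimal ASCII digits.
    digest = hashlib.sha256(
        nonce + website.encode() + str(cur_time).encode()
    ).digest()
    return f"{cur_time}-{b64url(digest)}"


def pay_template_uri(merchant_backend: str, template_id: str,
                     paivana_id: str, website: str) -> str:
    """Build the URI that the paywall page renders as a QR code and link."""
    query = urlencode({"session_id": paivana_id, "fulfillment_url": website})
    return f"taler://pay-template/{merchant_backend}/{template_id}?{query}"


def parse_pay_template_uri(uri: str) -> tuple[str, str]:
    """Extract (session_id, fulfillment_url), as a wallet would need to."""
    params = parse_qs(urlsplit(uri).query)
    return params["session_id"][0], params["fulfillment_url"][0]
```

Since paivana-httpd later receives nonce, cur_time and website in the POST, it can recompute the same ID with a function like make_paivana_id and compare it against the session ID of the paid order.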

19.76.5. Test Plan

  • Deploy it for git.taler.net

19.76.6. Definition of Done

N/A

19.76.7. Alternatives

  • Do not re-use the session ID mechanism but introduce some new concept. This has the drawback that we would need additional tables and indices, and the existing use of the session ID closely parallels this one.

  • Instead of doing a 302 Redirect, cache control could have been achieved by specifying a “Vary: Cookie” HTTP header. We may combine these and use that to additionally enable caching of the 302 Redirect. The 302 solution has the advantage that there is only one page to cache per template, and the disadvantage of an additional redirect. Note that this is purely a frontend design choice: wallets and merchant backends work nicely with either approach.
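
As a rough illustration of the combined variant, the redirect issued for a request without a valid cookie might carry headers like the ones produced below. This Python sketch is only an assumption about how such a response could look: the function name, the max_age default, the Cache-Control value and the percent-encoding of the fragment are not specified by this document.

```python
from urllib.parse import quote


def paywall_redirect(template_id: str, site: str, max_age: int = 300):
    """Return (status, headers) for a cacheable redirect to the paywall page."""
    location = (
        "/.well-known/paivana/templates/"
        + quote(template_id, safe="")
        + "#"
        + quote(site, safe=":/")  # the fragment tells the page which site to pay for
    )
    headers = {
        "Location": location,
        # Let caches keep separate entries for requests with different cookies.
        "Vary": "Cookie",
        # Hypothetical cache lifetime for the redirect itself.
        "Cache-Control": f"public, max-age={max_age}",
    }
    return 302, headers
```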

19.76.8. Drawbacks

  • This exposes an order ID to anyone who knows the session ID. This is clearly not an issue in this context. For the existing uses of the session ID, it also seems clear that an attacker who knows the session ID must already have access that would easily give them any order ID as well, so this seems harmless.

19.76.9. Discussion / Q&A