+++
title = "Can antibots be both efficient and anonymous?"
date = 2025-09-20
description = "todo"
insert_anchor_links = "left"
draft = true

[taxonomies]
tags = ["cryptography"]

[extra]
katex = true
+++

Some people with a lot of money (or, at least, who control a lot of machines) decided to flood the Internet with useless requests, crawling every website while respecting neither robots.txt nor anything else. They even made some services shut down, [effectively performing a DDoS](https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html). All that to train AI on garbage data, so you can ask a chatbot to quote Wikipedia or StackOverflow instead of using a proper search engine.

The Internet relies on a heap of protocols that only work well when people behave correctly: it stops being efficient when someone gains too much power (bandwidth and IP addresses). Cloud providers indeed give bad guys enough clouds to make a storm, not to mention the "Internet of Things", which allows botnets to run on security cameras, baby monitors and sex toys.

One of the most common practices on the Internet is fundamentally altruistic: giving a copy of a file to whomever asks for it, for free (what is commonly called the "Web"). The problem is that answering such a request consumes a machine's resources (energy, computing time, IO time, memory, etc.), resources that can be exhausted if people ask for too much.

## A few solutions

### Rate-limiting IP addresses

The most basic solution, counting traffic per IP address, is probably the first thing to do. Sadly it is not enough anymore, as it cannot detect [botnets](https://www.trendmicro.com/vinfo/us/security/news/vulnerabilities-and-exploits/a-closer-exploration-of-residential-proxies-and-captcha-breaking-services) or even large IP regions owned by a single entity. Lowering the threshold also leads to false positives, blocking an entire school or office when dozens of people share the same IP address.

### Proof of intelligence

Captchas ask the user to solve a problem that is (supposedly) difficult for a computer but (supposedly) easy for a human. However, they take time to solve even for a human, they are not accessible to people who cannot see or hear or who have mental disabilities, and modern AIs can already solve them. The irony is that their main purpose is not to help us filter bots, but to help Google train AI: first it was character recognition for scanning old books, now it is image categorization for self-driving cars and voice transcription for voice assistants and targeted advertising.

### Proof of browser

Systems that do not require user input can check whether they are being run in a proper web browser, by testing various features. However, they can be fooled by giving more power to the bot's engine (e.g. using [Selenium](https://www.selenium.dev/)), which then becomes indistinguishable from a browser.

### Proof of work

Proof of work requires solving problems that are slow to solve but fast to check. A few seconds of computing time are needed to solve the challenge. The difficulty must be well balanced so that it is fast enough for a legitimate user, but too expensive for a spammer who sends thousands of requests per second. However, this no longer seems to deter spammers, as Anubis (an antispam system based on proof of work) failed to stop some attacks. It is likely that the gap between low-end computers and AI servers or botnets will only get larger, making PoW a non-viable solution.
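
To make the asymmetry concrete, here is a minimal hashcash-style sketch (an illustration, not Anubis's actual scheme): the client brute-forces a nonce whose hash starts with a given number of zero bits, while the server checks the result with a single hash. The `difficulty_bits` value and the 8-byte nonce encoding are arbitrary choices for the example.

```python
import hashlib
import secrets

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: try nonces until SHA-256(challenge || nonce) starts with
    `difficulty_bits` zero bits. Costs about 2**difficulty_bits hashes on average."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce
        nonce += 1

def check(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash, whatever the difficulty."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

challenge = secrets.token_bytes(16)  # issued by the server, unique per client
nonce = solve(challenge, 20)         # about a million hashes: noticeable on a laptop
assert check(challenge, nonce, 20)
```

The imbalance described above is visible here: raising `difficulty_bits` by a few units makes a legitimate visitor wait noticeably longer, while well-funded hardware barely notices.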
### Global monitoring

If you are big enough to have a global database of real-time traffic per IP address (e.g. CloudFlare, Amazon, Google, etc.), you can detect spammy addresses and stop them immediately. However, such a centralized solution is not acceptable, as it gives too much power to gigantic corporations and creates single points of failure (see the large-scale CloudFlare and AWS outages in 2025). Decentralized and anonymous spam databases may be an interesting research subject, but they seem quite complicated and insufficient against sudden attacks using disposable IP addresses.

### Political solutions

Why is spamming even possible?

* Being rich or being funded by banks allows people to acquire a huge amount of physical resources (computers, network links), even against the will of the community.
* The business model of art makes it profitable to produce AI-generated garbage, thanks to advertising and big platforms which do not fulfill the role of a proper editor, to the detriment of artists.
* Poor people are encouraged to join botnets of residential proxies to earn a few dollars a month by proxying requests.

Here are some solutions:

* Investments that have considerable consequences on other people should be discussed (and potentially vetoed) by all the people involved, including beneficiaries, workers, suppliers, neighbours, and in that case, Internet users.
* Everyone should have enough revenue to live decently, unconditionally. Then nobody will be forced to sell bandwidth or workforce for AI training to survive.
* Artists should be funded for their work, not for what they sell. Then producing low-quality content cannot be more profitable than making art you love.
* Trusts should be divided into smaller, decentralized entities. Big Internet service providers, hosters and social media platforms have too much power over economy and culture. The same goes for entertainment companies, which produce blockbusters and reduce art's diversity.
* Free software should be funded as a common good or a public service, so end users are not forced to see AI and ads appear after updating their system.

All this is part of what I call socialism (and to clarify, there is no dictatorship involved).

Sadly, we also need short-term solutions, so let's introduce a more technical one.

## Toward a decentralized and privacy-sound solution

We will explore ways to provide global, per-human rate limiting, without a central entity and while respecting privacy.

### Linkable ring signatures

[TODO explain LRS]

In short, a linkable ring signature (LRS) lets a member of a set of public keys (the "ring") sign a message without revealing which member produced the signature, while exposing a linking tag that is the same whenever the same key signs, so two signatures by one key can be detected.

Suppose a public identity set, linking public keys to verified unique persons. Each week, each identity can declare a limited number of temporary public keys (TPK) in a signed bundle. At the end of the week, the TPK superset is frozen read-only.

When making a request, a client linked to an identity will choose one of its unused TPKs, plus some number of other people's TPKs picked at random (avoiding picking two TPKs from the same identity). These TPKs form a ring against which the request is signed. The request is sent with the ring and the signature attached.

The server verifies that the ring is included in the TPK superset, then verifies the signature. It stores the signature's linking tag and increments its request count. It responds with a token that authenticates the client and allows its requests to be counted without making new signatures. The client can reuse its token (or sign again with the same ring) as long as it remains identifiable by other means (typically, when using the same IP and user-agent).
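
As a sketch of the server side of this flow (with illustrative names: `lrs_verify` stands for a linkable-ring-signature verifier that is not implemented here, and `WEEKLY_LIMIT` is an arbitrary quota), the bookkeeping could look like this:

```python
import secrets
from collections import defaultdict

WEEKLY_LIMIT = 1000   # requests allowed per linking tag per week (illustrative)
TPK_SUPERSET = set()  # frozen set of all temporary public keys declared this week

def lrs_verify(message: bytes, ring: frozenset, signature: bytes):
    """Placeholder for a linkable-ring-signature verifier: returns the
    signature's linking tag on success, None otherwise. A real implementation
    would come from an LRS scheme, not from this stub."""
    raise NotImplementedError

request_counts = defaultdict(int)  # linking tag -> requests seen this week
tokens = {}                        # bearer token -> linking tag

def handle_signed_request(message: bytes, ring: frozenset, signature: bytes):
    # 1. The ring must only contain keys from the frozen weekly superset.
    if not ring <= TPK_SUPERSET:
        return None
    # 2. Verify the ring signature; the linking tag identifies "same signer"
    #    without revealing which ring member signed.
    tag = lrs_verify(message, ring, signature)
    if tag is None:
        return None
    # 3. Rate-limit per linking tag, i.e. per temporary key.
    request_counts[tag] += 1
    if request_counts[tag] > WEEKLY_LIMIT:
        return None
    # 4. Issue a cheap bearer token so follow-up requests skip signature checks.
    token = secrets.token_urlsafe(32)
    tokens[token] = tag
    return token

def handle_token_request(token: str) -> bool:
    # Subsequent requests only present the token; no new signature needed.
    tag = tokens.get(token)
    if tag is None:
        return False
    request_counts[tag] += 1
    return request_counts[tag] <= WEEKLY_LIMIT
```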
To benefit from caching and amortization of verification, which is possible with some logarithmic schemes, rings may be defined by a public pseudorandom function, so there is a non-negligible chance that many users share the same ring.

Example cost ([eprint 2024/553](https://eprint.iacr.org/2024/553.pdf)): 29 kB per signature with rings of size 1024; verification takes 128 ms, then 0.3 ms after amortization.

### PrivacyPass

https://datatracker.ietf.org/doc/html/rfc9576
https://www.rfc-editor.org/rfc/rfc9577.html

Possible deployment models:

* Joint Attester, Issuer, Origin
  * Attestation request must be LRS.
  * Token can be anything.
* Joint Attester, Issuer
  * Attestation request must contain a tmp pk, signed by LRS.
  * Token must be a certificate of the tmp pk.
* Joint Issuer, Origin
  * Attestation request must be LRS.
  * Attestation request must contain a tmp pk, signed by LRS.
  * Token can be anything.

Retained architecture: joint Issuer and Origin (sketched in code after this list).

* Client generates a tmp key pair.
* Client sends an attestation request with the tmp pk, signed by LRS.
* Attester responds with a timestamped tmp pk certificate (attester pk + tmp pk + time + sig = 136 bytes).
  * No need to be post-quantum here, as the certificate is short-lived.
* Client makes a request to the server.
* Server (issuer) asks for the certificate.
* Client sends the certificate.
* Server (issuer) responds with a token.
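
Here is a minimal sketch of the attester's certificate step using Ed25519, which happens to match the 136-byte figure above (32-byte attester pk + 32-byte tmp pk + 8-byte timestamp + 64-byte signature). The LRS-signed attestation request and the actual PrivacyPass token issuance are left out, and the 300-second lifetime is an arbitrary choice for the example.

```python
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

# Client: generate a temporary key pair (the LRS-signed attestation request
# proving "one TPK per identity" is omitted from this sketch).
tmp_sk = Ed25519PrivateKey.generate()
tmp_pk = tmp_sk.public_key().public_bytes_raw()             # 32 bytes

# Attester: certify the temporary public key with a timestamp.
attester_sk = Ed25519PrivateKey.generate()
attester_pk = attester_sk.public_key().public_bytes_raw()   # 32 bytes
timestamp = int(time.time()).to_bytes(8, "big")             # 8 bytes
signature = attester_sk.sign(tmp_pk + timestamp)            # 64 bytes
certificate = attester_pk + tmp_pk + timestamp + signature  # 136 bytes total

# Server (joint issuer/origin): check the certificate before issuing a token.
def accept_certificate(cert: bytes, trusted_pk: bytes, max_age: int = 300) -> bool:
    a_pk, t_pk, ts, sig = cert[:32], cert[32:64], cert[64:72], cert[72:]
    if a_pk != trusted_pk:
        return False
    if time.time() - int.from_bytes(ts, "big") > max_age:   # short-lived by design
        return False
    try:
        Ed25519PublicKey.from_public_bytes(a_pk).verify(sig, t_pk + ts)
        return True
    except InvalidSignature:
        return False

assert len(certificate) == 136
assert accept_certificate(certificate, attester_pk)
```

Since the certificate only needs to live for a few minutes, a classical signature scheme like Ed25519 is enough here, as the notes above point out.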