
Be careful with what you install on your phones.

If the findings in this article are true, then most of the bot traffic that has recently taken down many small and independent websites (code forges first and foremost) comes from a fairly sophisticated network of scrapers sold as a service.

My small Forgejo instance also experienced brief downtime and slowness a couple of weeks ago, but luckily nothing compared to the GNOME and KDE instances (which had to deploy aggressive CAPTCHAs to mitigate the flood). Basically anything that isn’t behind Cloudflare is a potential victim.

The pattern is mostly the same in all these cases: the requests come from residential IP addresses and carry legitimate-looking user agents that don’t identify themselves as bots, let alone honor robots.txt files. That makes life very hard for the sysadmins who try to block this traffic.

These bots also request heavy pages (such as git logs and blames) in large volumes, which has taken down a lot of GitLab, Gitea and Forgejo instances.
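To see why blocking by address struggles against this pattern, here is a toy sketch in Python. Everything in it is made up for illustration (the `PerIpLimiter` class, the limit of 10 requests, the placeholder addresses); it is not how any particular forge or proxy actually throttles traffic.

```python
from collections import defaultdict


class PerIpLimiter:
    """Naive limiter: at most `limit` requests per IP within one time window."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = defaultdict(int)

    def allow(self, ip: str) -> bool:
        # Count the request, then check whether this IP is still under the limit.
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit


def served(limiter: PerIpLimiter, ips: list[str]) -> int:
    """Return how many requests in the sequence `ips` get through."""
    return sum(limiter.allow(ip) for ip in ips)


# A classic scraper: 1000 requests from one address (placeholder IP).
single_ip = ["203.0.113.7"] * 1000

# The same volume spread over 1000 distinct addresses, one request each
# (placeholder IPs standing in for rotating residential devices).
rotating = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]

print(served(PerIpLimiter(limit=10), single_ip))  # 10 requests served
print(served(PerIpLimiter(limit=10), rotating))   # all 1000 requests served
```

The single-address scraper is cut off after 10 requests, while the rotating one never trips the limit, even though the server does the same amount of expensive work in both cases.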

The business model behind this phenomenon seems to be quite sophisticated.

As a developer of a mobile app, I can include the SDK of a product like Infantica in my code. It doesn’t even have to be my own app. There have been cases where other people’s apps were simply repackaged with these SDKs and redistributed on app stores.

That SDK in turn transforms any device it’s been installed on into a member of a vast botnet without the user’s consent or knowledge.

The customers are usually companies that want to train large AI models, but can’t afford the costs (or simply don’t want to pay them, or have a limited pool of IP addresses for scraping that may easily be blocked by sysadmins).

What they do then is pay companies like Infantica to leverage infected devices (i.e. mostly mobile phones with apps that include their SDK) to scrape the web for them and push data wherever they want.

Developers who include the SDK in their apps also get a share of the pie - hence the financial incentive to repackage and redistribute even third-party apps with the offending SDK: minimize the development effort, maximize the revenue.

Of course, the commands that “customers” can send to the botnet aren’t limited to scraping and training for AI purposes. It’s just that this is what currently pays best (it used to be crypto mining until a while ago). In theory, nothing prevents them from sending commands to access anything on the infected devices. Of course, companies like Infantica claim that they do their due diligence and scan all usages of their products to prevent abuse, but when a company already has such low moral standards, you know how much their claims are worth.

Note that what until a couple of years ago would have been called “a zombie device infected with nasty malware that turns it into a botnet member at the mercy of whatever the best paying customer wants to do with it” has now been repackaged as a legit business product with its own business jargon. They are now called “residential rotating IP addresses that form an insightful peer-to-business network”.

And the volumes are scary too. Infantica claims that it can sell access to nearly 250K IPs in the US alone. With a US population of roughly 340 million, that’s about one American in 1,400. And when you take into account that dozens of companies operate in the same sector, the numbers become scarier still.

Unfortunately, it’s hard for non-technical users to know which apps run such SDKs, and whether such apps are already installed on their phones. But there are a few precautions that can be taken to mitigate the risk.

First, avoid mobile apps when possible. Their potential abuse as AI scrapers is only the latest threat that they pose. They have a lot of privileges once installed and a huge attack surface. It’s ok to have an app for your camera. Whether it makes sense to have an app to check discounts at your local store is debatable. Use websites instead of apps whenever possible. Many of them can be installed on your phone almost like a full app as PWAs (progressive web apps), but since those web apps will always be sandboxed inside your browser, they can’t do much damage. And always, always avoid, whenever possible, products whose website is a single “Download our app” page. There’s a reason why we decided that an open web is better than a bunch of closed apps, and we should punish those who don’t agree with that reason.

When you have no choice but to install an app, always look for comparable alternatives on e.g. F-Droid. Apps on open-source stores are subject to much more scrutiny than whatever crap is uploaded to the Android and Apple stores. Each app is monitored for any external connections, and those are marked as anti-features. Plus, each app is forced to share its source code.

Google and Apple bear a large share of the responsibility for this mess. If an Android SDK exists that turns phones into botnet zombies that can run arbitrary payloads, then that SDK should be considered malware. Period. Any app that includes that SDK in its dependencies or includes any of those packages should be automatically flagged and removed from the store. The fact that this doesn’t happen, and that millions of people today run infected software downloaded from legitimate app stores on their phones, means that Google and Apple are either grossly negligent or grossly corrupt - in either case, they can’t be trusted for the safety of the software you download from their stores.

And, when you have no choice but to get an app from an official store, always prefer alternative store frontends like Aurora, which at least scans the apps from the Play Store and transparently informs you about any trackers and data access patterns.

Finally, I disagree with the final position taken in this article - that every form of web scraping should be considered abusive behaviour. Scraping is one of the foundational pillars of the Web as we know it today. And the vision of a Web accessible to both humans and machines is a foundational pillar of the Semantic Web. Scraping itself isn’t the problem. But, for scraping to be a game where everyone wins, two issues must be solved:

First, the right to scrape needs to be symmetric. If Google, Meta or Microsoft can freely scrape my websites to train whatever AI-hyped bullshit they want to train with it, then I also have the right to scrape their services. If instead they can eat my blog’s RSS feed or my monthly code commits for breakfast, but scraping my Facebook homepage to automatically expose my friends’ birthdays through another service may result in my account being banned, then we have a problem.

Second, there is the unfortunate alignment of financial incentives and impunity that has turned what until a couple of years ago was basically a criminal activity (installing malware on people’s devices) into a legitimate business model with shiny business-friendly websites and account managers. I don’t mind a world where bots identify themselves as bots through standard user agents (so I can easily block them if I want to), respect my robots.txt settings, and sensibly throttle their requests. But I have a problem with a world where all these gentlemen’s agreements are broken, where the costs of training expensive AI models are so explicitly externalized and paid by thousands of independent Web administrators through electricity, performance degradation and downtime, where those who break the rules are free to operate as listed companies instead of being in jail, and where their malware is allowed to spread through standard software distribution channels.
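For reference, the gentlemen’s agreements mentioned above (identify yourself, respect robots.txt, throttle your requests) are cheap to implement. What follows is a minimal sketch using Python’s standard `urllib.robotparser`, not a complete crawler. The bot name, the info URL, the one-second default delay and the `PoliteFetchPolicy` class are all hypothetical names chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: a real bot would point to a page describing itself.
USER_AGENT = "ExampleBot/1.0 (+https://example.org/bot-info)"


class PoliteFetchPolicy:
    """Decides whether and when a self-identified bot may request a URL."""

    def __init__(self, robots_txt: str, user_agent: str = USER_AGENT,
                 default_delay: float = 1.0):
        self.user_agent = user_agent
        self.parser = RobotFileParser()
        # Parse the rules from text so the sketch runs without network access.
        self.parser.parse(robots_txt.splitlines())
        # Honor Crawl-delay if the site sets one, else fall back to a default.
        delay = self.parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_time(self, now: float) -> float:
        """Seconds the bot should still wait before its next request."""
        if self._last_request is None:
            return 0.0
        return max(0.0, self._last_request + self.delay - now)

    def record_request(self, now: float) -> None:
        """Remember when the last request was sent."""
        self._last_request = now


# Example rules a forge admin might publish (made up for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

policy = PolitePolicy = PoliteFetchPolicy(ROBOTS_TXT)
print(policy.allowed("https://example.org/private/repo"))  # False
print(policy.allowed("https://example.org/public/repo"))   # True
```

A bot following this policy would send `USER_AGENT` with every request, skip anything for which `allowed()` returns false, and sleep for `wait_time()` between requests. None of that is enforced by the protocol, which is exactly why it only works when everyone plays fair.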

https://jan.wildeboer.net/2025/04/Web-is-Broken-Botnet-Part-2/

Fabio Manganiello on Nostr: Be careful with what you install on your phones. If the findings in this article are ...