I’ve been following the bearblog developer’s struggle to manage the current war between bot scrapers and people who are trying to keep the internet safe and human-oriented. What is Lemmy doing about bot scrapers?

Some context from the bearblog dev:

The great scrape

https://herman.bearblog.dev/the-great-scrape/

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the author’s permission and all content being opt-in by default.

Needless to say, this is unethical. But as Meta has proven, it’s much easier to ask for forgiveness than permission. It is unlikely they will be ordered to “un-train” their next generation models due to some copyright complaints.

Aggressive bots ruined my weekend

https://herman.bearblog.dev/agressive-bots/

It’s more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I’ve blocked close to 2 million malicious requests across several hundred blogs.

What’s wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I’m still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
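To make the IP-rotation point concrete: if each rotated address only sends a handful of requests, per-IP limits never trigger, but grouping the same traffic by ASN can make the pattern visible. Below is a rough Python sketch of that idea. It is not what bearblog actually runs; the function names, the thresholds, and the sample data are all made up for illustration, and it assumes the IP-to-ASN lookup has already been done elsewhere.

```python
from collections import defaultdict


def summarize_by_asn(requests):
    """Group (ip, asn) request records by ASN.

    Returns {asn: (total_requests, distinct_ips)}.
    """
    totals = defaultdict(int)
    ips = defaultdict(set)
    for ip, asn in requests:
        totals[asn] += 1
        ips[asn].add(ip)
    return {asn: (totals[asn], len(ips[asn])) for asn in totals}


def suspicious_asns(requests, min_ips=3, max_requests_per_ip=2.0):
    """Flag ASNs where many distinct IPs each send only a few requests.

    Both thresholds are arbitrary assumptions, not tuned values.
    """
    flagged = []
    for asn, (total, distinct) in summarize_by_asn(requests).items():
        if distinct >= min_ips and total / distinct <= max_requests_per_ip:
            flagged.append(asn)
    return sorted(flagged)


# Hypothetical sample: documentation-range IPs and private-use ASNs.
sample = [
    ("203.0.113.1", 64512),
    ("203.0.113.2", 64512),
    ("203.0.113.3", 64512),
    ("203.0.113.4", 64512),
] + [("198.51.100.7", 64513)] * 5

print(suspicious_asns(sample))  # → [64512]
```

A single busy address (the second ASN in the sample) is something an ordinary per-IP limit already handles, so only the spread-out pattern is flagged here. Real cellular networks also carry plenty of legitimate readers, so a flag like this would be a hint to look closer rather than grounds for a block.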

  • tal@lemmy.today (12 upvotes, 1 day ago)

    If your concern is load, the main option is disabling anonymous access (sadly), which a lot of instances have been doing. Probably also using stuff like Cloudflare and Anubis.

    If your concern is not letting scrapers have access to your posts/comments at all, that isn’t going to happen short of a massive shift away from a publicly-accessible environment. You’re gonna be stuck with private, small forums if you want that; search engines won’t index them, and you’ll have small userbases. On the Threadiverse, if someone wants to harvest the text of your comments and posts, all they have to do is set up an instance, federate, and subscribe to every community on every instance. They don’t need to scrape at all. The only reason that bots are scraping at all is that it isn’t worth the effort, at the current scale of the Threadiverse, to bother writing special-case code for the Threadiverse to obtain text via the federated instance route.

    • turdas@suppo.fi (2 upvotes, 1 day ago)

      Load is what really sucks about scraping IMO, and I wonder if the fediverse’s design makes it more or less susceptible to load precisely because the scrapers can just set up their own instances and get all data through there by federation. Time will tell, I suppose.
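
On the load side of this thread: tools like Anubis, mentioned above, work by making each visitor's browser do a small amount of proof-of-work before a page is served, which costs one human almost nothing but adds up across millions of scraper requests. Here is a minimal Python sketch of that general idea. It is not Anubis's actual code or protocol; the function names, the challenge format, and the difficulty value are assumptions picked for illustration.

```python
import hashlib
import secrets


def make_challenge():
    """Server side: issue a random challenge string (32 hex characters)."""
    return secrets.token_hex(16)


def is_valid(challenge, nonce, difficulty):
    """Server side: one cheap hash checks the client's answer.

    The answer is valid if SHA-256(challenge + nonce) starts with
    `difficulty` zero hex digits.
    """
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge, difficulty):
    """Client side: brute-force nonces until one passes the check."""
    nonce = 0
    while not is_valid(challenge, nonce, difficulty):
        nonce += 1
    return nonce


challenge = make_challenge()
nonce = solve(challenge, difficulty=3)  # about 4096 hashes on average
print(is_valid(challenge, nonce, difficulty=3))  # → True
```

The asymmetry is the point: verifying takes one hash, solving takes thousands, and each extra zero digit of difficulty multiplies the client's expected work by 16. None of this stops someone from pulling the same text through federation, as tal points out.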