• Matriks404@lemmy.world · 2 hours ago

    They’d better not attack too hard, because all of the internet is built on FOSS infrastructure, and it might stop working, lol.

  • naught@sh.itjust.works · 4 hours ago

    I have a small site that mirrors Hacker News but with dark mode and stuff, and it is getting blasted by bot traffic. All the data is freely available from the official API, but they’re scraping my piddling site, which runs on an anemic VPS, because it looks like user-generated content. The bot protection from my provider does little to help. Gonna have to rethink my whole architecture now. Very annoying.
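
    For what it’s worth, the data really is trivially available without scraping anything. A minimal sketch of pulling the front page from the official Firebase API (assumes the requests library; the rendering side is up to you):

        # Sketch: fetch front-page items from the official Hacker News API
        # (https://hacker-news.firebaseio.com/v0) instead of scraping a mirror.
        import requests

        BASE = "https://hacker-news.firebaseio.com/v0"

        top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()[:30]
        for item_id in top_ids:
            item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()
            print(item.get("score"), item.get("title"), item.get("url"))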

  • some_guy@lemmy.sdf.org · 17 hours ago

    In a blog post called “AI crawlers need to be more respectful”, they claim that blocking all AI crawlers immediately decreased their traffic by 75%, going from 800 GB/day to 200 GB/day. This saves the project around $1,500 a month.

    “AI” companies are a plague on humanity. From now on, I’m mentally designating them as terrorists.

  • LiveLM@lemmy.zip · 18 hours ago (edited)

    If you’re wondering if it’s really that bad, have this quote:

    GNOME sysadmin Bart Piotrowski kindly shared some numbers to let people fully understand the scope of the problem. According to him, in around two and a half hours they received 81k total requests, and out of those only 3% passed Anubis’s proof of work, hinting at 97% of the traffic being bots.

    And this is just one quote. The article is full of quotes from people all over reporting that they can’t focus on their work, either because the infra they rely on is constantly down or because they’re the ones fighting to keep it functional.

    This shit is unsustainable. Fuck all of these AI companies.

      • nutomic@lemmy.ml · 3 hours ago

        Cache size is limited, and a cache can usually only hold the most recently viewed pages. But these bots go through every single page on the website, even old ones that users never view. Since they only send one request per page, caching doesn’t really help.
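
        A toy illustration of why one-request-per-page traffic defeats an LRU cache (a generic sketch, not any particular server’s cache):

            # Toy model: an LRU cache sees ~0% hits from a crawler that visits
            # every page exactly once; only repeat visits to hot pages would hit.
            from collections import OrderedDict

            class LRUCache:
                def __init__(self, capacity):
                    self.capacity = capacity
                    self.store = OrderedDict()
                    self.hits = 0
                    self.misses = 0

                def get_or_render(self, page):
                    if page in self.store:
                        self.hits += 1
                        self.store.move_to_end(page)
                    else:
                        self.misses += 1                    # miss: render from the database
                        self.store[page] = f"rendered {page}"
                        if len(self.store) > self.capacity:
                            self.store.popitem(last=False)  # evict least recently used

            cache = LRUCache(capacity=1_000)
            for page in range(100_000):                     # crawler: every page, exactly once
                cache.get_or_render(page)
            print("crawler hit rate:", cache.hits / (cache.hits + cache.misses))  # ~0.0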

      • LiveLM@lemmy.zip · 16 hours ago

        I’m sure that if it was that simple people would be doing it already…

  • comfy@lemmy.ml · 16 hours ago (edited)

    One of my sites was close to being DoS’d by OpenAI’s crawler, along with a couple of other crawlers. Blocking them made the site much faster.

    I’ll admit that the software’s design of offering search suggestions as HTML links didn’t exactly help (it’s FOSS software used by hundreds of sites, so this issue likely applies to similar sites), but their rapid request rate turned this from pointless queries into a negligent security threat.
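
    For anyone in the same boat, the blocking part can be as simple as refusing the user agents these crawlers advertise. A sketch (Flask is just an example framework here, and the substring list is illustrative, not exhaustive):

        # Sketch: refuse requests from self-identified AI crawlers.
        from flask import Flask, abort, request

        app = Flask(__name__)
        BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

        @app.before_request
        def block_ai_crawlers():
            ua = request.headers.get("User-Agent", "")
            if any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS):
                abort(403)  # only helps against crawlers that announce themselves

        @app.route("/search")
        def search():
            return "search results here"

        if __name__ == "__main__":
            app.run()

    Of course this only works while the crawlers identify themselves; anything pretending to be a regular browser needs rate limiting or proof of work instead.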

  • gon [he]@lemm.ee · 20 hours ago

    Great write-up by Niccolò.

    I actually agree with the commenter on that post: the lack of quoting and the use of images is pretty bad, especially for screen readers (which I use), and not directly linking sources (though they are made clear regardless) is a bit of a pain.

    • WorkingLemmy@lemmy.world (OP) · 19 hours ago

      Definitely agree. Love TheLibre, as it covers subjects I don’t see hit on as often, but the lack of actual links to sources and proper quotes blows.

  • beeng@discuss.tchncs.de · 20 hours ago

    You’d think these centralised LLM search providers (e.g. Perplexity or Claude) would be caching a lot of this stuff.

    • droplet6585@lemmy.ml · 18 hours ago

      There are two prongs to this:

      1. Caching is an optimization strategy used by legitimate software engineers. AI dorks are anything but.

      2. Crippling information sources outside of service means information is more easily “found” inside the service.

      So if it was ever a bug, it’s now a feature.

      • jacksilver@lemmy.world · 13 hours ago

        Third prong: constantly looking for new information. Yeah, most of these sites may be basically static, but it’s probably cheaper and easier for them to just constantly recrawl everything.

    • fuckwit_mcbumcrumble@lemmy.dbzer0.com · 18 hours ago

      They’re absolutely not crawling it every time they need to access the data. That would be an incredible waste of processing power on their end as well.

      In the case of code, though, that does change somewhat often. They’d still need to check whether the code has been updated, at the bare minimum.
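
      For comparison, a polite recrawl would use conditional requests, so an unchanged page costs almost nothing to serve. A sketch using the standard ETag / If-None-Match mechanism (the URL is a placeholder):

          # Sketch: conditional re-fetch. If nothing changed, the server answers
          # 304 Not Modified with no body instead of re-sending the whole page.
          import requests

          url = "https://example.org/some/page"
          first = requests.get(url, timeout=10)
          etag = first.headers.get("ETag")

          recheck = requests.get(
              url,
              headers={"If-None-Match": etag} if etag else {},
              timeout=10,
          )
          if recheck.status_code == 304:
              print("unchanged, reuse cached copy")
          else:
              print("changed, re-process", len(recheck.content), "bytes")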

  • jagged_circle@feddit.nl · 17 hours ago (edited)

    Sad there’s no mention of running an Onion Service. That has built-in PoW for DoS protection, so you don’t have to be an asshole and block all of Brazil or China, or all Edge users (config sketch below).

    Just use Tor, silly sysadmins
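
    For anyone who wants to try it: recent Tor releases (0.4.8+) expose the PoW defense as a few torrc options. The option names and values below are from memory of the Tor manual, so treat this as a sketch and double-check against the docs before relying on it:

        HiddenServiceDir /var/lib/tor/my_service/
        HiddenServicePort 80 127.0.0.1:8080
        HiddenServicePoWDefensesEnabled 1
        HiddenServicePoWQueueRate 250
        HiddenServicePoWQueueBurst 2500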

    • Max-P@lemmy.max-p.me · 17 hours ago

      Proof of work is what those modern captchas tend to do, I believe. Not useful for stopping account creation and such, but very effective at stopping crawlers.

      Have the same problem at work, and Cloudflare does jack shit about it. Half that traffic uses user agents that have no chance of even supporting TLS 1.3: I see some IE5, IE6, Opera with its old Presto engine, I’ve even seen Netscape. Complete and utter bullshit. At this point, if you’re not on an allow list of known common user agents or logged in, you get a PoW captcha.
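
      The core of a PoW challenge is tiny, by the way. A toy version of the general idea (not Anubis’s or any vendor’s actual scheme): the server issues a random challenge, the client grinds nonces until the hash has enough leading zero bits, and the server verifies with a single hash.

          # Toy proof of work: find a nonce so that sha256(challenge + nonce)
          # starts with `difficulty` zero bits. Costly to solve, one hash to verify.
          import hashlib
          import os

          def pow_ok(challenge: bytes, nonce: int, difficulty: int) -> bool:
              digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
              return int.from_bytes(digest, "big") >> (256 - difficulty) == 0

          def solve(challenge: bytes, difficulty: int) -> int:
              nonce = 0
              while not pow_ok(challenge, nonce, difficulty):
                  nonce += 1
              return nonce

          challenge = os.urandom(16)               # issued by the server per visitor
          nonce = solve(challenge, difficulty=18)  # done client-side, takes a moment
          assert pow_ok(challenge, nonce, 18)      # server-side check: one hash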

      • lightnegative@lemmy.world · 1 hour ago

        If I were a bot author intent on causing misery, I’d just use the user agent from the latest version of Firefox/Chrome/Edge that legitimate users would use.

        It’s just a string controlled by the client at the end of the day, and I’m surprised the GPT and OpenAI bots announce themselves in it. Attaching meaning to it on the server side is always going to be problematic if the client controls the value.
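
        A sketch of how little that “spoofing” takes (the UA string below is just an example copied from a desktop Firefox):

            # The User-Agent is just a header the client fills in however it likes.
            import requests

            requests.get(
                "https://example.org/",
                headers={
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; "
                                  "rv:124.0) Gecko/20100101 Firefox/124.0"
                },
                timeout=10,
            )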

      • jagged_circle@feddit.nl · 11 hours ago

        Yeah, but Tor’s doesn’t require JavaScript, so you don’t have to block at-risk users and oppress them further.