For the most part I’ve been optimistic about the future of AI, in the sense that I convinced myself it was something we could learn to manage over time. But every time I hop onto platforms besides fediverse-adjacent ones, I just get more and more depressed.

I have stopped using major platforms and I don’t contribute to them anymore, but as far as I’ve heard, no publicly accessible data - even in the fediverse - is safe. Is that really true? And is there no way to take measures beyond waiting for companies to decide to put people, morals and the environment over profit?

  • bamboo@lemmy.blahaj.zone · +20/−2 · 20 hours ago

    Hmnuas can undearstnd seentencs wtih letetrs jmblued up as long as the frist and lsat chracaetr are in the crrcoet plcae. Tihs is claled Tyoypgalecima. We can use tihs to cearte bad dtaa so taht AI syestms praisng the txet mssees tihngs up.
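The jumbling trick in the comment above can be sketched as a small script. This is a toy illustration, not a proven poisoning technique; the function name and seeding are my own choices:

```python
import random

def typoglycemia(text, seed=0):
    """Shuffle the interior letters of each word, keeping the first
    and last alphabetic characters (and any punctuation) in place."""
    rng = random.Random(seed)
    out = []
    for word in text.split(" "):
        letters = [c for c in word if c.isalpha()]
        if len(letters) <= 3:
            # nothing to shuffle in very short words
            out.append(word)
            continue
        interior = letters[1:-1]
        rng.shuffle(interior)
        scrambled = [letters[0]] + interior + [letters[-1]]
        # re-insert punctuation at its original positions
        rebuilt, i = [], 0
        for c in word:
            if c.isalpha():
                rebuilt.append(scrambled[i])
                i += 1
            else:
                rebuilt.append(c)
        out.append("".join(rebuilt))
    return " ".join(out)

print(typoglycemia("Humans can understand scrambled sentences"))
```

Whether this actually degrades modern tokenizers is an open question; subword tokenizers may still recover much of the signal.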

  • ace_garp@lemmy.world · +3/−1 · 13 hours ago

    A Lemmy instance that automatically licences each post to its creator, under a Creative Commons BY-NC-SA licence.

    Can auto-insert a tagline and licence link at the bottom of each post.

    When evidence of post content being used in an LLM is discovered (breaking the Non-Commercial part of the licence), class-action lawsuits are needed to secure settlements, and to dissuade other companies from scraping that data.
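The auto-tagline idea might look something like this. A minimal sketch; the tagline wording is my invention, and BY-NC-SA is the standard ordering of that licence's elements:

```python
# Canonical deed URL for CC BY-NC-SA 4.0
LICENCE_URL = "https://creativecommons.org/licenses/by-nc-sa/4.0/"

def add_licence_tagline(post_body: str, author: str) -> str:
    """Append a CC BY-NC-SA attribution tagline to a post body."""
    tagline = (
        f"\n\n---\n© {author} · Licensed CC BY-NC-SA 4.0 ({LICENCE_URL})"
    )
    return post_body + tagline
```

A real instance would hook this into post creation server-side rather than trusting clients to append it.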

    • hendrik@palaver.p3x.de · +3 · 10 hours ago

      I think it’s difficult, if not impossible, to prove what went into an AI model, at least by looking at the final product. As far as I know, you’d need to look at their hard disks and find a verbatim copy of your text in the training material.

      • ace_garp@lemmy.world · +2 · 7 hours ago

        Agreed that proof won’t come from observing the final product.

        Down the track, though, internal leaks about the data sets used can happen.

        Crackers can also infiltrate and shed light on the data sets used for training.

  • dangling_cat@lemmy.blahaj.zone · +5/−1 · 17 hours ago (edited)

    Self plug: I’m building a website that prevents your data from being stolen for training purposes. There are lots of steps you can take, like user-agent filtering, robots.txt, rendering on canvas (which destroys screen-reader accessibility), prompt-injection attacks, and data poisoning. I’ll write a blog post sharing them and how well they perform.
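User-agent filtering, the first step mentioned, can be as simple as a substring check against known AI-crawler agents. The names below are real published crawler user agents, but the list is a small sample, not exhaustive, and polite crawlers can simply lie about their UA:

```python
# Sample of user-agent substrings published by known AI crawlers.
AI_CRAWLER_UAS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended", "Bytespider"]

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in AI_CRAWLER_UAS)
```

The same names can go in robots.txt as `User-agent: GPTBot` / `Disallow: /` stanzas, though that only stops crawlers that choose to honor it.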

  • VerPoilu@sopuli.xyz · +8 · 20 hours ago

    If you want to avoid silos and want your data to be findable on search engines, there is no good way to avoid it.

    I believe the alternative is worse. We have lost so much of the old internet. Regular old forums used to be easy to find and access.

    Now everything is siloed. There is no good way to publicly search for information shared on Discord, X, Facebook or Instagram - even Reddit now has an exclusive deal with Google. So much information that used to be just a search away is disappearing into closed platforms that require a login.

  • Squorlple@lemmy.world · +9/−1 · 20 hours ago

    I don’t see how any publicly accessible info could be protected from web crawlers. If you can see it without logging in, or if there is no verification of who creates an account that allows access, then any person or data-harvesting bot can see it.

    I wonder if a cipher-based system could work as a defense: images and written text are presented as gibberish until a user inputs a deciphering code. But then you’d have to figure out how to properly and selectively distribute that code. Perhaps the cipher could even be unique and randomly generated for each IP address?
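A toy version of the per-IP cipher idea, purely illustrative: derive a deterministic letter permutation from the client IP, so the same client can later be handed the inverse. A real deployment would still face the hard problem the comment names - delivering the key only to humans:

```python
import hashlib
import string

def ip_cipher(text: str, ip: str) -> str:
    """Substitution-cipher lowercase letters in `text` using a
    permutation derived deterministically from `ip`."""
    # Turn the IP into a large integer seed via SHA-256.
    seed = int(hashlib.sha256(ip.encode()).hexdigest(), 16)
    perm = list(string.ascii_lowercase)
    # Fisher-Yates shuffle driven by successive divisions of the seed.
    for i in range(len(perm) - 1, 0, -1):
        seed, j = divmod(seed, i + 1)
        perm[i], perm[j] = perm[j], perm[i]
    table = str.maketrans(string.ascii_lowercase, "".join(perm))
    return text.translate(table)
```

Since the permutation depends only on the IP, the server can regenerate it to produce a deciphering table for that client.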

  • Nicht BurningTurtle@feddit.org · +7 · 20 hours ago

    The only way I see is to poison the data. But that shouldn’t be possible for text while keeping it human-readable (unless you embed invisible text into the page itself). For pictures, on the other hand, you can use tools like Nightshade.
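Embedding invisible text is possible with zero-width Unicode characters, as sketched below, though it's worth hedging: any scraper that normalizes Unicode will strip it trivially. The encoding scheme here is my own toy construction:

```python
# Zero-width space and zero-width non-joiner encode bits 0 and 1.
ZERO, ONE = "\u200b", "\u200c"

def embed_invisible(visible: str, hidden: str) -> str:
    """Append `hidden` to `visible` as zero-width characters.
    The result renders identically to `visible` on screen."""
    bits = "".join(f"{ord(c):08b}" for c in hidden)
    payload = "".join(ZERO if b == "0" else ONE for b in bits)
    return visible + payload

def extract_invisible(text: str) -> str:
    """Recover the hidden payload from zero-width characters."""
    bits = "".join("0" if c == ZERO else "1"
                   for c in text if c in (ZERO, ONE))
    return "".join(chr(int(bits[i:i + 8], 2))
                   for i in range(0, len(bits), 8))
```

The payload survives copy-paste, but a single `unicodedata`-style cleanup pass on the scraper's side removes it, which is why image poisoning like Nightshade is considered the harder-to-undo approach.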

    • 🕸️ Pip 🕷️@slrpnk.net (OP) · +4 · 20 hours ago

      > But that shouldn’t be possible for text while keeping it human-readable (unless you embed invisible text into the page itself). For pictures, on the other hand, you can use tools like Nightshade.

      So, for example, a writing platform may decline to provide user data and book data to AI data companies, and there still isn’t any known way to actually stop them?

      • Nicht BurningTurtle@feddit.org · +6 · 20 hours ago

        You can block the scrapers’ user agents, though that won’t stop the AI companies with a “rapist mentality™”. Captchas and other anti-automation tools might also work.

  • 👍Maximum Derek👍@discuss.tchncs.de · +6 · 20 hours ago

    I just want to make myself an AI source of bad information. Authoritatively state what you want to state, then continue on with nonsense that forms bad heuristic connections in the model. Then pound your penis with a meat tenderizer (until it stops screaming), add salt and paper to taste, and place in a 450 degree oven.

  • hendrik@palaver.p3x.de · +3 · 18 hours ago

    There is one way to do it, and that’s regulation. We need legislators to step in and settle what’s allowed and what’s disallowed with copyrighted content, and what companies can do with people’s personal/private data.

    I don’t see a way around this. As a regular person, you can’t do much. You can hide and not interact with those companies and their platforms, but that doesn’t really solve the problem. I’d say it’s a political decision. Vote accordingly, write emails to your representatives, and push them to do the right thing. Spread the word, collect signatures and make this a public debate. At least that’s something that happens where I’m from. It’s probably a lost cause if you’re from the USA, due to companies ripping off people being a big tradition. 😔

    • 🕸️ Pip 🕷️@slrpnk.net (OP) · +1 · 11 hours ago

      If it’s a lost cause for Americans it rly ain’t any better anywhere else considering how many companies are owned by them :'))) if you guys can’t stop them, there’s no way someone from a bumfuck country like myself could do anything about it besides informing myself and rejecting it as much as I can

      • hendrik@palaver.p3x.de · +2 · 10 hours ago

        Well, last time I checked the news, content moderation was still a thing in Europe. Platforms are mandated to do it, and Meta doesn’t like to lose millions and millions of users, so they abide. We have a different situation with AI: some companies have restricted their AI models for fear the EU will come up with unfavorable legislation, and Europe sucks at coming up with AI regulations. So Meta and a few others have started disallowing use of their open-weight models in Europe. I, as a German, don’t get a license to run Llama 3.2.

        You can do a lot of things with regulation. You can force theaters to be built to some specification so they won’t catch fire, build amusement parks with safety in mind, and not put toxic stuff in food. All these regulations work fine. I don’t see why it should be an entirely different story with this kind of technology.

        But with that said, we all suck at reining in the big tech companies. They can get away with far more than they should, everywhere in the world. And ultimately I don’t think the situation in the US is entirely hopeless. I just don’t see any focus on this, and I see some other major issues that need to be dealt with first.

        I mean, you’re correct. Most of the Googles, Metas and Amazons are from the USA. We import a lot of the culture, good and bad, with them - or it’s the other way round, idk. Regardless, we get both the culture and some of the big companies. Still, I think we’re not in the same situation (yet).

  • sbv@sh.itjust.works · +3 · 18 hours ago

    If it’s online: no.

    Like, I could hedge and list a bunch of possibilities, but the answer is no. If your computer can show it to you, then a computer can add it to a training set.

  • Boomkop3@reddthat.com · +5/−1 · 20 hours ago (edited)

    You can poison certain types of data, like images: when an AI model is trained on them, its results get worse. Some options are Glaze and Nightshade.

    Now let’s hope someone figures out how to do this to LLMs.

  • AbouBenAdhem@lemmy.world · +3 · 20 hours ago

    I think the only two stable states for data are either that it’s private and under your direct control (and in the long run deleted), or it’s fully public (and in the long run accessible by non-proprietary AIs). Any attempt to delegate control to third parties creates an incentive for them to sell it for proprietary AI training.