The biggest issue with generative AI, at least to me, is that it’s trained on human-made works whose original authors didn’t consent to, or even know about, their work being used to train the AI. Are there any initiatives to address this issue? I’m thinking of something like an open source AI model and training data store that only contains works that are in the public domain or under highly permissive no-attribution licenses, as well as original works submitted by the open source community and explicitly licensed to allow AI training.

I guess the hard part is moderating the database and ensuring all works are licensed properly and people are actually submitting their own works, but does anything like this exist?

  • General_Effort@lemmy.world
    1 day ago

    For images, yes. The most notable is probably Adobe. Their AI, which powers Photoshop’s generative fill among other things, is trained on public domain and licensed works.

    For text, there’s nothing similar. LLMs get better the more data you train them on, so the less training data you use, the less useful they are. I think there are one or a few small models for research purposes, but that really doesn’t get you there.

    Of course, such open source projects are tricky. When you take these extreme, maximalist views of (intellectual) property, then giving stuff away for free isn’t the obvious first step.

    • kadup@lemmy.world
      1 day ago

      It’s also very hard to keep track of licenses for text-based content on the internet. Do most users know what the default licence is for their comments on Reddit? How about Facebook? How about the comments section of a random blog? How about the title of their Medium post? And so on.

      • General_Effort@lemmy.world
        21 hours ago

        The usual arrangement is that the platform can do basically whatever it wants. That shouldn’t really be surprising. But I see your point. If you literally want consent, not just legally licensed material, then you need more than just a clause in the TOS.

        You could raise the same issue with permissively licensed material. People who released it may not have foreseen AI training as a use, and might not have wanted to actually allow it.

        • kadup@lemmy.world
          21 hours ago

          Exactly, the platform owner can usually do everything. Can a third-party crawler? I don’t know.

          • General_Effort@lemmy.world
            18 hours ago

            You mean legally? Yeah, no problem. It depends on the location, though. In the EU, the rights-holder can opt out. So if you want to do it in the EU, you have to pay off Reddit, Meta, and so on. In Japan, it’s fine regardless. In the US, it should turn out similarly, but it’s up to the courts to work out the details, and it’s quite up in the air whether you can trust the system to work.