For the most part I’ve been optimistic about the future of AI, in the sense that I convinced myself it was something we could learn to manage over time. But every single time I hop online to platforms besides fediverse-adjacent ones, I just get more and more depressed.
I have stopped using major platforms and I don’t contribute to them anymore, but as far as I’ve heard, no publicly accessible data - even on the fediverse - is safe. Is that really true? And is there no way to take measures beyond waiting for companies to decide to put people, morals and the environment over profit?
I don’t see how any publicly accessible info could be protected from web crawlers. If you can see it without logging in, or if there’s no verification of who creates an account with access, then any person or data-harvesting bot can see it too.
I wonder if a cipher-based system could work as a defense: images and written text would be presented as gibberish until a user inputs the deciphering code. But then you’d have to figure out how to distribute that code properly and selectively. Perhaps the cipher could even be unique and randomly generated for each IP address?
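For what it’s worth, here’s a minimal sketch of what the per-IP idea could look like, assuming the server keeps a secret and derives a keystream for each visitor (the names SERVER_SECRET and encode_for_ip are made up for illustration):

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)  # hypothetical secret, held only by the server

def keystream(ip: str, length: int) -> bytes:
    """Expand an HMAC of the visitor's IP into `length` key bytes."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(SERVER_SECRET, f"{ip}:{counter}".encode(),
                        hashlib.sha256).digest()
        counter += 1
    return out[:length]

def encode_for_ip(content: bytes, ip: str) -> bytes:
    """XOR content with the per-IP keystream; the same call also decodes."""
    ks = keystream(ip, len(content))
    return bytes(c ^ k for c, k in zip(content, ks))

# The gibberish served to one IP only decodes with that IP's keystream.
scrambled = encode_for_ip(b"original post text", "203.0.113.7")
restored = encode_for_ip(scrambled, "203.0.113.7")
assert restored == b"original post text"
```

The catch is the distribution problem you already pointed out: the client has to receive the key somehow in order to display anything, and a scraper is just another client, so this raises the cost of scraping rather than blocking it outright.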
Cloudflare has AI bot protection, but I don’t know how good it is.
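You can also do a rough version of that yourself at the application layer. Here’s a minimal sketch, assuming a Flask app and assuming the listed user-agent tokens stay accurate; crawlers can lie about their user agent, so this only turns away the polite ones:

```python
from flask import Flask, abort, request

app = Flask(__name__)

# User-agent tokens published by some known AI crawlers. This list is
# illustrative and goes stale; a determined scraper can spoof anything.
AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

@app.before_request
def block_ai_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(token in ua for token in AI_BOT_TOKENS):
        abort(403)  # refuse the request before any route handles it

@app.route("/")
def index():
    return "human-readable content"
```

The same tokens can go in robots.txt as Disallow rules, but that’s purely advisory and depends on the crawler choosing to honor it.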
One of the reasons why I made this post is my dissatisfaction with Pinterest lately, as it was one of the only major platforms I genuinely used. It’s an absolute shithole now; there’s no debate about that. If, for example, a Pinterest-like platform could exist without needing to be visible to non-registered users or search engines (you could still search within the platform yourself), would there be ways and infrastructure to make that platform safe from AI?
By the way, isn’t the whole point of Pinterest to pin other people’s copyrighted images from other parts of the web and make them accessible from within the platform? To me that whole business model sounds similar to what AI companies do: take other people’s content and make your own product out of it.
Yeah, the monetization is what brought about this downfall in the first place. It’s definitely not the same as AI, though: it’s sharing pre-existing work, and unlike most platforms, with credit linking directly to the artist’s socials or webstores. It was especially good for fashion designers and artist-sellers (though that also started going downhill once it got overtaken by dropshippers). I looked around online and already found alternatives such as cosmos.so, which hosts user-owned media rather than just reposts, but my issue is then protecting that media from being scraped by AI en masse.
My first thought for a substitute that meets your criteria is a private Discord server for each of your particular interests/hobbies, but I’m fairly new to Discord and I never really understood Pinterest. In any situation, if someone who is given access to non-public info decides to take that info and run, there’s fundamentally nothing to stop them apart from consequences; e.g. a general who learned classified military secrets and wanted to sell them to an enemy state would (ideally) be deterred by punishment under the law. Sure, you can muddy the water by distributing information that is incomplete or intentionally partially incorrect, but the user/general still needs to know enough to operate without error.
Long side tangent warning:
Is your concern mostly the data being harvested, or when you say “safe from AI” are you also referring to bot accounts pretending to be real users? If the latter is a concern, I’ll first copy a portion of a Reddit comment of mine from a couple of years ago (so possibly outdated given tech advancements), back when I was involved in hunting bot accounts across that site and was asked how to protect against bots on NSFW subs:
“I have been trying for a while to figure out a way for subreddits to require users to pass a captcha before posting and/or commenting; however, I haven’t been able to find a way that works and is also easy for mods to implement. The closest solution I’ve seen is the verification posts which some NSFW subs require. This example is copied from the rules for r/VerifiedFeet:
Verification Required: Start Here
You MUST include “Verify me” or “Verification” in your post title. Verification steps:
1. Submit three photos using Reddit’s multiple image uploader. Show your feet with a written verification containing your username, the date, and r/VerifiedFeet. Add a fold to the paper somewhere and hold it with your hands or feet. Each photo should be from a different angle.
2. Add “Verify Me” or “Verification” in your post title or it won’t be seen by Mods.
3. Wait for a verification confirmation comment from the Mods to begin posting in our community.
Instead of requiring the photos to include the user’s feet, perhaps other subreddits could allow the verification note to be held against any arbitrary object of the user’s choosing. Seeing as the subs which you mod are all text-based without images, my suggestion is to set those subs to Restricted, create a new sub for users to post their verification photos, and then whitelist the users who have done so (as well as the users who had already posted and commented in your sub prior to this).”
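If anyone wants to script that whitelisting step instead of clicking through the mod tools, here’s a minimal sketch using the PRAW library; the credentials and subreddit names are placeholders:

```python
import praw

# Placeholder credentials; create a script-type app at
# https://www.reddit.com/prefs/apps to get real values.
reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    username="MOD_USERNAME",
    password="PASSWORD",
    user_agent="verification-whitelist script",
)

verification_sub = reddit.subreddit("YourVerificationSub")  # hypothetical name
restricted_sub = reddit.subreddit("YourRestrictedSub")      # hypothetical name

# Whitelist everyone whose verification post a mod has already approved.
for post in verification_sub.new(limit=100):
    if post.approved and post.author is not None:
        restricted_sub.contributor.add(post.author)
```

On a Restricted subreddit, approved users can post while everyone else can only read, which matches the setup described above.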
Since AI has advanced enough that higher-end generators can create convincing fakes of verification photos like these, one thing it cannot impersonate yet is meat. As in, you can certainly verify that somebody is a living being if you encounter them in person. Obviously this is limiting for internet use, but on a legal level it is sometimes required for ID verification.
Or if you’re referring to human users posting AI-generated content, whether knowingly or unknowingly, there isn’t any way to fight that apart from setting rules against it and educating the user base.
If Discord isn’t already selling server chat logs as training data, they’re gonna start soon.