OpenAI strikes Reddit deal to train its AI on your posts

return2ozma@lemmy.world · 5 months ago

Blackmist@feddit.uk · 4 months ago

They always were.

Only now they’ve agreed to pay Reddit for it. This is what their third-party lockdown was really all about.

They’re helping themselves to your Lemmy comments for free, as that’s just how it’s designed. If you post anything publicly anywhere, it’s getting slurped up by a bot somewhere.

just another dev@lemmy.my-box.dev · 4 months ago

I’m not a lawyer. But isn’t the reason they had to go to reddit to get permission that users hand over ownership to reddit the moment you post? And since there’s no such clause on Lemmy, they’d have to ask the actual authors of the comments for permission instead?

Mind you, I understand there’s no technical limitation that prevents bots from harvesting the data, I’m talking about the legality. After all, public does not equate to public domain.

GamingChairModel@lemmy.world · 4 months ago

users hand over ownership to reddit the moment you post

Not ownership. Just permission to copy and distribute freely. Which basically is necessary to run a service like this, where user-submitted content is displayed.

And since there’s no such clause on Lemmy, they’d have to ask the actual authors of the comments for permission instead?

It’s more of a fuzzy area, but simply by posting on a federated service you’re agreeing to let that service copy and display your comments, and sync with other servers/instances to copy and display your comments to their users. It’s baked into the protocol, that your content will be copied automatically all over the internet.

Does that imply a license to let software be run on that text? Does it matter what the software does with it, like displaying the content in a third-party mobile app? What about when it engages in text-to-speech or braille conversion for accessibility? Or indexes the page for a search engine? Does AI training make any difference at that point?

The fact is, these services have APIs, and the APIs allow for the efficient copying and ingest of the user-created information, with metadata about it, at scale. From a technical perspective obviously scraping is easy. But from a copyright perspective submitting your content into that technical reality is implicit permission to copy, maybe even for things like AI training.

Alimentar@lemm.ee · 4 months ago

Well, even if it was a legal argument, they wouldn’t care. Like Facebook and all the rest. They say they don’t share your data, but we all know that’s a lie.

interdimensionalmeme@lemmy.ml · 4 months ago

They are public communication platforms, how could they not share your data publicly?

Everythingispenguins@lemmy.world · 5 months ago

Some day historians will look back at this moment and determine it was what caused ChatGPT to become horny and weird.

assassin_aragorn@lemmy.world · 4 months ago

Only an idiot would decide to mindlessly trawl Reddit to train an LLM. They’ll be confused when their model suddenly is confidently wrong about everything and have no clue.

Everythingispenguins@lemmy.world · 4 months ago

You are a hundred percent right, but how many idiots are there out there?

assassin_aragorn@lemmy.world · 4 months ago

Uncountably many

noorbeast@lemmy.zip · 5 months ago

Finally found a use for MS Edge, loaded up Nuke Reddit History and removed all comments and posts: https://microsoftedge.microsoft.com/addons/detail/nuke-reddit-history/bklbcgohenjegdibgmppligaapohkgip

gravitas_deficiency@sh.itjust.works · 4 months ago

Hate to break it to you, but the time to do that was over a year ago, and even then it wasn’t ever really a sure thing - we don’t really know what their backup policies are around that stuff.

This is what the former power user community that made an exodus from Reddit roughly a year ago has been trying to communicate, but a ton of people here seem to enjoy keeping their toes in the water over there, with rather predictable consequences (literally, the post we’re commenting on).

All that said: I am very much looking forward to the absolutely titanic lawsuit around GDPR I’m sure is in the works over this.

AlexWIWA@lemmy.ml · 4 months ago

Not even a year ago. Reddit has been used for training data for well over a decade. We used it in 2012 in an AI class.

gravitas_deficiency@sh.itjust.works · 4 months ago

My point is that there was not a revenue-generating b2b contract allowing another company to exploit it at scale, while compensating Reddit directly.

AlexWIWA@lemmy.ml · 4 months ago

LLMs have been training on Reddit posts since at least 2012. Nothing really new here.

UnderpantsWeevil@lemmy.world · 4 months ago

It’s ground zero for Bots training on other Bots

myliltoehurts@lemm.ee · 4 months ago

So they filled reddit with bot generated content, and now they’re selling back the same stuff likely to the company who generated most of it.

At what point can we call an AI inbred?

orca@orcas.enjoying.yachts · 4 months ago

This is actually a thing. It’s called “Model Collapse”. You can read about it here.

FaceDeer@fedia.io · 4 months ago

“Model collapse” can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
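
A minimal sketch of that mixing idea, assuming the training set is just a list of text samples. The function name `build_training_mix`, the `human_fraction` parameter, and the 50/50 default are all hypothetical and only for illustration; this is not a description of how any real model is trained, and the right ratio is an open question.

```python
import random


def build_training_mix(human_posts, synthetic_posts,
                       human_fraction=0.5, size=1000, seed=0):
    """Sample a training set with a fixed share of human-written text.

    All names and defaults here are illustrative assumptions.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")

    rng = random.Random(seed)
    n_human = round(size * human_fraction)
    n_synthetic = size - n_human

    # Sample with replacement so a small archive can still fill its share.
    mix = (rng.choices(human_posts, k=n_human)
           + rng.choices(synthetic_posts, k=n_synthetic))
    rng.shuffle(mix)
    return mix


# Example: keep a quarter of the set human-written.
mix = build_training_mix(["h1", "h2"], ["s1"], human_fraction=0.25, size=8)
```

The point of fixing the human share up front is that it stays constant however much synthetic text gets generated later.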

Ghostalmedia@lemmy.world · 4 months ago

A model trained on jokes about bacon, narwhals, and rage comics.

FaceDeer@fedia.io · 4 months ago

By “old archives” I mean everything from 2022 and earlier.

BakerBagel@midwest.social · 4 months ago

But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem for decades on Reddit.

FaceDeer@fedia.io · 4 months ago

Existing AIs such as ChatGPT were trained in part on that data so obviously they’ve got ways to make it work. They filtered out some stuff, for example - the “glitch tokens” such as SolidGoldMagikarp were evidence of that.

Dr. Moose@lemmy.world · 4 months ago

This form of propaganda is my pet peeve. It’s not “your posts.” As soon as you put something out in public, you don’t get to have your cake and eat it too. It’s out there, you shared it. Don’t share it if you don’t want humanity to ingest and use it.

Dataprolet@lemmy.dbzer0.com · 4 months ago

You’re technically right, but nobody anticipated, and therefore nobody agreed to, their posts being used for training LLMs.

SparrowRanjitScaur@lemmy.world · 4 months ago

Public information is public information.

Dataprolet@lemmy.dbzer0.com · 4 months ago

Oh boy, do I have bad news for you. You ever heard of copyright?

SparrowRanjitScaur@lemmy.world · 4 months ago

Have you ever heard of fair use?

Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”