So they filled Reddit with bot-generated content, and now they’re likely selling the same stuff back to the company that generated most of it.
At what point can we call an AI inbred?
This is actually a thing. It’s called “Model Collapse”. You can read about it here.
“Model collapse” can largely be avoided by keeping old human data alongside new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
A model trained on jokes about bacon, narwhals, and rage comics.
By “old archives” I mean everything from 2022 and earlier.
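A rough sketch of what that could look like in Python, in case it helps. Everything here is an assumption for illustration: the record layout (dicts with `text` and `created_utc` keys), the `build_training_mix` name, and the `synthetic_fraction` knob are made up, and the cutoff is just "before 2023-01-01 UTC". It isn't how any actual lab builds its training set.

```python
import random
from datetime import datetime, timezone

# Assumed cutoff: "2022 and earlier" means created before 2023-01-01 UTC.
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp()


def build_training_mix(archive, synthetic, synthetic_fraction=0.5, seed=0):
    """Keep every pre-cutoff archive record and add a capped share of synthetic ones.

    Records are dicts with hypothetical keys "text" and "created_utc" (Unix seconds).
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Old human-era data is never dropped, only filtered by date.
    human = [r for r in archive if r["created_utc"] < CUTOFF]
    # Number of synthetic records needed to make up synthetic_fraction of the mix.
    n_synth = int(len(human) * synthetic_fraction / (1 - synthetic_fraction))
    rng = random.Random(seed)
    picked = rng.sample(synthetic, min(n_synth, len(synthetic)))
    mix = human + picked
    rng.shuffle(mix)
    return mix
```

Note that a date cutoff only excludes recent content. It does nothing about bot posts that were already in the archive before the cutoff.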
But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem on Reddit for more than a decade.
Existing AIs such as ChatGPT were trained in part on that data, so obviously they’ve got ways to make it work. They filtered out some stuff, for example: the “glitch tokens” such as SolidGoldMagikarp were evidence of that.