I’m curious to get all of your thoughts on this. It’s no secret that AI has grown exponentially over the last year; it feels like new models are being released almost every other day. That said, many of these models need a tremendous amount of data to train on. It’s also no secret that Reddit sells its users’ interactions to the highest bidder. That was part of the reason for the API limit changes that got many of us to move to the fediverse in the first place.

My question is: how does everyone feel about knowing that multi-billion dollar companies are scraping this instance and others, creating extra load on the servers for nothing more than their own profit?

What can be done to keep providing a free, open network to users while keeping out those who are only looking to profit from the data?

edit: fixed title typo

  • redrum@lemmy.ml
    4 hours ago

    Server admins could add to their policy that any AI scraping requires prior permission from the copyright holders of the content (i.e., the users) when the scraping is done to exploit the data for greed. Also, robots.txt could be used to forbid AI scrapers from crawling the HTML pages.
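As a rough illustration of the robots.txt idea above, here is a minimal sketch that checks how a well-behaved crawler would read such a file, using Python's standard `urllib.robotparser`. The rules are only an example: `GPTBot` and `CCBot` are two published crawler names, `example.org` is a placeholder, and the helper `is_allowed` is made up for this sketch. Keep in mind that robots.txt is advisory, so it only stops scrapers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption, not any instance's real file):
# block two known AI-related crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler honoring robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot/2.0", "Mozilla/5.0"):
        print(agent, is_allowed(agent, "https://example.org/post/1"))
```

A scraper that ignores the file or fakes its user agent is unaffected, which is why this would probably need to be paired with a written policy or server-side limits.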

    I don’t think restrictions should be added at the protocol level, but maybe some declarative tags would be fine:

    {
    "rich": "eat",
    "about-meta": "fck-genocidal-and-youth-suicidal-promoter-zuckenberg",
    "ai": "not-for-greed"
    }
    
    • Nicarlo@sh.itjust.worksOP
      2 hours ago

      I think this would be the only way. It would be interesting to know how much traffic or how many requests this instance gets, to see if it’s a real problem. Server admins could implement stricter rate limiting for non-members if it becomes an issue. They could likely even implement something that lets them see which of their members are making the most requests, to get some visibility. I don’t believe this is possible today from within the platform anyway.
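To make the rate limiting and visibility ideas above a bit more concrete, here is a minimal sketch of a token-bucket limiter that gives non-members a lower refill rate and keeps a per-client request count. This is not how Lemmy does it; the class name `RequestGate`, the rates, and the client identifiers are all assumptions chosen for illustration.

```python
import time
from collections import Counter
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float   # requests the client may still make right now
    updated: float  # clock reading when the bucket was last refilled


class RequestGate:
    """Token-bucket limiter with a stricter refill rate for non-members."""

    def __init__(self, member_rate=10.0, guest_rate=1.0, burst=5.0,
                 clock=time.monotonic):
        # Refill rates in requests per second (hypothetical values).
        self.rates = {True: member_rate, False: guest_rate}
        self.burst = burst    # maximum tokens a bucket can hold
        self.clock = clock    # injectable so the limiter can be tested
        self.buckets = {}     # client_id -> Bucket
        self.counts = Counter()  # client_id -> total requests seen

    def allow(self, client_id: str, is_member: bool) -> bool:
        """Record the request and report whether it is within the limit."""
        now = self.clock()
        self.counts[client_id] += 1
        bucket = self.buckets.setdefault(client_id, Bucket(self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated
        rate = self.rates[is_member]
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * rate)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False

    def top_requesters(self, n: int = 10):
        """Return the n clients with the most requests, busiest first."""
        return self.counts.most_common(n)
```

In this sketch a guest would be keyed by something like an IP address and a member by their account name, so `top_requesters()` would give admins the kind of visibility described above. A real deployment would also need to expire idle buckets and share state across server processes.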

      There are really two issues here:

      1. Whether users are aware, and OK with the fact, that their public conversations are certainly going to be picked up and used for future models.
      2. Whether the Lemmy instance admins are OK with potentially half of their traffic going to bots that hoard and scrape the data, causing additional load on the servers.

      Maybe @TheDude@sh.itjust.works would be open to sharing some insights on how many requests the instance receives per month and how many resources they take up.