General Discussion
AI companies are scraping websites to steal data so often they cost those sites & become in effect DDoS attacks
I've been seeing more and more tweets and stories about this. GenAI has already been harming the internet so much, with much more widespread deliberate misinformation, unintended misinformation pumped out by hallucinating chatbots, website traffic and revenue stolen for AI search summaries that don't give credit/links (or give only easily-overlooked credit) to the sites the info was stolen from, and what's rightly called "AI slop" from AI image generators.
But the data scraping is threatening the viability of websites more directly.
OpenAI, for instance, has been scraping at least one website hundreds of times per second.
From Edd Coates of Game UI Database (2020 Mashable article on the founding of his site: https://mashable.com/article/game-ui-database-website-edd-coates-interview), posting on LinkedIn yesterday:
https://www.linkedin.com/posts/edd-coates-57049241_you-may-have-noticed-the-game-ui-database-activity-7237110024118935553-kk7U?utm_source=share&utm_medium=member_desktop
-snip-
Turns out OpenAI has been scraping the site by spamming the front page hundreds of times per second. So not only are they stealing my data, they're effectively DDoSing me in the process. How is this behaviour allowed from a massive organization?? Disgusting.
Please give Jay all your thanks for coming to the rescue, as well as for giving the database such a stable and reliable home over the last four years. The site simply wouldn't exist without them!
-snip-
Lots of people asking for the IP addresses to block OpenAI's crawlers. Here's the thread from Jay with all the information you'll need!
Link to tweet
Anthropic has been doing the same sort of thing:
https://www.theverge.com/2024/7/25/24205943/anthropic-ai-web-crawler-claudebot-ifixit-scraping-training-data
"If any of those requests accessed our terms of service, they would have told you that use of our content expressly forbidden. But don't ask me, ask Claude!" said iFixit CEO Kyle Wiens on X, posting images that show Anthropic's chatbot acknowledging that iFixit's content was off limits. "You're not only taking our content without paying, you're tying up our devops resources. If you want to have a conversation about licensing our content for commercial use, we're right here."
"The rate of crawling was so high that it set off all our alarms and spun up our devops team," Wiens tells The Verge. "iFixit gets a lot of traffic. Being one of the internet's top sites makes us pretty familiar with web crawlers and bots. We can handle that load just fine, but this was an anomaly."
iFixit's Terms of Use policy states that reproducing, copying or distributing any content from the website is strictly prohibited without the express prior written permission from the company, with specific inclusion of training a machine learning or AI model. When Anthropic was questioned on this by 404 Media, however, the AI company linked back to an FAQ page that says its crawler can only be blocked via a robots.txt file extension.
-snip-
AI companies are constantly adding new crawlers with different names, which allows them to say they aren't ignoring sites' directions not to scrape their data, because those sites don't specify the names of the newest crawlers.
Anthropic did get enough negative publicity that they announced a change, saying they'd directed the newest scraper to respect robots.txt directives for their older scrapers - but to be blunt, there's no reason to trust any of the AI companies in their insane and illegal grabs for more training data. Some, like Perplexity AI, just turn over some scraping to third parties to do some of their rule-breaking and law-breaking for them and then try to pretend they're innocent.
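For anyone wondering how the crawler-name loophole works in practice, here is a rough sketch using Python's standard urllib.robotparser. The crawler names "OldBot" and "NewBot", the example.com URL, and the helper function allowed() are all made up for illustration; they are not any company's real user agents. It only shows what a rule-following parser would decide, and says nothing about whether a given crawler actually obeys robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one crawler by name only.
ROBOTS_BY_NAME = """\
User-agent: OldBot
Disallow: /
"""

# Hypothetical robots.txt with a catch-all rule for every crawler.
ROBOTS_CATCH_ALL = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url="https://example.com/"):
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named crawler is blocked...
    print(allowed(ROBOTS_BY_NAME, "OldBot"))    # False
    # ...but a crawler with a new name is not covered by that rule.
    print(allowed(ROBOTS_BY_NAME, "NewBot"))    # True
    # A catch-all rule covers the new name too.
    print(allowed(ROBOTS_CATCH_ALL, "NewBot"))  # False
```

A "User-agent: *" rule closes the naming gap on paper, but it also turns away search engines and every other well-behaved bot, and it still depends on the crawler choosing to honor robots.txt at all.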
Website owners pay for traffic as well as storage, and scraping at these levels can drive smaller websites out of business.
The AI companies aren't going to care about that, because their goal is to have all that data themselves, to profit from it.
As for how much they're hoping to profit eventually... This was tweeted by the CEO of Perplexity AI the other day:
Link to tweet