General Discussion

highplainsdem

(63,115 posts) Thu Sep 5, 2024, 11:29 AM

AI companies are scraping websites to steal data so often they cost those sites & become in effect DDoS attacks

I've been seeing more and more tweets and stories about this. GenAI has already been harming the internet so much, with much more widespread deliberate misinformation, unintended misinformation pumped out by hallucinating chatbots, website traffic and revenue stolen for AI search summaries that don't give credit/links (or give only easily-overlooked credit) to the sites the info was stolen from, and what's rightly called "AI slop" from AI image generators.

But the data scraping is threatening the viability of websites more directly.

OpenAI, for instance, has been scraping at least one website hundreds of times per second.

From Edd Coates of the Game UI Database (2020 Mashable article on the founding of his site: https://mashable.com/article/game-ui-database-website-edd-coates-interview), posting on LinkedIn yesterday:

https://www.linkedin.com/posts/edd-coates-57049241_you-may-have-noticed-the-game-ui-database-activity-7237110024118935553-kk7U?utm_source=share&utm_medium=member_desktop

You may have noticed the Game UI Database has been really laggy for the past few weeks. Well, good news! Thanks to server overlord Jay Xavier Peet, they were able to fix the issue by blocking a single IP.

-snip-

Turns out OpenAI has been scraping the site by spamming the front page hundreds of times per second. So not only are they stealing my data, they're effectively DDoSing me in the process. How is this behaviour allowed from a massive organization?? Disgusting.

Please give Jay all your thanks for coming to the rescue, as well as for giving the database such a stable and reliable home over the last four years. The site simply wouldn't exist without them!

-snip-

Lots of people asking for the IP addresses to block OpenAI's crawlers. Here's the thread from Jay with all the information you'll need!

Link to tweet

Anthropic has been doing the same sort of thing:

https://www.theverge.com/2024/7/25/24205943/anthropic-ai-web-crawler-claudebot-ifixit-scraping-training-data

The ClaudeBot web crawler that Anthropic uses to scrape training data for AI models like Claude has hammered iFixit’s website almost a million times in a 24-hour period, seemingly violating the repair company’s Terms of Use in the process.

“If any of those requests accessed our terms of service, they would have told you that use of our content expressly forbidden. But don’t ask me, ask Claude!” said iFixit CEO Kyle Wiens on X, posting images that show Anthropic’s chatbot acknowledging that iFixit’s content was off limits. “You’re not only taking our content without paying, you’re tying up our devops resources. If you want to have a conversation about licensing our content for commercial use, we’re right here.”

“The rate of crawling was so high that it set off all our alarms and spun up our devops team,” Wiens tells The Verge. “iFixit gets a lot of traffic. Being one of the internet’s top sites makes us pretty familiar with web crawlers and bots. We can handle that load just fine, but this was an anomaly.”

iFixit’s Terms of Use policy states that “reproducing, copying or distributing” any content from the website is “strictly prohibited without the express prior written permission” from the company, with specific inclusion of “training a machine learning or AI model.” When Anthropic was questioned on this by 404 Media, however, the AI company linked back to an FAQ page that says its crawler can only be blocked via a robots.txt file extension.

-snip-

AI companies are constantly adding new crawlers with different names, which allows them to say they aren't ignoring sites' directions not to scrape their data, because those sites don't specify the names of the newest crawlers.
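To make the naming loophole concrete, here's a rough sketch using Python's standard-library robots.txt parser. The crawler names OldCrawler and NewCrawler, the example.com URL, and the helper `allowed` are all made up for illustration; they are not any company's real user agents. It only shows that a rule written for one name says nothing about a differently named bot, while a wildcard rule covers both (for crawlers that actually honor robots.txt).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a single crawler by name.
NAMED_ONLY = """\
User-agent: OldCrawler
Disallow: /
"""

# Hypothetical robots.txt that blocks every crawler with a wildcard.
WILDCARD = """\
User-agent: *
Disallow: /
"""

URL = "https://example.com/page"  # placeholder URL, not a real target


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in (("named rule", NAMED_ONLY), ("wildcard rule", WILDCARD)):
        for bot in ("OldCrawler", "NewCrawler"):
            print(f"{label}: {bot} allowed = {allowed(rules, bot, URL)}")
```

Under the named rule, only the old name is refused and the renamed crawler slips through. The wildcard closes that gap, but it also turns away every other well-behaved bot, search engines included, so it isn't a free fix.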

Anthropic did get enough negative publicity that they announced a change, saying they'd directed the newest scraper to respect robots.txt directives for their older scrapers - but to be blunt, there's no reason to trust any of the AI companies in their insane and illegal grabs for more training data. Some, like Perplexity AI, just turn over some scraping to third parties to do some of their rule-breaking and law-breaking for them and then try to pretend they're innocent.

Website owners pay for traffic as well as storage, and scraping at these levels can drive smaller websites out of business.
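For a sense of scale, here's some back-of-the-envelope arithmetic as a small Python sketch. The 100 requests per second is just a low-end stand-in for "hundreds of times per second," and the 50 KB average response size is purely my assumption, not a figure from either site. What that traffic actually costs depends on the host's pricing, which I'm not guessing at here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(requests_per_second: float) -> float:
    """Total requests in a day at a steady per-second rate."""
    return requests_per_second * SECONDS_PER_DAY


def average_rps(total_requests: float, seconds: float = SECONDS_PER_DAY) -> float:
    """Average requests per second for a total spread over a period."""
    return total_requests / seconds


def transfer_gb_per_day(requests_per_second: float, avg_response_bytes: int) -> float:
    """Decimal gigabytes served per day (1 GB = 1e9 bytes)."""
    return requests_per_day(requests_per_second) * avg_response_bytes / 1e9


if __name__ == "__main__":
    # Assumed: 100 req/s as a floor for "hundreds of times per second".
    print(requests_per_day(100))
    # The "almost a million" requests in 24 hours reported for iFixit,
    # rounded up to one million for simplicity.
    print(round(average_rps(1_000_000), 1))
    # Assumed: 50,000-byte average response (hypothetical).
    print(transfer_gb_per_day(100, 50_000))
```

Even at that floor, a steady 100 requests per second works out to more than eight million requests a day, and a million requests in 24 hours averages out to roughly a dozen every second, around the clock.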

Which AI companies aren't going to care about, because their goal is to have all that data themselves, to profit from it.

As for how much they're hoping to profit eventually... This was tweeted by the CEO of Perplexity AI the other day:

Link to tweet

8 replies

bucolic_frolic

(55,847 posts)

1. AI is bigger thief than Cheato

Reply to highplainsdem (Original post)

Thu Sep 5, 2024, 11:33 AM

I saw campaign signs this week

34 Felonies!

Trump is a Weirdo!

These were officially printed somewhere.

highplainsdem

(63,115 posts)

5. Great campaign signs! As for AI being a thief - more tech bros than you'd normally expect are backing

Reply to bucolic_frolic (Reply #1)

Thu Sep 5, 2024, 02:55 PM

Trump, in large part because they think he's more likely to let them get away with this sort of thing.

Hugin

(38,002 posts)

2. Hopefully, not to overuse the word...

Reply to highplainsdem (Original post)

Thu Sep 5, 2024, 12:08 PM

Last edited Thu Sep 5, 2024, 03:43 PM - Edit history (1)

This is WEIRD and ridiculous.

It’s their crypto-mining, exploitative “own Allz the dataz” mindset shining through. As if incessantly scraping every scintilla of bits from the Internet of Internets will somehow magically yield a golden bullet which will make their dreams come true and make Generative AI something more than a huge waste of time and resources. Of course, this is all being driven by the pride of those who have oversold themselves on a pipe dream.

Let’s pretend for one second that I believed this fever was leading to something useful and desirable for humankind. A better way to proceed would be to let it grow organically. Have the crowd of humans slowly direct its growth toward something they may find benefit from. But, alas, that would take time! Also, who would profit? Which is all they are really interested in.

The architects of this craziness need to take a long weekend off, hop on their retro beanbags, and read “Notes on the Synthesis of Form” - Christopher Alexander. (I’m sure they’ve stolen a copy already.) Let it percolate for a few days on slow walks in nature. Then, calm the fuck down.

https://en.wikipedia.org/wiki/Notes_on_the_Synthesis_of_Form

I know this is unlikely to happen. The draw of FOMO is far too strong.

highplainsdem

(63,115 posts)

6. Thanks, Hugin! I'd never heard of that book or Christopher Alexander, but after reading that Wikipedia

Reply to Hugin (Reply #2)

Thu Sep 5, 2024, 03:22 PM

article I suspected he probably had great appeal to software engineer Grady Booch, whom I follow on Twitter for his comments on AI. So I searched...

Link to tweet

Saw tweets with others talking to him who mentioned Alexander, and also saw this:

Link to tweet

The architects of this craziness need to take a long weekend off, hop on their retro beanbags, and read “Notes on the Synthesis of Form” - Christopher Alexander. (I’m sure they’ve stolen a copy already.) Let it percolate for a few days on slow walks in nature. Then, calm the fuck down.

Love your suggestion there! But you're right that greed and FOMO make it very unlikely to happen. And too many of the AI bros believe that if they can just develop superintelligent AI, ASI, it will help humans merge with machines and become godlike in power, and immortal.

Hugin

(38,002 posts)

8. I am happy you found value in it...

Reply to highplainsdem (Reply #6)

Thu Sep 5, 2024, 03:41 PM

I have long believed that this is the special sauce that makes real innovation appear to be magic.

It’s organic and bottom-up. Also, difficult to monetize.

cbabe

(6,822 posts)

3. Fantasia - The Sorcerer's Apprentice (Part 3)

Reply to highplainsdem (Original post)

Thu Sep 5, 2024, 01:00 PM

https://m.

highplainsdem

(63,115 posts)

7. Perfect!

Reply to cbabe (Reply #3)

Thu Sep 5, 2024, 03:25 PM

highplainsdem

(63,115 posts)

4. Tweet about this from artist Reid Southen:

Reply to highplainsdem (Original post)

Thu Sep 5, 2024, 02:51 PM

Link to tweet

Reid Southen
@Rahll

This was 100% AI bot scraping, this is becoming an unbelievably common occurrence. Not only are these companies stealing our work, which costs us, but they're driving up everyone's server bills as well. We're literally paying for them to take our work. This needs to be illegal.

rileyb3d
@rileyb3d
I had to shut down http://rileyb3d.com because my AWS monthly budget was reached in like an hour. I don’t know who, but it’s not general/normal traffic.

I wish I was more knowledgeable at tracking and blocking this stuff. I’m looking for a secure way to get it back.
