General Discussion

blogslut

(39,191 posts) Sun Apr 23, 2017, 10:57 AM

Internet Archive to ignore robots.txt directives

https://boingboing.net/2017/04/22/internet-archive-to-ignore-rob.html

Robots (or spiders, or crawlers) are little computer programs that search engines use to scan and index websites. Robots.txt is a little file placed on webservers to tell search engines what they should and shouldn't index. The Internet Archive isn't a search engine, but has historically obeyed exclusion requests from robots.txt files. But it's changing its mind, because robots.txt is almost always crafted with search engines in mind and rarely reflects the intentions of domain owners when it comes to archiving.
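For anyone unsure what "obeying robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the example.com URLs are made up for illustration; nothing here is taken from the Internet Archive's own crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind written with search engines in mind:
# it hides a private area and the large versions of images.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /images/large/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/notes.html",
        "https://example.com/images/large/photo.jpg",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A crawler that honors the file skips every URL for which the check fails. Nothing enforces this: compliance is entirely up to whoever runs the crawler.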

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
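The "disappeared site" problem described above can be sketched in a few lines. This assumes an archive that applies a domain's current robots.txt to all of its stored snapshots, as the quote describes. The function name, the `ExampleArchiveBot` user agent, and the URLs are hypothetical and do not represent the Wayback Machine's actual code.

```python
from urllib.robotparser import RobotFileParser

# Typical blanket rule a domain parker might publish (illustrative).
PARKED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def visible_snapshots(robots_txt, user_agent, archived_urls):
    """Return the archived URLs that the *current* robots.txt still permits.

    This models a retroactive policy: rules published today decide
    whether snapshots captured years ago may be shown.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in archived_urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    history = [
        "https://example.com/",
        "https://example.com/about.html",
        "https://example.com/2009/news.html",
    ]
    # With the parked-domain rule, the whole history is hidden.
    print(visible_snapshots(PARKED_ROBOTS_TXT, "ExampleArchiveBot", history))
    # With an empty robots.txt, everything stays visible.
    print(visible_snapshots("", "ExampleArchiveBot", history))
```

The single `Disallow: /` line is aimed at search engines, but under a retroactive policy it also hides every snapshot of the site that existed before the domain was parked.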

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

9 replies

Internet Archive to ignore robots.txt directives blogslut Apr 2017 OP

I disagree with this. If you own a web site, it should be your decision. 50 Shades Of Blue Apr 2017 #1

Their description of what they're doing makes perfect sense. PSPS Apr 2017 #2

I disagree. I think they should respect robots.txt files. 50 Shades Of Blue Apr 2017 #3

I don't. Things on the net shouldn't be allowed to disappear down the memory hole emulatorloo Apr 2017 #6

If you do not want content hosted on a domain you own to MineralMan Apr 2017 #7

I disagree with you. Your argument is not logical. bitterross Apr 2017 #9

I've got a "book" (80 pp) on Internet Archives. I don't understand the article, but am interested. UTUSN Apr 2017 #4

robots.txt is a very, very weak tool. MineralMan Apr 2017 #5

I'm unexpectedly ambivalent about this. hunter Apr 2017 #8