
erronis

(24,938 posts)
Wed Jun 24, 2026, 07:39 PM

Web archive lets you easily search millions of government documents

https://phys.org/news/2026-06-web-archive-easily-millions-documents.html
by Stefan Milne, University of Washington

The site looks to be operational, although they say it is still indexing content:

https://govscape.net/

At the end of every presidential term, the End of Term Web Archive preserves that administration's web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush's second term, and runs through 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like "FAFSA," or use a semantic search, which finds documents on a topic even if the exact search terms don't appear on the page. A visual search option lets them query for qualities like "redacted documents," "aerial photographs" or "pie charts." The system can currently search the 10 million PDFs hosted online during Donald Trump's first term; the team plans to expand it to the whole archive.

Because researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs costs less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers $1 to parse around 100 pages with AI.
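For a rough sense of what those figures imply, here is a back-of-the-envelope calculation in Python. It uses only the numbers quoted in the article; the variable and function names are made up for illustration, and the derived totals are my own arithmetic, not figures reported by the researchers. Since the article says "less than $1,500," the page count below is a ceiling, and it assumes the $1 per 47,000 pages rate applies to the whole run.

```python
# Figures quoted in the article (names are illustrative, not from the source).
TOTAL_COST_USD = 1_500               # upper bound: "less than $1,500"
PAGES_PER_DOLLAR = 47_000            # GovScape processing rate
COMPARISON_PAGES_PER_DOLLAR = 100    # the article's consumer comparison rate
PDF_COUNT = 10_000_000               # PDFs currently searchable


def implied_pages(total_cost_usd: float, pages_per_dollar: float) -> float:
    """Pages that a given budget covers at a given pages-per-dollar rate."""
    return total_cost_usd * pages_per_dollar


def cost_at_rate(pages: float, pages_per_dollar: float) -> float:
    """Dollar cost of processing `pages` at a given pages-per-dollar rate."""
    return pages / pages_per_dollar


# Ceiling on the number of pages processed, given the quoted cost and rate.
pages = implied_pages(TOTAL_COST_USD, PAGES_PER_DOLLAR)

# What the same page count would cost at the comparison rate.
comparison_cost = cost_at_rate(pages, COMPARISON_PAGES_PER_DOLLAR)

# How many times cheaper per page the quoted rate is.
ratio = PAGES_PER_DOLLAR / COMPARISON_PAGES_PER_DOLLAR

# Rough average document length implied by the page ceiling.
pages_per_pdf = pages / PDF_COUNT

print(f"Pages (ceiling):         {pages:,.0f}")
print(f"Cost at comparison rate: ${comparison_cost:,.0f}")
print(f"Cost ratio:              {ratio:,.0f}x")
print(f"Average pages per PDF:   {pages_per_pdf:.2f}")
```

Taken at face value, the quoted rates work out to at most about 70.5 million pages, roughly 7 pages per PDF, and a per-page cost about 470 times lower than the comparison rate.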

The team will present its research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego. The work is published on the arXiv preprint server.

. . .