General Discussion
In reply to the discussion: What did Edward Snowden get wrong? Everything

Pholus:

Here's the size of the problem. I will use the numbers from the guy who runs the "Internet Archive," who has a bit of knowledge about storing lots of data. He starts by estimating 315 million US citizens at 300 minutes of calls per month each.
http://blog.archive.org/2013/06/15/cost-to-store-all-us-phonecalls-made-in-a-year-in-cloud-storage-so-it-could-be-datamined/
In raw data, he arrives at 272 petabytes of talk data per year in the US at a very generous compression rate. He even says it is reasonable to store a year of this data "in the cloud" for about $30 million, in a server room less than 10% the size of the Utah Data Center.
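The storage figure is easy to sanity-check. Here's a quick sketch of the arithmetic; the ~4 KB/s compressed-audio rate is my own assumption, picked because it reproduces the 272 PB number, and may not be exactly the rate the blog post used.

```python
# Back-of-envelope check of the 272 PB/year storage estimate.
# Assumption: ~4 KB/s (~32 kbps) for compressed telephone audio --
# my figure, not necessarily the blog's exact rate.

citizens = 315_000_000
minutes_per_month = 300
bytes_per_second = 4_000  # assumed compressed voice rate

minutes_per_year = citizens * minutes_per_month * 12
petabytes = minutes_per_year * 60 * bytes_per_second / 1e15
print(f"{petabytes:.0f} PB per year")  # -> 272 PB per year
```

So 272 PB corresponds to roughly 32 kbps of compressed voice, which is indeed a generous (i.e., high-quality) rate for telephone audio.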
What could you do with that 272 petabytes to make it usable? All of Snowden's material implies keyword-based searches. DARPA has been VERY interested in speech-to-text since way back in 2002, when it was part of Bush's TIA, so it is probably reasonable to assume that is how they work. Let's presume analysts can run keyword searches, with the option to go back and review the original audio.
I have no idea how speech-to-text works internally, but you have to process the audio stream either way. Just for convenience, I'll claim it is about as computationally hard as turning an AVI file into an MP3. It can't be much harder; my idiot cell phone can do a tolerable job on a tiny portable processor in real time.
How big a computer would you need to process 272 petabytes of data in a year? Tom's Hardware benchmarks computers in part by converting a 178 MB WAV file into an MP3. A mid-range modern processor does it in about a minute and a half. A single one of these computers could convert all 272 petabytes in roughly 2.5 billion minutes. There are about half a million minutes in a year, so around 5,000 processors could do the job in one year.
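That processor count checks out. Here's the same estimate worked through, using the benchmark numbers from the post (178 MB WAV, ~1.5 minutes per encode on one mid-range CPU):

```python
# Rough check of the cluster-sizing estimate from the post.

total_bytes = 272e15       # one year of US call audio, per the estimate
file_bytes = 178e6         # Tom's Hardware benchmark WAV size
minutes_per_file = 1.5     # benchmark encode time on one mid-range CPU
minutes_per_year = 365 * 24 * 60  # 525,600

files = total_bytes / file_bytes
cpu_minutes = files * minutes_per_file
processors = cpu_minutes / minutes_per_year
print(f"{cpu_minutes/1e9:.1f} billion CPU-minutes, ~{processors:.0f} processors")
```

The exact figure lands a bit under the post's rounded numbers (about 2.3 billion CPU-minutes and ~4,400 processors), but it's the same ballpark: a few thousand cores, i.e., well within reach of a single mid-sized cluster.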
That's about half the size of one of the "top 500" clusters in existence. Surely the NSA can afford ONE of the 500 most powerful computers in the world, right? And a hell of a bunch of people each individually smarter than me to code for it and manage it.
But can they handle that much data?
The NSA has just admitted in its white paper that it "touches" (read: hoovers up) 1.6% of the internet's 1,826 petabytes of daily traffic, or about 29 petabytes per DAY. Phone calls would seem to be easier than that: a year of call audio at 272 petabytes comes to just a bit less than 745 terabytes per day. Phone is easier than the internet stuff.
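The comparison is stark when you put the two daily volumes side by side:

```python
# Daily volumes: what the NSA admits it "touches" vs. what storing
# all US phone audio would require, using the figures above.

internet_pb_per_day = 1826
touch_fraction = 0.016
touched_pb = internet_pb_per_day * touch_fraction   # ~29 PB/day

phone_pb_per_year = 272
phone_tb_per_day = phone_pb_per_year * 1000 / 365   # ~745 TB/day

print(f"touched: {touched_pb:.0f} PB/day, phone: {phone_tb_per_day:.0f} TB/day")
```

In other words, all US phone audio would amount to well under a fortieth of what they already claim to touch daily.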
Just remember, to find the needle in the haystack, you need a haystack. What was General Alexander's nickname again? Oh yeah, "Collect it all." Collection is easy, analysis is hard. That's why this system has no high profile successes -- they're still learning to use it.