By Brad Knowles
Part of the mutterings of the Global Village Idiot
Brad is a brilliant, wonderful guy, and as the former Sr. Internet Mail Systems Administrator of America Online, he knows a thing or two about scaling really large data feeds. I pulled this off a mailing list that we both participate in. My guess is that the original question from a pointy-haired manager was "Well, why couldn't we filter the entire newsfeed?"
To give you an idea of the scale of the problem we're talking about, let's do a little math. An absolute full feed these days is averaging around 83.2GB/day, according to http://transit.us-va.remarq.com/feed-size/.
This is 85,196.8 MB/day, 87,241,523.2 KB/day, and 89,335,319,756.8 bytes per day. This works out to about 3.5 GB/hour, 59.2 MB/minute, and 1,009.7 KB/second (7.9 Mbits/sec, or on average completely filling roughly six T-1s or five E-1s, after framing and other losses). Now, if a single piece of paper has sixty-six lines on it at eighty columns per line (a traditional measure of how much information can be stored on a typed page), this would be 5280 bytes, or about 5.2 KB.
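For anyone who wants to check the arithmetic, the conversions can be reproduced with a few lines of Python. This is only a sketch: the function name `feed_rates` and the dictionary keys are made up for illustration, and it assumes 1024-based units (1 GB = 1024 MB, and so on), computing every figure from the unrounded 83.2 GB/day total.

```python
# Sketch: convert a daily feed volume into smaller units and average rates.
# Assumption: binary (1024-based) units throughout.
# All names here are hypothetical and exist only for this illustration.

FEED_GB_PER_DAY = 83.2      # full-feed average quoted in the text
SECONDS_PER_DAY = 86_400
PAGE_BYTES = 66 * 80        # one typed page: 66 lines of 80 columns


def feed_rates(gb_per_day: float) -> dict:
    """Return the daily volume in several units plus average transfer rates."""
    mb_per_day = gb_per_day * 1024
    kb_per_day = mb_per_day * 1024
    bytes_per_day = kb_per_day * 1024
    kb_per_second = kb_per_day / SECONDS_PER_DAY
    return {
        "mb_per_day": mb_per_day,
        "kb_per_day": kb_per_day,
        "bytes_per_day": bytes_per_day,
        "gb_per_hour": gb_per_day / 24,
        "mb_per_minute": mb_per_day / (24 * 60),
        "kb_per_second": kb_per_second,
        # 8 bits per byte, 1024 Kbits per Mbit (same binary convention)
        "mbits_per_second": kb_per_second * 8 / 1024,
    }


if __name__ == "__main__":
    for name, value in feed_rates(FEED_GB_PER_DAY).items():
        print(f"{name:>18}: {value:,.1f}")
    print(f"{'page_bytes':>18}: {PAGE_BYTES:,}")
```

Swapping in a different value for `FEED_GB_PER_DAY` reruns the whole chain for any other feed size.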
At this rate, a full feed would be 16,919,568.1 pages per day. At 500 pages per ream, this would be 33,839.1 reams per day. At twenty reams per box, this would be 1,692.0 boxes per day. If each box is about nine inches (22.86 cm) tall and they were placed one on top of the other, this would be 15,227.6 inches (38,678.1 cm), or 1,269.0 feet (386.8 meters) worth of paper to scan through, on a daily basis.
If anyone gives you any hassle about this issue, I'd suggest plunking 1,692 boxes of paper on their desk and asking them to scan through all that in just a single day (noting that they have to do an average of about 196 sheets per second in order to keep up ;-).
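The paper comparison is easy to recompute as well. The following is a minimal sketch under the same assumptions as the text (5280-byte pages, 500 pages per ream, twenty reams per box, nine-inch boxes, 1024-based gigabytes); the function name `paper_equivalent` and its keys are hypothetical.

```python
# Sketch: express a daily byte count as typed pages, reams, boxes, and a stack.
# Assumptions: page, ream, and box sizes as stated in the text; 1024-based GB.
# All names are hypothetical and exist only for this illustration.

BYTES_PER_DAY = 83.2 * 1024**3   # 83.2 GB/day
PAGE_BYTES = 66 * 80             # 66 lines x 80 columns = 5280 bytes
PAGES_PER_REAM = 500
REAMS_PER_BOX = 20
BOX_HEIGHT_INCHES = 9
CM_PER_INCH = 2.54
SECONDS_PER_DAY = 86_400


def paper_equivalent(bytes_per_day: float) -> dict:
    """Return the paper volume and stack height for one day of traffic."""
    pages = bytes_per_day / PAGE_BYTES
    reams = pages / PAGES_PER_REAM
    boxes = reams / REAMS_PER_BOX
    inches = boxes * BOX_HEIGHT_INCHES
    return {
        "pages_per_day": pages,
        "reams_per_day": reams,
        "boxes_per_day": boxes,
        "stack_inches": inches,
        "stack_feet": inches / 12,
        "stack_meters": inches * CM_PER_INCH / 100,
        "sheets_per_second": pages / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in paper_equivalent(BYTES_PER_DAY).items():
        print(f"{name:>18}: {value:,.1f}")
```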
Another thing to consider -- most of these calculations were correct as of the time I originally wrote that message. However, now (March 26, 2002) an absolute full binary feed is running more like 350GB/day, not 83GB/day (see http://statistix2.xs4all.nl/diablo/ and http://doema.wirehub.nl/news/bambam/diablo.html for examples). Try multiplying the above numbers by 4.2 to see what the results would be like now.
If you really want to be scared, take a look at the numbers at http://doema.wirehub.nl/news/allfeeders/. Can you imagine receiving over 600GB/day of traffic? Sending over 1TB of traffic per day? Peak traffic loads of more than 30GB/hour (68.266Mbits/sec) incoming and 50GB/hour (113.777Mbits/sec) outgoing? They do....
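The same kind of back-of-the-envelope check works for the newer figures. This is a rough sketch, again assuming 1024-based units; the helper names `scale_factor` and `gb_per_hour_to_mbits_per_sec` are hypothetical, and the 1,692 boxes/day baseline is carried over from the earlier estimate.

```python
# Sketch: scale the 1999-era estimate to the 2002 feed size, and convert
# peak hourly loads to Mbits/sec.
# Assumption: 1024 MB per GB and 8 bits per byte (binary convention).
# All names are hypothetical and exist only for this illustration.

OLD_FEED_GB_PER_DAY = 83.2    # figure from the original message
NEW_FEED_GB_PER_DAY = 350.0   # March 2002 figure quoted in the text
OLD_BOXES_PER_DAY = 1692      # boxes of paper per day from the earlier estimate


def scale_factor(old_gb_per_day: float, new_gb_per_day: float) -> float:
    """Return how many times larger the new feed is than the old one."""
    return new_gb_per_day / old_gb_per_day


def gb_per_hour_to_mbits_per_sec(gb_per_hour: float) -> float:
    """Convert a GB/hour load into an average Mbits/sec rate."""
    return gb_per_hour * 1024 * 8 / 3600


if __name__ == "__main__":
    factor = scale_factor(OLD_FEED_GB_PER_DAY, NEW_FEED_GB_PER_DAY)
    print(f"scale factor: {factor:.1f}")
    print(f"boxes of paper per day at that rate: {OLD_BOXES_PER_DAY * factor:,.0f}")
    for peak_gb_per_hour in (30, 50):
        rate = gb_per_hour_to_mbits_per_sec(peak_gb_per_hour)
        print(f"{peak_gb_per_hour} GB/hour = {rate:.3f} Mbits/sec")
```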
Since there aren't any artificially intelligent programs on this planet (that I know of) that can look at an arbitrary picture and determine the age of all of the people portrayed and the legality of their activities (ignoring all the technical details of determining just what a picture is and how it is recognized from all the other binary file types), I think it's patently obvious that automatically taking corrective action against illegal content is simply an impossible task.
You can't even automatically recognize arbitrary content, much less determine its legality or illegality, regardless of how much time you have to process the document. And you certainly can't begin to do anything like this at these kinds of volumes.
I believe that the policy of eliminating newsgroups as you are notified that their purpose is to traffic in illegal content, or where the primary content being made available is illegal, is the only reasonable course of action we could possibly take.
Heck, the US NSA, CIA and the British GCHQ have a hard enough time intercepting and automatically cataloging all the text and automated transcripts of voice conversations with their ECHELON system (see http://www.wired.com/news/politics/0,1283,32770,00.html), and they've got virtually unlimited billions of dollars to throw at the problem. I don't see how anybody could reasonably expect us to do far more, with far less money.
These are my opinions -- not to be taken as official Skynet policy
Copyright © 1999 Ken Mayer