How much information?

I love this time of year. It gives me a chance to get through some of the books I've been meaning to read and some podcasts I've been eager to listen to.

One of these books is The Social Life of Information, originally published at the end of the last dotcom boom by two PARC researchers, John Seely Brown and Paul Duguid (Slashdot review here).

In short, the book argues:

"that the gap between digerati hype and end-user gloom is largely due to the "tunnel vision" that information-driven technologies breed. We've become so focused on where we think we ought to be that we often fail to see where we're really going. We need to look beyond our obsession with information and individuals to include the critical social networks of which these are always a part. The Social Life of Information shows how a better understanding of the contribution that communities, organizations, and institutions make to learning, working, and innovating can lead to the richest possible use of technology in our work and everyday lives."

'How much new information is created each year?'

The book has been updated with a preface (to account for the dotcom bust) quoting a data point I looked up. The quote is:

"Since we wrote the book, a study has tried to quantify the production of information. The figures are daunting. Digital technologies currently produce between one and two exabytes per year.

...Storage does not correlate with significance, nor volume with value. Standing atop gigabytes, terabytes, and even exabytes of information will not necessarily help us see further. It may only put our heads in the clouds."

The book's preface was written in 2002. It quotes a study by UC Berkeley (sponsored by Microsoft Research, Intel, HP and EMC) called How Much Information, which is now in its 2003 edition and has actually upped its estimates in that update.

The answer to the question 'How much new information is created each year?' in the 2003 study was about 2 to 3 exabytes in 1999 and about 5 exabytes in 2002 (92% of which was stored on magnetic media, mostly hard disks). This equates to around 800 MB for each of the 6.3 billion of us, and is a rough doubling of the rate of production in three years. In case you're wondering, an exabyte is 1,000,000,000,000,000,000 bytes, or 10^18 bytes. Counted in binary units it is slightly larger: there are 1,024 petabytes, or 1,073,741,824 gigabytes, in an exabyte. To give you an idea of what this means, five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress book collections (see table below).
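The unit arithmetic in that paragraph mixes decimal and binary prefixes, so it is worth a quick check. Here is a minimal Python sketch, assuming the study's rounded figures of 5 exabytes and 6.3 billion people. The constant and function names are my own, for illustration only, and are not taken from the study.

```python
# Rough check of the exabyte arithmetic. All names are illustrative.
DECIMAL_EXABYTE = 10**18   # bytes, SI definition
BINARY_EXABYTE = 2**60     # bytes, the "1,024 petabytes" definition


def per_person_megabytes(total_exabytes, population, exabyte=DECIMAL_EXABYTE):
    """Return bytes per person, expressed in decimal megabytes (10**6 bytes)."""
    return total_exabytes * exabyte / population / 10**6


# Binary gigabytes (2**30 bytes) in one binary exabyte.
gigabytes_per_exabyte = BINARY_EXABYTE // 2**30

# Assumed inputs: 5 exabytes produced in 2002, world population of 6.3 billion.
mb_each = per_person_megabytes(5, 6.3e9)

print(gigabytes_per_exabyte)   # 1073741824
print(round(mb_each))          # 794
```

With the decimal definition, the per-person figure lands just under the "around 800 MB" quoted from the study.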

How much of all this information is online? Very little, it turns out. There are 1,048,576 terabytes in an exabyte, so the total amount of information produced in 2002 was 5,242,880 terabytes. The surface web (static pages) contained about 170 terabytes of information in 2002, about 0.003% of the total amount of information produced that year. Or to put it another way, roughly 31,000 times more information was created in 2002 than was stored on the static web.

The high-end estimate of the amount of information held within the deep web (database-driven) in 2002 is 92,000 terabytes of information. This represents around 1.75% of the total amount of information produced in 2002. Small fry. To contrast this, just over 8% of the world's information created in 2002 was generated by email (not including spam or marketing) - around 440,000 terabytes - equivalent to over four times the amount of 'deep' web content that existed that same year.
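The percentages for the static web, the deep web and email can be reproduced from the terabyte figures quoted above. This is a rough Python sketch: the numbers are the ones in this post, while the variable and function names are hypothetical ones I picked for the example.

```python
# Shares of 2002 information output, using the figures quoted in this post.
TB_PER_EXABYTE = 1024 * 1024       # binary convention: 1,048,576 TB per exabyte
total_tb = 5 * TB_PER_EXABYTE      # 5,242,880 TB produced in 2002

# Terabyte estimates quoted from the study (deep web is the high-end estimate).
sources_tb = {
    "surface web": 170,
    "deep web": 92_000,
    "email": 440_000,
}


def share_percent(part_tb, whole_tb=total_tb):
    """Return part_tb as a percentage of whole_tb."""
    return 100 * part_tb / whole_tb


for name, tb in sources_tb.items():
    print(f"{name}: {share_percent(tb):.3f}% of 2002 output")

# How many times larger total output is than the static web.
web_ratio = total_tb / sources_tb["surface web"]
# How many times larger email is than the deep web.
email_vs_deep = sources_tb["email"] / sources_tb["deep web"]

print(round(web_ratio))
print(round(email_vs_deep, 1))
```

Computed from the unrounded terabyte figures, the total-to-static-web ratio comes out near 31,000, and email works out to a little under five times the deep web estimate.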

 

Future WWW - useful storage network of all the world's information

As I read the book and learn more about these figures, two things strike me. The first is that the web has a long, long way to go before it truly becomes the storage network of all the world's information. We are only scratching the surface in this regard. Consider broadcasting, for example: of the 31 million hours of original TV programming and 70 million hours of radio created in 2002, only a tiny fraction has ever seen the light of the net. User-generated content does even worse as a proportion of what makes it online, despite the recent progress. But even if we could get all this online, we would still need to make it useful - it is one thing to publish the world's video and audio content online, it is quite another to make it searchable and relevant.

The second thing I'm reminded of, and this is the book's central area of exploration, is more of a question: we want more information available to us, but as we achieve our goals, how are we going to deal with the ever-increasing amount of information? As Seely Brown and Duguid put it in their first chapter:

"Despite the cheers, however, for many people famine has quickly turned to glut. Concern about access to information has given way to concern about coping with the amounts to which we do have access. The Internet is rightly championed as a major information resource. Yet a little time in the nether regions of the Web can make you feel like the SETI researchers at the University of California, Berkeley, searching through an unstoppable flood of meaningless information from outer space for signs of intelligent life."

...Faced by cheery enthusiasts, many less optimistic people resemble the poor swimmer in Stevie Smith's poem, lamenting that

I was much too far out all my life
And not waving, but drowning.

Stevie Smith, "Not Waving But Drowning," from Collected Poems of Stevie Smith. Copyright © 1972 by Stevie Smith.
Yet still raw information by the quadrillion seems to fascinate."

--

Talking of the fascination with large amounts of information, here's a table to ponder from the study:

 

Tags: Attention, social software, Web/Tech, information