Data from Distributed Nodes

The other day I mentioned a presentation we were given by a scientist from Johns Hopkins University. He said they had multiple petabyte-scale projects under way. Last night, as I was flying to Tucson, Arizona, I read several articles in “Communications of the ACM” magazine that also discussed massive data stores. The interesting thread in both is the source of that data. Keep in mind that a petabyte is an enormous amount of storage: a quadrillion bytes. If you do the math, it would be nearly impossible for any group of users in a company to type in that much data by hand. So where does all that input come from?
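To put "do the math" in concrete terms, here is a rough back-of-the-envelope sketch. The figures are my own assumptions, not measurements: 1,000 users, each typing 200 characters (bytes) per minute, around the clock with no breaks.

```python
PETABYTE = 10**15  # one quadrillion bytes (decimal petabyte)

def years_to_type(total_bytes, users, bytes_per_minute_per_user):
    """Years of nonstop typing needed for `users` people to enter `total_bytes`."""
    minutes = total_bytes / (users * bytes_per_minute_per_user)
    return minutes / (60 * 24 * 365)  # minutes in a 365-day year

# Assumed figures: 1,000 users, 200 bytes per minute each, 24 hours a day.
years = years_to_type(PETABYTE, users=1000, bytes_per_minute_per_user=200)
print(f"{years:,.0f} years")
```

Under those assumptions the answer comes out at roughly 9,500 years, which is why hand-entered data is not a plausible source for stores of this size.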

It turns out that in most of these examples the data comes from inexpensive sensors: small devices that sense the state of an object and translate it into input for the database system. The Large Hadron Collider, for instance, will generate multiple petabytes each year in this fashion.

So what does that mean for you as a data professional? Well, for one thing, you need to plan strategies for much higher write volumes. Table isolation, dedicated network connections, locking changes and even ETL changes are all important to understand in this environment. But the biggest challenge is storing and backing up that much data. Or do you back it up at all? Once it’s in and processed, do you need the detail any more? All of these questions have a bearing on what you do, and on what we do to help you visualize and manage it.
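As one small illustration of a write-oriented strategy, the sketch below buffers sensor readings and writes them in a single transaction rather than one commit per row. It uses Python's built-in sqlite3 module purely as a stand-in for whatever database you run; the `sensor_reading` table, its columns, and the sample data are all hypothetical.

```python
import sqlite3

def write_batch(conn, readings):
    """Insert a batch of (sensor_id, ts, value) rows in a single transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO sensor_reading (sensor_id, ts, value) VALUES (?, ?, ?)",
            readings,
        )

# In-memory database used only for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_reading (sensor_id INTEGER, ts REAL, value REAL)")

# Hypothetical readings: 10 sensors, 1,000 samples each.
batch = [(s, float(t), 20.0 + 0.01 * t) for s in range(10) for t in range(1000)]
write_batch(conn, batch)

row_count = conn.execute("SELECT COUNT(*) FROM sensor_reading").fetchone()[0]
print(row_count)
```

Grouping rows this way generally cuts per-row commit and locking overhead, though the right batch size and isolation settings depend on your own system and workload.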