The Petabyte Problem: Scrubbing, Curating and Publishing Big Data

Carol Minton Morris has a good entry in the HatCheck Newsletter on Alex Szalay's keynote, “Scientific Publishing in the Era of Petabyte Data,” at JCDL on June 19, 2008. I always enjoy listening to Alex and hearing his perspective, especially since he gets his hands dirty with the eScience work and has put together a really good team. And of course there is always the fact that he plays lead guitar :-).


He suggests that there is a science project pyramid: single labs at the base, multi-campus collaborations in the middle, and international consortia on top. Often a scientific discipline will recognize the need for a major “giga” initiative, such as supercomputing research that is highly collaborative and distributed. The output from these efforts at every scale contains:

–Literature

–Derived and re-combined data

–Raw data

Szalay would like to see a continuous feedback loop among these three outputs, with data and analysis continually updating one another.

To answer the question, “How can you publish data so that others might recreate your results in 100 years?” he referred to Gray’s laws of Data Engineering: scientific computing revolves around data; scale out the solution for analysis; take the analysis to the data; start with the 20 queries; and go from working to working.
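To make the “take the analysis to the data” law concrete, here is a minimal sketch of the idea; it is my own illustration rather than anything from the talk, and the table and column names (photoobj, ra, dec, r_mag) are hypothetical, loosely modeled on a sky-survey catalog. Python's built-in sqlite3 stands in for a large archive database.

```python
# Sketch of Gray's "take the analysis to the data" principle.
# The schema below is a made-up stand-in for a survey catalog.
import sqlite3
import random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photoobj (ra REAL, dec REAL, r_mag REAL)")
conn.executemany(
    "INSERT INTO photoobj VALUES (?, ?, ?)",
    [(random.uniform(0, 360), random.uniform(-90, 90), random.uniform(14, 22))
     for _ in range(100_000)],
)

# Anti-pattern: ship the data to the analysis.
# Every row crosses the interface before any computation happens.
rows = conn.execute("SELECT r_mag FROM photoobj").fetchall()
mean_client_side = sum(m for (m,) in rows) / len(rows)

# Gray's law: ship the analysis to the data.
# Only the aggregate (a single number) crosses the interface.
(mean_server_side,) = conn.execute(
    "SELECT AVG(r_mag) FROM photoobj"
).fetchone()

print(mean_client_side, mean_server_side)
```

The two computations give the same answer; the difference is what moves. At petabyte scale, shipping raw rows to the analyst stops being an option, which is why the query has to run where the data lives.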


Cross posted from Dan Fay's Blog (https://blogs.msdn.com/dan_fay)