Show Me The Data

One of the things I have heard from students over the years is that the data they are asked to process is irrelevant, or of so little consequence that they don’t see the point of writing a program to process it. And really, how difficult is it to sort 10 items? What is the use of searching data created by a random number generator? Students learn best when the projects they give a computer actually require computer speeds to process. So the obvious answer is to use real data: data that comes in volume and that is interesting or useful to students. Fortunately, there are sources of large data sets that are often free. Free fits in most budgets these days.

My favorite place to start is the US Bureau of the Census. There are huge data sets there and they are often quite interesting. Are your students discussing income, poverty, and health insurance with all that in the news? Would looking at Tables of Income by Detailed Socioeconomic Characteristics be helpful? It’s there! How about state population data? That is there too, as part of the 2009 Statistical Abstract data. Frankly, if you cannot find interesting data there, you’re just not looking. They have a census in schools page with information and resources for both teachers and students as well. Do your social studies teachers know about this? What a way to mix social studies with math and computer science. Or sociology!

Another great source of data sets is the U.S. Bureau of Labor Statistics. Take a look at the Databases, Tables & Calculators by Subject page for starters. And don’t forget the National Center for Education Statistics as well.

Now suppose you want something a little bit different. A company called AggData has some interesting data sets for sale. They have some free data sets as well. Complete List of Oscar Nominees and Winners? Free. Complete List of McDonald's Locations? $49, which is still pretty reasonable. What would you do with that? Well, since that data set includes geolocation data, you could compute and plot distances. Check out Where The Buffalo Roamed to read about someone who created a data visualization and calculated the farthest you can get from a McDonald's in the lower 48 states. The possibilities are endless.
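To make the distance idea concrete, here is a minimal Python sketch of how a student might start. It is not the method used in Where The Buffalo Roamed, just one common approach: the haversine formula for great-circle distance. The function names are my own, the Earth radius is an approximation, and the sketch assumes you have already loaded the store locations as a list of (latitude, longitude) pairs in degrees.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def nearest_distance(point, locations):
    """Distance in miles from point to the closest entry in locations."""
    lat, lon = point
    return min(haversine_miles(lat, lon, loc_lat, loc_lon)
               for loc_lat, loc_lon in locations)


def farthest_candidate(candidates, locations):
    """Of the candidate points, the one whose nearest location is farthest away."""
    return max(candidates, key=lambda p: nearest_distance(p, locations))
```

A student could build a grid of candidate points across a region, pass it to farthest_candidate along with the real store coordinates, and then discuss why a brute-force search over thousands of stores and thousands of grid points is exactly the kind of job that needs a computer.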

I encourage people to think about the data they use for projects. There is data out there that can spark the imagination. More than that, there is data out there that allows people to be creative and to build useful mashups. Exposing students to this sort of thing while they are young will help prepare them for topics like data mining and business intelligence later on.

Speaking of which, for university people interested in a lot of data there is the Microsoft Enterprise Consortium.

The Microsoft Enterprise Consortium is a joint program between the Department of Information Systems in the Sam M. Walton College of Business at the University of Arkansas, Fayetteville (Walton College) and Microsoft Corporation (MS). The purpose of this joint program is to provide the academic community with access to large and compelling real world datasets for both teaching and research. The datasets are to support business intelligence, data mining, database instruction, and data warehousing by university faculty and researchers.

The initial large live datasets include –

  • Sam’s Club Sales Transactions Database with 6 tables and more than 55 million rows.
  • Dillard’s Department Store Sales Database with 5 tables and more than 128 million rows.
  • COPA Frozen Foods, Inc. Financials with 6 dimension tables linked to a fact table containing almost 12 million rows.