It’s About the Data

Popfly, a mashup tool, depends on three things: data that is simple to access programmatically, interesting, and available under terms that enable users to work with it. As with most software endeavors, you can pick two.

The government has a huge amount of interesting data that’s available under really great terms. Weather? Financial information? Crime statistics? It’s all out there if you dig around. But how much of this is programmatically accessible? Very little, as it turns out. I’ll pick on NOAA for a little bit. They have great weather information, enough that you can find out whether the weather this weekend is going to be good at the local fishing holes and whether the fish will be biting. But, despite the RSS feeds, the really interesting data (the forecast and the information about water conditions) is locked up in a combination of HTML, JavaScript, and GIFs. If you play with EDGAR (for information on SEC filings), you’ll find a confusing array of HTML, static XML, and .txt files.
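To make “simple to access programmatically” concrete, here is a minimal sketch of what reading a feed looks like when the data is structured. It assumes a feed shaped like ordinary RSS 2.0; the sample feed, its titles, and its example.com links are invented for illustration and are not NOAA’s actual feed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet: titles and links are made up for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example forecast feed</title>
    <item>
      <title>Saturday: sunny, high 72</title>
      <link>http://example.com/forecast/saturday</link>
    </item>
    <item>
      <title>Sunday: showers, high 65</title>
      <link>http://example.com/forecast/sunday</link>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in read_items(SAMPLE_RSS):
    print(title, link)
```

A handful of lines gets you every item in the feed. Getting the same information out of a page built from HTML, JavaScript, and GIFs means writing and maintaining a scraper for that one page, which is the “far too hard” part.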

Yes, you can program your way out of these, but it’s far too hard. Entire organizations such as the Sunlight Foundation are trying to change this, and Lawrence Lessig and other open-government advocates have proposed the Open Government Data Principles. And that’s great. But it’s not enough, because it’s not just the government.

I’ll take another example. Let’s say that you want to create an application that checks your favorite online bookstore for the books it recommends you purchase next, then submits that list to your local library to see which books are in, and maybe even offers to put one on reserve. This is an example that Jon Udell outlined something like six years ago. Unfortunately, when you think about it, the bookseller really doesn’t want you to use the local library: they want you to buy books from them. So you would expect the terms of use for the bookseller’s APIs to indicate that scenarios that take you away from their site will be frowned upon. This makes sense to me, since they’re a business, but it’s a case where the data is interesting and available while the terms are too restrictive for the scenario I’d like to build.

Oh, it’s not just the booksellers who have terms like that. Any site that makes money off of advertising, for example, is going to have holdbacks in its API terms — limits on how many calls you can make to the APIs in a given time period or how many results can be returned, or how their brand has to be shown in the resulting mashup, and so on.

As I read a Dow Jones Insight Election Pulse blog post about how much time each candidate spends talking about different issues, I thought, “There’s an interesting mashup that I would have loved to build.” But the information to create that mashup isn’t easily accessible to tools.

Why must good data be so hard to find?
