I commonly get asked where to find data. For most Business Analysts, the best data sources are found within the enterprise: operational data in operational databases (ideally transformed for easier, more consistent consumption in a data mart or data warehouse) and exhaust data housed in a data lake. Locating this data can be tricky, but ask around and you should be able to identify where it resides and start the process of obtaining access to it.
Outside your enterprise, there exist numerous sources of potentially useful data. Some of it is free, but the best stuff is not. In this post, I want to compile a list of potentially useful data sources based on my personal, and therefore limited, experience. If you have suggestions for other useful data sources, please leave a comment and I will consider adding them here.
That said, I am not interested in academic data sets, e.g. iris petal and sepal lengths, which are commonly used by analysts to learn analytic techniques. Most analytics tools will provide you quick access to this stuff. I want to limit this list to data you might actually use to answer a business question.
Let's start with the for-fee data sets. These are curated by private companies and are typically built using a combination of public and proprietary data. Their quality is generally good, but they may include derived information whose accuracy or applicability may be questioned. Still, these tend to be some of the best quality data sets available to analysts:
- Nielsen (Consumer Data)
- IRI (Consumer Data)
- Acxiom (Consumer & Online Data)
- BlueKai (Online Data)
- Melissa Data (Consumer Data)
- Dun & Bradstreet (Business Data)
- OnTerra Systems (Geographical Data)
- Trimble (Geographical Data)
This list is very small compared to what is actually available. More and more businesses are starting to deliver their data as a service to other businesses. Searching the websites of companies you suspect may have interesting data can help you uncover quite a number of resources.
Web & Social Media Data
Many web and social media sites provide APIs against which developers may build a variety of applications. The Programmable Web provides a registry for a very large number of these APIs, though you might try connecting directly to the documentation by going to dev., developer. or developers. followed by the website name; this is the usual location for this kind of thing. (The Pinterest API, frustratingly, doesn't follow this norm and is not registered with the Programmable Web.)
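If you want to try the dev., developer. and developers. prefixes quickly, here is a minimal Python sketch that builds the three candidate addresses for a site. The function name `candidate_developer_urls` and the example.com domain are my own placeholders, and there is no guarantee that any of the generated addresses actually exists for a given site.

```python
def candidate_developer_urls(site: str) -> list[str]:
    """Return the conventional developer-portal URLs to try for a site."""
    # Normalize input such as "https://www.Example.com/path" to "example.com".
    host = site.strip().lower()
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    host = host.split("/", 1)[0]
    if host.startswith("www."):
        host = host[len("www."):]
    # Try the three prefixes mentioned above, in order.
    prefixes = ("dev", "developer", "developers")
    return [f"https://{prefix}.{host}" for prefix in prefixes]


# example.com is a placeholder; substitute the site you are interested in.
for url in candidate_developer_urls("www.example.com"):
    print(url)
```

Paste each printed address into a browser and see which one, if any, lands on API documentation.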
Using these APIs can get very technical for most Business Analysts, so you might need to engage a developer to help you retrieve the data you are looking for. Also, READ THE TERMS AND CONDITIONS associated with these APIs very carefully to ensure that how you are using the API and the extracted data is appropriate. Folks have been known to sue when this data isn't used appropriately.
Finally, be aware that organizations throttle the use of their APIs. You need to be mindful of this lest you find yourself locked out. If you need to consume data at a faster rate than the API can support, dig around to see if you can find a clearinghouse associated with the site. Twitter, for example, provides full content access through Gnip (for a fee).
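For the developer you engage, here is a rough Python sketch of one common way to stay on the right side of a throttle: wait and retry with increasingly long pauses whenever the API says you are going too fast. It assumes the API signals throttling with HTTP status 429 (Too Many Requests), which many do but not all, so check the documentation for the specific API. The `fetch` callable, the function name, and the delay values are hypothetical stand-ins rather than part of any particular API.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it succeeds, pausing longer after each throttled try.

    fetch is any function returning (http_status, body); in real use it
    would wrap the actual API request.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status != 429:
            # Not a throttling response, so retrying will not help.
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            # Exponential backoff: 1s, 2s, 4s, ... with the default base_delay.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still throttled after all attempts")


# Simulated API that throttles the first two calls, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "ok")])
print(fetch_with_backoff(lambda: next(responses), base_delay=0.01))
```

Some APIs also tell you how long to wait or how many calls you have left; when they do, honoring that is better than guessing.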
More and more government agencies are providing access to their data in the interest of transparency. In the US, data.gov provides a clearinghouse for many of these data sets. Other jurisdictions may post their data to similarly named websites, e.g. the State of Texas provides data access at data.texas.gov. Still others bury their data in poorly designed government websites. The Socrata website provides a good starting point for locating these data.
Returning to the US federal level, I have found that going to specific agency websites to locate data can be easier than using data.gov, which often has a bunch of broken links. Specific sites I have found helpful include:
Before spending too much time on government data, please keep in mind that these data sets do not typically contain granular information. Instead, they often represent summarized subsets of data as found in a specific report. The structure of these data can also be a bit jacked-up, as in "who in their right mind would organize a data set like this?!" Complex data formats such as XML and JSON may also be problematic. Quite often, you will find yourself exerting a lot of effort to prep this data in order to answer a limited number of questions. You've been warned 🙂
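To give a feel for the prep work involved with a format like JSON, here is a small Python sketch that flattens nested records into simple rows you could load into a spreadsheet or table. The sample data, field names, and the `flatten` helper are made up for illustration; real government files tend to be messier and will usually need more handling than this.

```python
import json


def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Collapse nested dictionaries into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent name along.
            flat.update(flatten(value, name, sep))
        else:
            # Plain values (and lists) are kept as they are.
            flat[name] = value
    return flat


# Made-up sample resembling a summarized report delivered as JSON.
raw = """
[
  {"region": "South", "totals": {"2019": 10, "2020": 12}},
  {"region": "North", "totals": {"2019": 7, "2020": 9}}
]
"""

rows = [flatten(record) for record in json.loads(raw)]
for row in rows:
    print(row)
```

Each row now has the columns region, totals.2019 and totals.2020, which is much closer to what most analytics tools expect.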
A number of cloud service providers are attempting to solve the problem of finding both public and syndicated data. To this end, they have set up marketplaces where you can search for, consume and, if needed, pay for data access in a more consistent manner. Here are a few of these:
Other Interesting Places to Look for Data
Most of the data sets I've listed in this post are data sets you could see yourself going back to on a regular basis. Every now and then, you just need an interesting one-off data set in order to demonstrate a concept or prove a point. I have found these to be interesting starting points for finding this kind of data: