Data Visualization Basics for Data Scientists


“A picture is worth a thousand words,” the old saying goes, and in some cases a picture is worth even more than that. The human eye is composed of some 30 or more discrete components and, together with the optic nerves and the brain functions that process sight, can take in a contrast ratio of around 100,000:1 (over time) and distinguish about 10 million colors. That sight-brain pathway is a pattern-matching wonder, with “regions of interest” that the eye/brain connection focuses on (http://www.cambridgeincolour.com/tutorials/cameras-vs-human-eye.htm).

As one of our primary senses, sight is immensely important to conveying information, and it’s vital for the Data Scientist to understand how best to use various visualizations to display and discuss data.

There are two main reasons we use visualizations: to explore data, and to explain data.

Visualizations to Explore Data

When you’re working with data and attempting to find an answer, there is sometimes no better way to “zero in” on what you want to find than by using a graphic. As you work with your data and computations, include graphics in strategic phases of your analysis to ensure you’re on the right track. Look up from the data from time to time and see where you are.

The graphics a Data Scientist uses for exploration tend to be direct and simple, and they focus on contrast and variance. For instance, if you were to look at a table of thousands of numbers, it might be difficult to spot a trend, but a chart shows it well:

[Image: R-Install-6]

The same holds true for finding differences in numerical sets, the size of categories in multiple sets, or relationships between sets.
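As a rough sketch of that idea, the snippet below takes a few thousand noisy numbers, averages them in buckets, and prints a quick text chart. The data is made up for illustration (a slow upward drift buried in noise), and the helper names `bucket_means` and `text_chart` are hypothetical; in practice you would reach for whatever plotting function comes with your tool.

```python
import random
import statistics


def bucket_means(values, buckets):
    """Split values into consecutive equal-sized buckets and average each one."""
    size = len(values) // buckets
    return [statistics.mean(values[i * size:(i + 1) * size]) for i in range(buckets)]


def text_chart(means, width=40):
    """Render one '#' bar per bucket, scaled to the largest mean (assumes positive means)."""
    top = max(means)
    lines = []
    for i, m in enumerate(means):
        bar = "#" * max(1, round(width * m / top))
        lines.append(f"bucket {i:2d} | {bar} {m:.1f}")
    return "\n".join(lines)


# Hypothetical data: 5,000 noisy readings with a slow upward drift.
rng = random.Random(42)
readings = [100 + 0.01 * i + rng.gauss(0, 25) for i in range(5000)]

means = bucket_means(readings, buckets=10)
print(text_chart(means))
```

Scanning the raw `readings` list, the drift is easy to miss because the noise is several times larger than the change between neighboring values. The ten bars make it visible at a glance.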

When the Data Scientist uses graphics to explore data, they usually stick with what comes “in the box” of the tool they are using – Excel, R, Python, and so on. Each data computation software package usually has at least some rudimentary graphing function, so that you can check the results of a calculation or grouping. I’ll blog about each of those in other posts.

These graphics form the basis of the next use for graphics in Data Science – explaining the data.

Visualizations to Explain Data

The Data Science process runs from properly acquiring, cleaning, processing, and evaluating data, through predicting from it, to presenting the results. Data Scientists must be well-versed in communicating complex processes and results to an audience that doesn’t always have the same background in statistics, math or data informatics. Graphics are a great way to do that. Analysis and prediction are pointless if not used, and the results are most often used by others, not you.

Two comments here. First, you need to “tart up” your visualizations. We live in an age of continuous partial attention: I once secretly timed a senior executive I was talking to and, on average, I had 20 seconds of her attention at a time before her pupil response showed she was off to another task. In an environment where smartphones leave people less engaged, graphics need to be large and information-dense, and they need to convey the meaning of your results effectively. Second, that does NOT mean you should focus on style over substance. Quite the opposite.

Visualization Basics

There are hundreds of resources you can find on creating effective visuals, so I’ll leave the advanced graphics topics to those. As a Data Scientist, you’ll want to ensure you have the basics nailed down well. I’ll focus on just three here: focusing on your goal, using the right graphic type, and showing the data in a different way.

Keep the goals for the visualization in mind

As you create your graphics, whether for exploration or for explanation, keep these questions in mind (tape them to your monitor as you work). All should be answered “yes”:

  • Does this graphic show the point I am trying to make?
  • Does it contain all the information it needs to make that point?
  • Have I removed everything that doesn’t need to be there? (unnecessary elements are also called chartjunk)
  • Am I using color and contrast effectively? (you need both, in case your audience can’t distinguish the colors)
  • Am I using the right amount of space? Can the eye take in the information in a single pass?
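The color-and-contrast question can be checked with a number. The sketch below uses the contrast-ratio formula from the WCAG 2.x accessibility guidelines, which is my choice of yardstick rather than anything prescribed in this post. The function names are hypothetical, and colors are assumed to be six-digit hex strings.

```python
def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB color given as '#rrggbb'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve so the channels are linear light values.
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors: 1 (identical) up to 21 (black on white)."""
    la = relative_luminance(color_a)
    lb = relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# Pure red next to pure green: very different hues, but low contrast.
print(round(contrast_ratio("#ff0000", "#00ff00"), 2))
# Dark blue next to pale yellow: different hues AND high contrast.
print(round(contrast_ratio("#003366", "#ffffcc"), 2))
```

Two series that differ only in hue can score poorly here, which is a hint that viewers who can’t distinguish those colors will struggle to tell them apart. Picking colors that also differ in lightness covers both cases.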

Use the appropriate graphic for your data type

I don’t have enough room to cover this basic concept properly (check out this book for more: http://www.amazon.com/Street-Journal-Guide-Information-Graphics/dp/0393347281/ref=pd_sim_14_2?ie=UTF8&dpID=51dTlBgg96L&dpSrc=sims&preST=_AC_UL160_SR121%2C160_&refRID=0MPTC8Z7ANAFZH0FX1R9), but in general you use line charts to show trends and change over time, bar charts for ranking and comparison, pie charts for ratios, scatterplots for correlation and centering, histograms for frequency distributions, and of course maps for geographical data.

Again, a full treatment of this topic would take a book, so get one of those and then get another one. And another one.
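Until you’ve read those books, the rules of thumb above fit in a small lookup table. This is only a sketch of the list in this section; the `CHART_FOR_GOAL` name, the keys, and the `suggest_chart` helper are my own shorthand, not a standard.

```python
# Rule-of-thumb mapping from "what am I trying to show?" to a chart type.
CHART_FOR_GOAL = {
    "trend": "line chart",
    "time": "line chart",
    "ranking": "bar chart",
    "comparison": "bar chart",
    "ratio": "pie chart",
    "correlation": "scatterplot",
    "centering": "scatterplot",
    "frequency distribution": "histogram",
    "geography": "map",
}


def suggest_chart(goal):
    """Return the rule-of-thumb chart type for a goal, or raise if it is unknown."""
    key = goal.strip().lower()
    if key not in CHART_FOR_GOAL:
        known = ", ".join(sorted(CHART_FOR_GOAL))
        raise KeyError(f"no rule of thumb for {goal!r}; known goals: {known}")
    return CHART_FOR_GOAL[key]


print(suggest_chart("Ranking"))
print(suggest_chart("frequency distribution"))
```

Treat the result as a starting point. The questions in the checklist earlier still decide whether the chart actually makes your point.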

Make the visualization show what the data process doesn’t

This is the one tip I would recommend to any aspiring Data Scientist. If the chart would show just a few numbers, just use the numbers. Most people can handle 5-10 numbers in a sequence. The entire point of visualization is to re-present data in a more understandable format, so focus on that as a goal. The simplest way to do that is to ask yourself “What did I have to do to get to my point?” and then show the result, not the process. I’ve seen data professionals lose an audience’s attention with dozens of graphics and infographics that don’t get to the point.

One final note

Not everyone can see. Not everyone has the same kind or level of vision. When you’re done, convert your graphics *back* into a sentence or paragraph describing your point. In fact, I often do this first, and then create the graphic. Consider the differences in your audience, and accommodate them. It’s extra work, and it’s worth it.
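One way to make that habit routine is to generate a first-draft sentence from the same data that feeds the chart. The helper below is a minimal sketch with made-up numbers. `describe_series` is a hypothetical name, and the wording it produces is only a starting point for the description you would actually write.

```python
def describe_series(title, labels, values, unit=""):
    """Return a one-sentence, plain-language summary of a labeled series."""
    if len(labels) != len(values) or len(values) < 2:
        raise ValueError("need matching labels and at least two values")

    first, last = values[0], values[-1]
    peak, low = max(values), min(values)
    peak_label = labels[values.index(peak)]
    low_label = labels[values.index(low)]

    if last == first:
        trend = f"was flat at {first}{unit} between {labels[0]} and {labels[-1]}"
    else:
        verb = "rose" if last > first else "fell"
        trend = f"{verb} from {first}{unit} in {labels[0]} to {last}{unit} in {labels[-1]}"
        if first != 0:
            pct = abs(last - first) / abs(first) * 100
            trend += f" (a {pct:.0f}% change)"

    return (f"{title} {trend}, with a high of {peak}{unit} in {peak_label} "
            f"and a low of {low}{unit} in {low_label}.")


# Hypothetical data for illustration only.
summary = describe_series(
    "Monthly signups", ["Jan", "Feb", "Mar", "Apr"], [120, 150, 140, 180]
)
print(summary)
```

The sentence can double as alternative text for the image or as the caption under it. If you can’t write it, the graphic probably isn’t making a clear point yet.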

