Tom's Ten Data Tips - February 2010
Data Visualization
Data visualization is about presenting information in "some" graphical form. The human eye is a very powerful pattern detection instrument. When you transform a table into a graph, for instance, you don't add new information, but for many people it is easier to see long-term trends and individual dips and spikes. Visualizations are an important and valuable analytic tool that enables discerning patterns in data that would otherwise be extremely difficult to find.
Computing power and database technology applied to large data sets create high-dimensional images and animations. This leads to new insights and to the generation of hypotheses. Visualization is crucial to the discovery of patterns or relationships between data elements or facts. Non-textual representations transcend language to provide insight. There's some irony in writing about data visualization: this is inherently a topic you need to see, not read about.
1. Distinguish "Presentation" From "Visualization"
There are two different purposes for which we use visuals: one is to transfer information or persuade a listener; the other is to support your own thinking. Both are perfectly valid uses. A presentation can be a sales pitch, a quarterly report, etc. Its purpose is to convey information, typically to communicate or persuade. Examples of visualizations are mind maps, geographic charts, etc. They are used to do work: a mind map helps you organize your thoughts, and a geographic chart helps you determine your route.
Whereas a presentation is more reactive (the recipient "absorbs"), a visualization requires active processing of the picture. You can merely "consume" a presentation, but you need to actively engage with a visualization for it to have any use.
2. Animations Are Images With An Extra Dimension Superimposed: Time
Some advanced visualization tools can generate animations as well as images. What this does, really, is add another dimension on top of (all) the other ones: time. As such it extends tip #4, to convey as much information as possible. Time is a strange beast, though, in that temporal separation is a necessary but not sufficient condition for inferring causality.
Animations work best when multiple variables change through time, especially if these changes are out of sync with each other. If the change over time in a single variable is the defining characteristic you're trying to get across, there may well be other ways to stress that particular point, such as a static visualization instead of an animation.
3. Plot Proportionately To Mitigate Distortion
Graphics, unfortunately, are commonly associated with deception. And indeed, many "clever" examples have been used. See for example: Huff (1993) How to Lie with Statistics. But you can also deceive with words, of course. How can you minimize the risk(s) of misinterpreting graphs? Tufte (2001) The Visual Display of Quantitative Information provides some guidelines:
- Physical representation of numbers should be directly proportionate to numerical quantities
- Provide clear, detailed, and thorough labeling on the graphic itself
- Use consistent design throughout a graph
- Time series with money are nearly always better with deflated and standardized units
- Represent one-dimensional data in one dimension, two-dimensional data in two dimensions, etc.
- Graphics must not quote data out of context
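The guideline on deflated units comes down to simple arithmetic. Here is a minimal sketch, assuming you have a price index for each period; the function name deflate and all figures are hypothetical, chosen only to show how nominal growth can hide a real decline.

```python
def deflate(nominal, price_index, base_period):
    """Express nominal amounts in constant money of base_period.

    nominal and price_index are dicts keyed by period.
    """
    base = price_index[base_period]
    return {period: amount * base / price_index[period]
            for period, amount in nominal.items()}


# Hypothetical figures, for illustration only.
revenue = {2000: 100.0, 2005: 120.0, 2010: 150.0}
index = {2000: 80.0, 2005: 100.0, 2010: 125.0}

real = deflate(revenue, index, base_period=2010)
for year in sorted(real):
    print(year, revenue[year], round(real[year], 2))
```

Plotted in nominal terms, this series rises by half. In constant 2010 money it actually slips from 156.25 to 150.0.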
4. Visualization Is About Reducing Dimensionality
Sometimes a research conclusion has more than two or three dimensions to it. One of my statistics teachers loved the example of how extramarital adventure impacted divorce rates, which was different for couples who did or did not engage in pre-marital sex (the latter had much higher divorce rates when confronted with adultery). You can display this in a multi-way table, but those are hard to read. Once you know how to "read" a plot (Homals in this case), you "see" the association instantly (in two dimensions). And, you can also display many more associations in the same plot. Homals is one of a "family" of statistical procedures that can be used to graphically display non-linear multi-variate data. As such it is an extension of correspondence analysis (graphically displaying tables).
5. Assessing Size Goes Better Visually
Tables have many advantages, and the ability to look up an exact number is often a requirement that calls for a table. However, people do not read tables very easily. Tables can lead you to miss important patterns.
In particular, when you want the reader to gauge relative size, the mental arithmetic required makes that hard and error-prone. If you have numbers that are orders of magnitude apart, tables are downright deceptive. When a table is necessary in such cases, you probably want to combine it with a graph.
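One way to combine a table with a graph is to put a proportional bar next to each number. The sketch below is only an illustration: text_bars is a hypothetical helper, and the regions and amounts are made up.

```python
def text_bars(values, width=50):
    """Return one line per item: label, exact number, proportional bar."""
    largest = max(values.values())
    lines = []
    for label, value in values.items():
        # Bar length is directly proportional to the value.
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<10}{value:>12,} {bar}")
    return lines


# Hypothetical figures spanning several orders of magnitude.
sales = {"Region A": 1_200_000, "Region B": 95_000, "Region C": 8_400}

lines = text_bars(sales)
print("\n".join(lines))
```

In the number column the three values look comparable at a glance, since they differ by only a few digits. The bars make the gap obvious: the smallest region does not even earn a single mark.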
6. Less Is More (Principle Of Minimum Contrast)
Much the same principles hold for data visualization as for designing interfaces. "Flashy" graphs tend to get in the way of the data, obscuring patterns you ought to be looking for. Three-dimensional bar charts, as opposed to "simple" ones, are (usually) a great (that is, terrible) example. The best designs are often the simplest, most sober representations.
Now, if the depth of the bars were variable, and represented some other measure, 3D might be useful. By clearing superfluous elements from a design, the essential ingredients will stand out more prominently. The Google search bar, for instance, stands out more because of all the whitespace surrounding it.
7. When You Need Numbers, They Should Be Available
One of the benefits of visualization is that looking for "patterns" masks a lot of detail. An X or Y axis with just a few tick marks is sufficient to assess the order of magnitude. However, this detail should be available upon request, preferably in some easy way like hovering over a data point with your mouse. Make sure that the underlying detail is presented at the appropriate granularity. "Correct granularity" means choosing 'the right' number of digits. Sometimes fractions are desirable; in that case, settle on the number of decimal places. For large numbers, you may not want to display the full integer, but instead represent thousands or millions.
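As a rough sketch of what "representing thousands or millions" could look like in a label or tooltip, consider the hypothetical helper below. The name humanize, the K/M suffixes, and the choice of one decimal place are assumptions, not a standard.

```python
def humanize(n):
    """Abbreviate large numbers to thousands (K) or millions (M)."""
    for divisor, suffix in ((1_000_000, "M"), (1_000, "K")):
        if abs(n) >= divisor:
            # One decimal place is an arbitrary choice of granularity.
            return f"{n / divisor:.1f}{suffix}"
    return str(n)


for amount in (1_234_567, 45_300, 812):
    print(amount, "->", humanize(amount))
```

The exact value can still be kept alongside the abbreviated label, so it remains available when the reader asks for it.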
One application we worked with, for instance, displayed ZIP codes (a nominal numerical variable) with two decimal digits: 10001.00 for
8. Interactive Visualization Is OLAP Along "Different" Dimensions
Some data visualization tools allow users to interact with the image (or even animation). Instead of looking at the "complete" picture, you may choose to select only males or females, for instance, and reproduce "the same" image again for this subset of the data. Or you click on a (sub)region within a GIS, etc. This equates to drilling down in an OLAP cube. Except that the interface looks different (more intuitive?), the operation is exactly the same.
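To make the equivalence concrete: selecting a subset in the picture amounts to running the same aggregation with an extra filter. This is a minimal sketch in plain Python, not how any particular tool implements it; the summarize function and the rows are hypothetical.

```python
from collections import defaultdict


def summarize(rows, by, measure, **filters):
    """Total `measure` per value of `by`, for rows matching all filters."""
    totals = defaultdict(float)
    for row in rows:
        if all(row[key] == value for key, value in filters.items()):
            totals[row[by]] += row[measure]
    return dict(totals)


# Hypothetical fact rows, for illustration only.
rows = [
    {"region": "North", "sex": "F", "sales": 10.0},
    {"region": "North", "sex": "M", "sales": 7.0},
    {"region": "South", "sex": "F", "sales": 4.0},
    {"region": "South", "sex": "M", "sales": 9.0},
]

overall = summarize(rows, by="region", measure="sales")
females = summarize(rows, by="region", measure="sales", sex="F")
print(overall)
print(females)
```

Both calls produce data for "the same" chart (sales by region). The second simply restricts the rows first, which is what clicking on "females" in an interactive plot would do.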
9. Good Data Visualization Software Combines Graphics With OLAP Functionality
A remarkable strength of visualization software is the ability to point out what's abnormal or unusual in the data. For the same reason these tools are excellent (and often used) for detecting outliers. Outliers are data points that are suspiciously distant from others, so much so that their inclusion as legitimate values warrants caution.
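What counts as "suspiciously distant" is a judgment call. One common convention, the one box plots typically use, flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes that rule; the function name and the data are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one value far from the rest.
data = [12, 13, 12, 14, 13, 15, 14, 13, 98]
print(tukey_outliers(data))
```

A flagged value is a candidate for a closer look, not automatically an error.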
Data discovery is an iterative process. Important detail may be covered up in the aggregate. When some exceptional pattern draws attention, you'll want to "zoom in" to satisfy your curiosity. This very closely resembles slice-and-dice and drill-across operations in an OLAP tool (see also a previous newsletter on OLAP).
10. It's All About The Content
Data presented in graphical form should draw attention to the quantitative contents. With the rise of presentation software, incredibly rich in features and animations, there is a tremendous "seduction" to liven up graphics. In the corporate world (and even in high school!) it has become the norm to present using tools like PowerPoint or Keynote, and so a "culture" of decoration and superfluous niceties has evolved. The function of graphics is to help people reason about data. Tufte (2001) The Visual Display of Quantitative Information: "Above all else show the data."
Further reading
Some excellent books on Data Visualization:
Envisioning Information.
Edward Tufte (1990)
ISBN# 0961392118
Visual Explanations.
Edward Tufte (1997)
ISBN# 0961392126
The Visual Display of Quantitative Information, 2nd Edition.
Edward Tufte (2001)
ISBN# 0961392142
Beautiful Evidence.
Edward Tufte (2006)
ISBN# 0961392177
Information Architects.
Richard Saul Wurman (1997)
ISBN# 3857094583
Now You See It: Simple Visualization Techniques.
Stephen Few (2009)
ISBN# 0970601980
Show Me The Numbers: Designing Tables and Graphs to Enlighten.
Stephen Few (2004)
ISBN# 0970601999
Information Dashboard Design: The Effective Visual Communication of Data.
Stephen Few (2006)
ISBN# 0596100167
How to Lie with Statistics.
Darrell Huff & Irving Geis (1993)
ISBN# 0393310728


