“You know, you look nothing like your pictures.”
Kevin Flynn, Tron
At the end of this session you should be able to
The term visual analytics refers to an interactive, computer-based procedure for analyzing data sets. As a scientific keyword, it is relatively new: Thomson Reuters' Web of Knowledge lists only 394 entries, the earliest from 2002. If non-ISI journals and other sources are also considered, Google Scholar returns 9,640 results, but only 35 of them date from before 2000.
From a geographical point of view, one might think that interactive and visualization-based data analysis is quite common in the field of Geographical Information Science. However, many maps only show the final result of an investigation and are not actually used within the process of analysis. To illustrate this point, have a look at the well-known map of [Snow1855] below:
By mapping the cholera deaths, Snow concluded from the spatial pattern (black squares) that the Broad Street pump (roughly at the center of the outbreak) was the source of the 1854 cholera epidemic in London.
If you want to read more on the visual analytics paradigm, [Fox2011] is a good starting point. If you want to read an entire book with a variety of topic-related chapters, [Keim2010] is freely available as PDF.
Certainly, every one of you has quite some experience in visualizing non-spatial data. In general, visualization should be guided by the guidelines given in [Kelleher2011].
Keep these guidelines in mind when you start visualizing data.
Before we focus on R's basic gallery of plotting types, we approach the subject with some more general examples.
Have a look at C09-4 - Visualization (traps) now for some notes on plots, color and animations.
While not all of the examples in E06-1 should be generally avoided, there is more to think about when it comes to visualization of data sets.
Have a look at [Kelleher2011] now for some short visualization guidelines.
While you surely know a variety of different plotting types, some visualization ideas might not come to mind simply because you have never seen them before. Of course, search engines are your friend, but you might also have a look at e.g. flowingdata.com for some input on this subject. We noticed that web page in a presentation by Hadley Wickham, the programmer of ggplot2, who also has some nice online courses.
Before starting with plotting functions in R, just one final remark: of course there are other ways to visualize your data, and you should take whatever works best for you (although that might well be R). For an overview of visualization tools aside from R, have a look at this page. For visualizing public data, you might also directly use Google's public data viewer.
R offers a large palette of options for visualizing data. The generic plotting routines from the graphics package are probably the most frequently used functions. For specific purposes, however, especially when it comes to publication-quality figures, other packages are likely to be the better choice. Above all, the lattice and the ggplot2 packages will come to your attention if you look for visualization functions in this context.
As part of this course, we will focus on the generic plotting functions and also provide some help on the usage of the lattice package. Of course, all of our visualizations can also be produced with ggplot2.
The basic command structure for visualizing data using the generic functions is
<name of plotting function>(<x-axis data>, <y-axis data>,…)
while the structure for the lattice package is
<name of plotting function>(<y-axis data> ~ <x-axis data>, …)
which is not too difficult to distinguish.
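The two call structures above can be compared side by side in a minimal sketch. The data (`x`, `y`) are made up for illustration and are not part of the course material; the lattice package is assumed to be installed, as it is in standard R distributions.

```r
# Illustrative data (an assumption, not from the course material):
# 50 random x values and a noisy linear response.
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 * x + rnorm(50)

# Generic (graphics package): x-axis data first, then y-axis data
plot(x, y, main = "Generic plot()")

# lattice: formula interface, y-axis data ~ x-axis data
library(lattice)
p <- xyplot(y ~ x, main = "lattice xyplot()")
print(p)  # lattice functions return an object; print() draws it
```

Note that a lattice call returns a plot object, so inside scripts or functions it usually has to be printed explicitly, whereas the generic functions draw directly.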
For visualizing non-spatial data sets, the following overview provides you with the most important plots/functions.
Plot type | Generic plotting function | Lattice function |
---|---|---|
scatter plots | plot() | xyplot() |
box and whisker plots | boxplot() | bwplot() |
histograms and density plots | hist() | histogram() and densityplot() |
Have a look at C09-1 - Generic plotting functions now for more information on the generic plotting functions.
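As a rough illustration of the functions in the overview table, the following sketch draws box plots, histograms and a density plot with both interfaces. The data frame `df` and its columns `value` and `group` are hypothetical and only serve as an example.

```r
library(lattice)

set.seed(42)
# Hypothetical example data: one numeric variable measured in three groups
df <- data.frame(
  value = c(rnorm(30, mean = 5), rnorm(30, mean = 7), rnorm(30, mean = 6)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Box and whisker plots
boxplot(value ~ group, data = df)         # generic
print(bwplot(value ~ group, data = df))   # lattice

# Histograms
h <- hist(df$value)                       # generic; also returns the bin counts
print(histogram(~ value, data = df))      # lattice

# Density plot (lattice)
print(densityplot(~ value, data = df))
```

The generic boxplot() also accepts a formula, which is convenient for grouped data; the one-sided formulas (`~ value`) are used by lattice when only a single variable is plotted.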
As soon as you apply some kind of transformation function (e.g. log, square root) to your original data values, the axis scales in a visualization change as a consequence. Hence, you can no longer directly read the original value at a certain position on your axis. Fortunately, there is a simple solution to this problem. Just define your own ticks (i.e. the positions at which a value is drawn on your axis) and labels (i.e. the character or numeric value which is drawn at a tick) and add them to your plot instead of the default axis, which shows the transformed values.
Please have a look at C09-2 - Generic axis labeling for an example on that topic using R's generic plotting functions.
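A minimal sketch of custom axis labeling for transformed data could look as follows. It assumes a log10 transformation and made-up data; the names `tick_labels` and `tick_pos` are chosen for illustration only.

```r
set.seed(7)
# Hypothetical skewed data spanning several orders of magnitude
x <- 1:100
y <- exp(rnorm(100, mean = 4, sd = 1.5))

# Plot the log10-transformed values, but suppress the default y axis
plot(x, log10(y), yaxt = "n", ylab = "y (original units)")

# Labels in original units and their positions on the transformed scale
tick_labels <- c(1, 10, 100, 1000, 10000)
tick_pos <- log10(tick_labels)

# Draw the custom axis: 'at' = tick positions, 'labels' = values shown there
axis(side = 2, at = tick_pos, labels = tick_labels)
```

The key point is that the tick positions must be given on the transformed scale, while the labels carry the original values.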
Another feature you might have missed so far is the ability to add e.g. multiple lines to the same scatter plot or to draw certain groups of symbols in certain colors. Fortunately, the solution to this problem is again quite straightforward. It always consists of two parts:
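One common way to do this with R's generic functions, shown here as a sketch under the assumption that the two parts are (1) creating the plot with a high-level function and (2) adding further elements with low-level functions such as lines(), points() or legend(). The data and color choices are made up for illustration.

```r
set.seed(3)
x <- seq(0, 10, length.out = 50)
y1 <- sin(x)
y2 <- cos(x)

# Part 1: create the plot with a high-level function;
# ylim is set so that both data series fit into the plot region
plot(x, y1, type = "l", col = "blue", ylim = range(c(y1, y2)), ylab = "y")

# Part 2: add further elements with low-level functions
lines(x, y2, col = "red")
legend("topright", legend = c("sin(x)", "cos(x)"),
       col = c("blue", "red"), lty = 1)

# Coloring symbols by group: index a color vector with a factor
group <- factor(sample(c("a", "b", "c"), 50, replace = TRUE))
group_cols <- c("black", "orange", "darkgreen")
point_cols <- group_cols[group]
plot(x, y1 + rnorm(50, sd = 0.2), col = point_cols, pch = 19, ylab = "y")
```

Indexing `group_cols` with the factor uses the factor's integer codes, so each observation receives the color of its group.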
Finally, you may also want to place not just one but multiple plots on a single page. Again, the solution is quite simple. Just divide your page into a grid by defining a number of rows and columns and then draw one plot into each grid cell.
Please have a look at C09-3 - Multiple generic plots for an example on that topic using R's generic plotting functions.
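For the generic plotting functions, one way to set up such a grid is the `mfrow` graphics parameter. The following is a minimal sketch with made-up data; the variable names are illustrative.

```r
set.seed(11)
v <- rnorm(100)

# Divide the page into 2 rows x 2 columns; par() returns the old settings
old_par <- par(mfrow = c(2, 2))

plot(v, main = "Scatter plot")       # row 1, column 1
hist(v, main = "Histogram")          # row 1, column 2
boxplot(v, main = "Box plot")        # row 2, column 1
plot(density(v), main = "Density")   # row 2, column 2

layout_now <- par("mfrow")           # the grid currently in use
par(old_par)                         # restore the previous layout
```

With `mfrow` the cells are filled row by row; `mfcol` fills them column by column instead. This setting only affects the generic plotting functions, not lattice plots.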