Making Data Visually Appealing

I’ve recently been considering the graphical presentation of data. I get the feeling that we, ecologists/scientists, could be better at data presentation. Graphs must be informative, but they don’t have to be ugly. I think that making visually appealing charts and graphs goes a long way towards making science accessible. So my posts will concentrate on making what I consider to be informative AND visually appealing graphs. That means NO bar graphs…

This first blog post will describe my recent foray into a new graphics package in R, ggplot2. Actually, ggplot2 isn’t new at all, but it’s new to me. I’ve been incredibly happy with R’s base graphics and the lattice extensions. Considering that ggplot2 required learning a whole new syntax, I never really saw the advantage. However, I’ve recently begun playing with it because I’m interested in making rank clocks (see Collins et al. 2008 in Ecology). The reason I like ggplot is that it makes very appealing graphs incredibly easily. As I’ve begun messing with it, I’ve discovered ggplot is fairly simple to pick up for making basic plots. Of course, like anything in R, making more complicated plots gets more, well, complicated quite quickly. I’ve been reading through the book on ggplot by Wickham. It’s very comprehensive, but also quite dense. So over the next few posts (hopefully), I’ll describe my foray into ggplot2 and what I’ve learned.

We’ll concentrate on two chart types: scatterplots (continuous x and continuous y) and boxplots (categorical x and continuous y). I like boxplots for a variety of reasons, including the fact that they are more informative, less wasteful, and prettier than bar charts, which are (in my opinion) as uninformative a graph as you can make. Don’t get me wrong, bar charts have uses, but they are limited.

SCATTERPLOT

Let’s simulate some data. We’ll make a dataframe. I make a dataframe for two main reasons: 1) Most of the time when you’re working with data, it’ll be in a

category <- rep(c('A','B','C'), each=10)
x <- runif(30, 1, 15)
y <- 5 + 2*x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)
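
Because runif() and rnorm() draw fresh random numbers on every run, your plots will differ slightly from the ones shown here. A minimal sketch, assuming you want reproducible figures: set a seed before simulating (the value 42 is an arbitrary choice of mine, not something the plots in this post depend on), then check what the dataframe looks like.

```r
# Fix the random number generator so the simulated data are reproducible.
# The seed value (42) is an arbitrary, illustrative choice.
set.seed(42)

category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

# Quick checks on what we built
str(plotData)             # column types and the first few values
table(plotData$category)  # number of rows in each category
```

Rerunning the block with the same seed gives the same dataframe each time.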

Here’s the basic R scatterplot with a few adjustments.

par(mar=c(4,4,1,1)+0.2)
plot(x,y,pch=16)

Rscatterbasic

This is about as basic as it gets in R. Not bad. Worthy of publication in a journal with a few more tweaks. But this isn’t going to catch any eyes. The ggplot2 scatterplot, on the other hand, will. First, let’s get a hold of the syntax. The first thing you do is make a ggplot object (I’m going to ignore the quick plot, qplot( ), command because it’s better to learn the full syntax).

library(ggplot2)  # load the package first
p <- ggplot(plotData, aes(x,y))

This just says: “Make a ggplot object. The data is stored in plotData, variable ‘x’ is on the x-axis, and variable ‘y’ is on the y-axis.” That’s it. It has not actually specified what KIND of plot to draw. ggplot has several kinds of plots, called geometries, or geoms, including bar (geom_bar), scatterplot (geom_point), box (geom_boxplot), and others. The intuitive thing about ggplot is that it operates in layers. Next we can say ‘take the ggplot object and put a scatterplot layer on it’.

 p + geom_point()
 

ggplot1
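
To make the layer idea concrete, here is a small sketch that peeks inside the plot object. It assumes the object exposes its layers as a list via `$layers`, which is how ggplot2 has stored them in the versions I know of; treat that as an implementation detail rather than something to build on.

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

p <- ggplot(plotData, aes(x, y))

# The bare object knows the data and the axes, but has no layers yet
length(p$layers)

# Adding a geom returns a new object with one layer; p itself is unchanged
p_points <- p + geom_point()
length(p_points$layers)

# Printing the object is what actually draws the plot
print(p_points)
```

Inside a function or a sourced script nothing is drawn until the object is printed, which is why the explicit print() is there.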

Already that’s better looking. We can adjust the size of the points:

 p + geom_point(size=3)
 

There are a number of other variables to adjust, but I’ll leave that to you.
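
As a starting point, here is a sketch with a few of those other settings. The particular values (triangle shape, 60% opacity, 'steelblue') are arbitrary choices for illustration.

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

p <- ggplot(plotData, aes(x, y))

# Fixed (non-data) settings are passed directly to the geom:
#   shape = 17 is a filled triangle
#   alpha sets transparency (0 = invisible, 1 = opaque)
#   color takes any R color name or hex code
p_styled <- p + geom_point(size = 3, shape = 17, alpha = 0.6, color = 'steelblue')
print(p_styled)
```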

Here’s where ggplot gets better. It’s very easy to group the points. In base R, it takes some work: you have to specify colors manually based on the grouping factor, etc. In lattice, it’s easier, but it takes a lot of work to make the plot visually pleasing (the default colors in lattice are... interesting).
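
For comparison, this is roughly what the manual base R route looks like. It’s only a sketch: the three colors are hand-picked for illustration, and the variable names (groups, groupColors, pointColors) are mine.

```r
# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

# Convert the grouping variable to a factor, then use its integer codes
# to index into a hand-picked color vector (one color per level)
groups <- factor(plotData$category)
groupColors <- c('tomato', 'seagreen', 'steelblue')
pointColors <- groupColors[as.integer(groups)]

plot(plotData$x, plotData$y, pch = 16, col = pointColors,
     xlab = 'x', ylab = 'y')

# The legend also has to be built by hand
legend('topleft', legend = levels(groups), col = groupColors,
       pch = 16, title = 'category')
```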

In ggplot, all we have to do is add in a ‘color’ argument.

 p + geom_point(aes(color=category), size=3)
 

ggplotScatterColor

The color argument goes inside aes() because you’re mapping an aesthetic of the graph (point color) to a variable in the data (category), rather than setting one fixed color. That’s a much nicer plot.
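
The difference between putting color inside and outside aes() is easy to trip over, so here is a short side-by-side sketch. The object names p_mapped and p_fixed are just for illustration.

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

p <- ggplot(plotData, aes(x, y))

# Mapping: color depends on the data, so it goes inside aes().
# ggplot2 picks one color per category and adds a legend.
p_mapped <- p + geom_point(aes(color = category), size = 3)

# Setting: one constant color for every point goes outside aes().
# No legend is drawn because nothing is mapped to color.
p_fixed <- p + geom_point(color = 'blue', size = 3)

print(p_mapped)
print(p_fixed)
```

A common mistake is writing aes(color = 'blue'), which treats 'blue' as a data value with a single level instead of as a color name.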

Adding in a linear regression line is as simple as adding in another geom:

 p + geom_point(aes(color=category), size=3) + geom_smooth(method='lm')
 

ggplotScatterLine

With only a tad bit of effort we have a good-looking scatterplot! Granted, it could use a little dusting up, but you can see the difference between the default graphs of base graphics and ggplot2.
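
One variation worth knowing: layers inherit whatever is mapped in the ggplot() call. In the plot above, color was mapped only inside geom_point(), so geom_smooth() fit a single line to all the points. A sketch of the alternative, where the mapping is moved up so each category gets its own line (se = FALSE just hides the confidence band):

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

# Mapping color in ggplot() makes every layer inherit it, so geom_smooth()
# fits a separate linear model for each category
pGrouped <- ggplot(plotData, aes(x, y, color = category)) +
  geom_point(size = 3) +
  geom_smooth(method = 'lm', se = FALSE)
print(pGrouped)

# The single line in the ungrouped version corresponds to this fit.
# Given how the data were simulated, expect an intercept near 5
# and a slope near 2.
overallFit <- coef(lm(y ~ x, data = plotData))
overallFit
```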

BOXPLOT

Here’s R’s default boxplot:

 boxplot(y ~ category)
 

rboxplot

Again, it’s not terrible. Perfectly worthy of publication. But it’s not going to get the attention of passersby. Let’s do the same thing in ggplot. We’ll make a new object with ‘category’ on the x-axis and add a boxplot geom on top of it.

 p2 <- ggplot(plotData, aes(category, y))
 p2 + geom_boxplot()
 

ggplotbox

Much better! We can again add colors. Note that adding colors automatically introduces a legend, which is redundant in this case, so we tack on an extra command to get rid of it.

 p2 + geom_boxplot(aes(color=category)) + theme(legend.position='none')
 

ggplotOutline
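
There are a couple of other ways to drop the legend that I’m aware of, sketched below. Both assume a reasonably recent ggplot2; in particular, guides(color = 'none') may not be accepted by older versions.

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

p2 <- ggplot(plotData, aes(category, y))

# Option 1: tell this one layer not to contribute to any legend
p_a <- p2 + geom_boxplot(aes(color = category), show.legend = FALSE)

# Option 2: switch off the guide for the color aesthetic only,
# leaving legends for other aesthetics (if any) in place
p_b <- p2 + geom_boxplot(aes(color = category)) + guides(color = 'none')

print(p_a)
print(p_b)
```

theme(legend.position='none') removes every legend on the plot, so these two are handy when you want to hide only one of several.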

I like it. Let’s switch it up and fill the boxplot with color instead of outlining it. The automatic fill is a little dark, so I’ll introduce some transparency with the ‘alpha’ argument.

 p2 + geom_boxplot(aes(fill=category), alpha=0.5) + theme(legend.position='none')
 

ggplotBoxFill

Now here is a boxplot that is super informative. It tells us the distribution of the data within each category (compare that to a bar chart). We know the high and low values and any outliers. It’s attractive, with some nice colors, so that it’ll grab the interest of people scanning through. Granted, it would still require work to become ready for publication, but it’s a very good start.
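
If you want the numbers behind the boxes, base R can report them directly. A sketch using fivenum() and boxplot.stats(); note that these use Tukey’s hinges, while ggplot2 computes quartiles a little differently, so the box edges may not match exactly.

```r
# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

# Split y by category, then summarise each group.
# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
yByCategory <- split(plotData$y, plotData$category)
summaryByCategory <- sapply(yByCategory, fivenum)
rownames(summaryByCategory) <- c('min', 'lower hinge', 'median',
                                 'upper hinge', 'max')
summaryByCategory

# boxplot.stats() lists the points that would be flagged as outliers
# (by default, more than 1.5 box lengths beyond the hinges)
lapply(yByCategory, function(v) boxplot.stats(v)$out)
```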

As a final note, the thing I find most appealing about ggplot is that it works in layers, and geoms can be added on top of one another. They can be combined in any way you see fit. Layers are added in the order they are called. For example, we can make a boxplot overlaid with a scatterplot (not sure why you’d do this in actuality, but it’s a good example):


p2 + geom_boxplot(aes(fill=category), alpha=0.5) +
 geom_point(aes(color=category), size=3) +
 theme(legend.position='none')

geomBoxColScatt
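
To see that call order really is drawing order, here is a sketch that builds the same plot both ways and lists the layers stored in each object. As before, reading `$layers` relies on how ggplot2 stores things internally, so take it as a demonstration only.

```r
library(ggplot2)

# Recreate the simulated data so this block runs on its own
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

p2 <- ggplot(plotData, aes(category, y))

# Boxes first, points second: the points are drawn over the boxes
boxesThenPoints <- p2 +
  geom_boxplot(aes(fill = category), alpha = 0.5) +
  geom_point(aes(color = category), size = 3)

# Points first, boxes second: the boxes are drawn over the points
pointsThenBoxes <- p2 +
  geom_point(aes(color = category), size = 3) +
  geom_boxplot(aes(fill = category), alpha = 0.5)

# Helper (my own, for illustration) that lists a plot's geoms in stacking order
layerOrder <- function(plotObject) {
  unname(sapply(plotObject$layers, function(l) class(l$geom)[1]))
}
layerOrder(boxesThenPoints)
layerOrder(pointsThenBoxes)
```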

I’ll hopefully keep posting on making graphs visually appealing. And I’ll most likely be making extensive use of ggplot2 to do it.

I’m also thinking of starting up a call for your best and most artistic graphs. Visually appealing science. I’d like to get some posts of graphs that people have made in their own research that can be considered blending art and science to grab the attention of non-scientists. I haven’t figured out the format just yet, but I’m interested in your thoughts.


10 thoughts on “Making Data Visually Appealing”

  1. Nathan,

    I share your vision that data needs to be presented in a manner that is both factually correct and visually appealing. I would also add that interactivity, where readers get to pose and answer their own questions, is another key component. The human-computer interface is where our biggest bottleneck is and the creation of engaging data graphics only solves the problem in the computer –> human direction. Making data easy to work with through intuitive interfaces helps in the human –> computer direction. When both of these are constructed, the ability of people to go from data to information to knowledge is pretty amazing.

    I’ve been trying to navigate these issues for years and encourage you to have a look at some of the databrowsers we’ve been building. Comments are always welcome.

    http://mazamascience.com/databrowsers.html

    Best of luck creating beautiful graphics that make your data both more informative and more engaging at the same time!

    Jon

    • Thanks! I agree, interactive maps that let users sort through data and make their own charts are wonderful. I, personally, could spend all day on those. I’ll check out your website in depth tomorrow, but as an example of excellent interactive maps where users can pose their own questions, check out Gapminder:

      http://www.gapminder.org/

      If you’re interested, check out this TED talk by Hans Rosling, which demonstrates how beautiful, interactive graphs can be awesomely informative and awe inspiring:

    • Yep you could do that. I personally like the contrast of black lines and colored boxes because it makes it easier for me to read. But that’s just a personal preference. The nice thing about R graphics (including base, lattice, and ggplot) is that there’s always a way to do whatever you imagine.

  2. This is a big deal for me, since I work in a hybrid operations/research environment. I would like to add to the discussion with two things I have found in my experience:

    1) Good-looking data visualization is not just for non-scientists; it’s for everyone. I have been able to substantially impact both research and operations projects by displaying the data in ways that enhance the understanding of the data. I’m not just talking about colors, lines, and backgrounds. I’m talking about displaying the information in a visually interesting way that also encourages understanding of the data and what conclusions are appropriate.

    2) Transforming the data is just as important as transforming the plot. The same data can be displayed multiple ways accurately, and manual transformation is usually necessary to make an interesting plot. For instance, time durations can be assigned meaningful categories. Standing in line could be broken into categories of “no wait,” “short wait,” and “long wait” without being any less correct, but the visualizations you create could convey much more meaning to your research group or clients. “Hey, look. Window 3 processed 30 people and is all long waits, and window 5 processed the same number with short waits. Is window 3 getting a different group of people in line, or is this just a propagating delay in getting window 3 running at the beginning of the shift?” You wouldn’t see that in a boxplot or scatterplot of just wait durations.

    For reference: I am experienced in R scripting and mathematics, so what I suggest may not be appropriate for everyone. I do not use ggplot, because I script grid objects directly. This does, in some ways, change how I approach the data, so your mileage may vary.

    • I agree with you whole-heartedly. Good data visualization isn’t just the aesthetics. It’s what you present. So figuring out how to break down the data in the most understandable way is even more important than making a pretty graph. All the colors in the world won’t help make confusing data less confusing.

  3. Some sentences seem to be trimmed off? For instance right below Scatterplot it says “Let’s simulate some data. We’ll make a dataframe. I make a dataframe for two main reasons: 1) Most of the time when you’re working with data, it’ll be in a” and then just followed by the codes?
