Why Do the New Orleans Saints Lose? Data Visualization II

I’m going to continue with my ‘making data visually appealing to the masses’ kick. I happen to like graphics and graphing data. I also happen to like American football (for the record, however, I’m a soccer player first, a rugby player second, an Aussie rules player third, and an American football player never). Specifically, being from the area, I am a big New Orleans Saints fan. That said, they weren’t exactly lighting it up this year. In fact, for the first half of the season, they were downright horrible.

I like looking for trends in data, and I like football, so I put the two things together to see if there were potential explanations. I went through ESPN box scores and collected a few cursory statistics for each game (yards allowed, yards gained, pass attempts, rush attempts, etc). Granted, this is the most shallow analysis of football stats ever, but it’s a great vehicle for data presentation. You can find the data here.
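If you don't have the file handy, the code in this post can still be tried on a stand-in data frame. The sketch below assumes the column names used later in the post (Week, Result, ydsAllowed, ydsGained, passAtt, rushAtt, rushYds) and assumes 15 games numbered 1 to 15; every value is simulated and has nothing to do with the Saints' actual numbers.

```r
# Stand-in for the real data: column names match the code in this post,
# but every value below is simulated, NOT taken from the ESPN box scores.
set.seed(1)
nGames <- 15  # assumption: one row per game, weeks numbered 1..15

SaintsStats <- data.frame(
  Week       = 1:nGames,
  Result     = factor(sample(c('L', 'W'), nGames, replace = TRUE),
                      levels = c('L', 'W')),
  ydsAllowed = round(rnorm(nGames, 400, 60)),
  ydsGained  = round(rnorm(nGames, 400, 60)),
  passAtt    = round(rnorm(nGames, 40, 6)),
  rushAtt    = round(rnorm(nGames, 25, 5)),
  rushYds    = round(rnorm(nGames, 100, 30))
)

str(SaintsStats)
```

Any conclusions drawn from this placeholder are meaningless; it only exists so the plotting calls have something to chew on.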

This post will continue using graphics to explore data and using ‘ggplot2’ to make visually appealing graphs. I’ve discovered ways to improve my workflow, mess with colors, and other tricks.

In my last post, I said that you use aes( ) to specify the x and y variables in the original ggplot call. For example:

p <- ggplot(SaintsStats, aes(Result, ydsAllowed))
p + geom_boxplot()

That’s useful if you know exactly what you want to graph beforehand. You can change the variables by resetting aes( ) in the geom_boxplot( ) call. There are times, like this one, during exploratory data analysis that you might want to make many graphs of many different relationships (walking a fine line between EDA and data dredging). If that’s the case, then the original call can be just the dataframe.

p <- ggplot(SaintsStats)

In this case, any geom object MUST have x and y variables specified by aes( ). For example, I’m inclined to believe that the Saints give up more yardage when they lose. I can check this with a boxplot:

p + geom_boxplot(aes(Result, ydsAllowed))

graph1

Doesn’t look to be the case. Let’s focus on the graph properties for a second. I’m down with the colors, I like them for the most part. I have a couple of tweaks I’d like to make. The axis titles and text need to be bigger, they’re hard to read. I’d like the axis text in black. I also want to put a black box around the plot panel. You can change these settings using the theme( ) command as follows:

p + geom_boxplot(aes(Result, ydsAllowed)) +
 theme(axis.text.y=element_text(size=14, color='black'),
 axis.text.x=element_text(size=14, color='black'),
 axis.title.x=element_text(size=16),
 axis.title.y=element_text(size=16),
 panel.background=element_rect(color='black'))

There are MANY elements you can control directly and each element has a large number of things you can modify. For example, text elements can be altered by size, color, font face (bold, italics), angle, etc. I’ll leave it to you to find the specifics. The end result is a much nicer graph.

graph2
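As a starting point for those specifics, here is a minimal sketch of a few element_text( ) arguments (face, angle, hjust) on a tiny made-up data frame. The data frame `d` and its values are purely illustrative.

```r
library(ggplot2)

# Small made-up data set so the example runs on its own
d <- data.frame(group = c('a', 'a', 'b', 'b'), value = c(1, 3, 2, 5))

g <- ggplot(d, aes(group, value)) +
  geom_boxplot() +
  theme(
    axis.title.x = element_text(face = 'bold', size = 16),    # bold axis title
    axis.title.y = element_text(face = 'italic', size = 16),  # italic axis title
    # slanted tick labels, right-justified so they end under the tick
    axis.text.x  = element_text(angle = 45, hjust = 1, color = 'black')
  )
g
```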

I’ve noticed something else. The x- and y-axis titles are too close to the axis text. ‘ydsAllowed’ is almost right on top of the text. I want to scoot it back a little using vjust as follows. The actual value of vjust is something you’ll have to manually play with for your own graphics device and plot size.

p + geom_boxplot(aes(Result, ydsAllowed)) +
 theme(axis.text.y=element_text(size=14, color='black'),
 axis.text.x=element_text(size=14, color='black'),
 axis.title.x=element_text(size=16, vjust=0.2),
 axis.title.y=element_text(size=16, vjust=0.2),
 panel.background=element_rect(color='black'))

graph3

Much better!
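A caveat if you are on a more recent ggplot2: as far as I can tell, vjust has little visible effect on axis titles there, and the gap is usually controlled with the margin argument of element_text( ) instead. The sketch below assumes a ggplot2 version that provides margin( ), and uses a made-up data frame; the 10-point values are arbitrary and will need the same manual fiddling.

```r
library(ggplot2)

# Small made-up data set so the example runs on its own
d <- data.frame(group = c('a', 'a', 'b', 'b'), value = c(1, 3, 2, 5))

g <- ggplot(d, aes(group, value)) +
  geom_boxplot() +
  theme(
    # t = top margin: pushes the x-axis title down, away from the tick labels
    axis.title.x = element_text(size = 16, margin = margin(t = 10)),
    # r = right margin: pushes the y-axis title left, away from the tick labels
    axis.title.y = element_text(size = 16, margin = margin(r = 10))
  )
g
```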

But here’s the thing. I’m lazy. I’m also convinced laziness is a great trait for a programmer because it means you look for ways to be efficient. I don’t want to have to type all of that theme( ) nonsense for every graph I make, and I’ll be making a lot because I have a number of relationships to examine. I also want these same theme settings regardless of boxplot, line graph, scatterplot, etc. You can ‘update’ the theme in ggplot to do this:

theme_old <- theme_update(
 axis.text.y=element_text(size=14, color='black'),
 axis.text.x=element_text(size=14, color='black'),
 axis.title.x=element_text(size=16, vjust=0.2),
 axis.title.y=element_text(size=16, vjust=0.2, angle=90),
 panel.background=element_rect(color='black', fill='grey90')
 )

Notice that panel.background now has the ‘fill’ argument. When you update the current theme, you’re overwriting the OLD setting for that element, so essentially you’re replacing the old panel.background with the new one. In the version of ggplot2 I’m using, leaving out the ‘fill’ argument gives the theme a blank background (i.e. no fill); newer versions merge the new arguments into the existing element instead, so the old fill may simply carry over. I happen to like the grey, so I stuck it in there. Any graphs you make will now have these default settings, and you can continue to use theme( ) to tweak things for specific graphs. If you want to revert to the default grey theme at any time, type theme_set(theme_grey()).
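Since theme_update( ) hands back the theme that was active before the change (which is why the result is stored in theme_old above), you can also restore exactly what you had rather than resetting to theme_grey(). A minimal sketch, with a single arbitrary element changed:

```r
library(ggplot2)

# theme_update() invisibly returns the theme that was active BEFORE the update
theme_old <- theme_update(axis.title.x = element_text(size = 16))

# ... make plots with the modified theme here ...

# Put the previous theme back
theme_set(theme_old)
```

This matters mostly if you had already customized the theme earlier in the session and don't want to lose those settings.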

Now I can examine a large number of relationships my lazy way, with the theme already set:

p + geom_boxplot(aes(Result, ydsAllowed))
p + geom_boxplot(aes(Result, ydsGained))
p + geom_boxplot(aes(Result, rushAtt))
p + geom_boxplot(aes(Result, rushYds))
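For the truly lazy, the same set of boxplots could be generated in a loop. This is a sketch, not what was run for this post: it assumes a ggplot2 recent enough to support the `.data` pronoun inside aes( ), and it uses simulated stand-in data rather than the real box-score numbers.

```r
library(ggplot2)

# Simulated stand-in data (NOT the real box-score numbers)
set.seed(1)
SaintsStats <- data.frame(
  Result     = factor(rep(c('L', 'W'), length.out = 15)),
  ydsAllowed = rnorm(15, 400, 60),
  ydsGained  = rnorm(15, 400, 60),
  rushAtt    = rnorm(15, 25, 5),
  rushYds    = rnorm(15, 100, 30)
)

p <- ggplot(SaintsStats)
vars <- c('ydsAllowed', 'ydsGained', 'rushAtt', 'rushYds')

# One boxplot per variable; .data[[v]] looks the column up by its name
plots <- lapply(vars, function(v) {
  p + geom_boxplot(aes(Result, .data[[v]])) + ylab(v)
})
names(plots) <- vars

plots$rushAtt  # print any one of them
```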

What’s this? It looks like the Saints rush more when they win!

graph4

That’s a great exploratory graph, but it’s nothing I want to present to the public. It’s kind of bland. I want to make a few changes: 1) plot the results by week, 2) color code the points for wins and losses, 3) change the axis labels, 4) change the legend key to say ‘Loss’ and ‘Win’ rather than ‘L’ and ‘W’, and 5) remove the minor gridlines between weeks because they are meaningless in this case. I also want to report the percent of total plays that are rushing. I do this for a couple of reasons: 1) most people (I think) find percentages easier to grasp than proportions because the numbers are easier to understand (50% vs. 0.5), and 2) I want to remove any possible effect of increased number of offensive plays in general. Fortunately, we can transform this in the ggplot call. The full code is as follows:

p + geom_line(aes(Week, rushAtt/(passAtt+rushAtt)*100), linetype=2) + # Make the line
 geom_point(aes(Week, rushAtt/(passAtt+rushAtt)*100, fill=Result), shape=21, size=10) + # Make the points
 ylab('Percent of Rushing Play Calls') + # y-axis label
 xlab('Week') + # x-axis label
 scale_x_continuous(breaks=1:15) + # x-axis tick marks
 scale_fill_hue(labels=c('Loss', 'Win')) + # legend key definitions
 theme(
 panel.grid.minor=element_blank() # remove minor gridlines
 )

graph5

Now that’s a graph I’d be proud of. It shows many things: the weekly wins and losses, the percentage of rushing play calls, and it’s easy to see that the Saints win when there is a greater percentage of rushing plays! When the Saints rely too heavily on Brees and become one-dimensional, they lose (but see below).
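The y-axis value in that plot is computed on the fly inside aes( ). If you'd rather not repeat the formula in every geom, the same percentage could be stored once as a column. A small worked sketch with placeholder play counts (not real stats); the column name rushPct is my own invention:

```r
# Placeholder play counts for three games (made-up numbers, not real stats)
games <- data.frame(
  Week    = 1:3,
  passAtt = c(45, 30, 38),
  rushAtt = c(15, 30, 22)
)

# Percent of play calls that were rushes: rushes / (passes + rushes) * 100
games$rushPct <- games$rushAtt / (games$passAtt + games$rushAtt) * 100
games$rushPct
```

The first game has 15 rushes out of 60 plays (25%), the second 30 out of 60 (50%). With the column in place, the geoms could simply use aes(Week, rushPct).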

Notice that, as I said above, I can update my new theme with a second theme( ) call to remove the gridlines, just as I would ordinarily. There’s also one more thing. I’m not crazy about the near-pastel colors. I want bold, eye-catching colors. You can modify the code to set the color palette or define one yourself. I prefer to use preset color palettes, and ggplot2 will accept any named palettes from the RColorBrewer package. Set1 provides bold colors that differ distinctly among levels, and would be good for discrete groups.

p + geom_line(aes(Week, rushAtt/(passAtt+rushAtt)*100), linetype=2) +
 geom_point(aes(Week, rushAtt/(passAtt+rushAtt)*100, fill=Result), shape=21, size=10) +
 ylab('Percent of Rushing Play Calls') +
 xlab('Week') +
 scale_x_continuous(breaks=1:15) +
 scale_fill_brewer(labels=c('Loss', 'Win'), palette='Set1') + # Notice the change from above to scale_fill_brewer
 theme(
 panel.grid.minor=element_blank()
 )

graph6

Voila!
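For the other route mentioned above, defining a palette yourself, scale_fill_manual( ) takes a vector of colors. A minimal sketch on made-up data; the two colors are arbitrary picks, and the rushPct column is a hypothetical precomputed percentage:

```r
library(ggplot2)

# Made-up stand-in data (NOT real results)
d <- data.frame(
  Week    = 1:6,
  rushPct = c(35, 42, 38, 40, 33, 44),
  Result  = factor(c('L', 'W', 'W', 'L', 'W', 'L'), levels = c('L', 'W'))
)

# Named vector: factor level -> color
resultCols <- c(L = 'firebrick', W = 'steelblue')

g <- ggplot(d) +
  geom_line(aes(Week, rushPct), linetype = 2) +
  geom_point(aes(Week, rushPct, fill = Result), shape = 21, size = 10) +
  scale_fill_manual(values = resultCols, labels = c('Loss', 'Win'))
g
```

Naming the color vector by factor level keeps the mapping from silently flipping if the level order changes.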

Side note: I mentioned a caveat above. I can’t strictly interpret this as ‘the Saints win when they rush more’. It could just as easily be ‘the Saints rush more when they are winning’. This is also a cursory look at their stats that neglects turnovers and other crucial information (aside from the Madden-esque ‘the Saints win by outscoring the other team’ trend).

Making Data Visually Appealing

I’ve recently been considering the graphical presentation of data. I get the feeling that we, ecologists/scientists, could be better at data presentation. Graphs must be informative, but they don’t have to be ugly. I think that making visually appealing charts and graphs goes a long way towards making science accessible. So my posts will concentrate on making what I consider to be informative AND visually appealing graphs. That means NO bar graphs…

This first blog post will describe my recent foray into a new graphics package in R, ggplot2. Actually, ggplot2 isn’t new at all, but it’s new to me. I’ve been incredibly happy with R’s base graphics and the lattice extensions. Considering that ggplot2 required learning a whole new syntax, I never really saw the advantage. However, I’ve recently begun playing with it because I’m interested in making rank clocks (see Collins et al. 2008 in Ecology). The reason I like ggplot is that it makes very appealing graphs incredibly easily. As I’ve begun messing with it, I’ve discovered ggplot is fairly simple to get a hold of to make basic plots. Of course, like anything in R, making more complicated plots gets more, well, complicated quite quickly. I’ve been reading through the book on ggplot by Wickham. It’s very comprehensive, but also quite dense. So over the next few posts (hopefully), I’ll describe my foray into ggplot2 and what I’ve learned.

We’ll concentrate on two chart types: scatterplots (continuous x and continuous y) and boxplots (categorical x and continuous y). I like boxplots for a variety of reasons, including the fact that they are more informative, less wasteful, and prettier than bar charts, which are (in my opinion) as uninformative a graph as you can make. Don’t get me wrong, bar charts have uses, but they are limited.

SCATTERPLOT

Let’s simulate some data. We’ll make a dataframe. I make a dataframe for two main reasons: 1) Most of the time when you’re working with data, it’ll be in a

category <- rep(c('A','B','C'), each=10)
x <- runif(30, 1, 15)
y <- 5 + 2*x + rnorm(30, 0, 2)
plotData <- data.frame(category, x, y)

Here’s the basic R scatterplot with a few adjustments.

par(mar=c(4,4,1,1)+0.2)
plot(x,y,pch=16)

Rscatterbasic

This is about as basic as it gets in R. Not bad. Worthy of publication in a journal with a few more tweaks. But this isn’t going to catch any eyes. The ggplot2 scatterplot, on the other hand, will. First, let’s get a hold of the syntax. The first thing you do is make a ggplot object (I’m going to ignore the quick plot, qplot( ), command because it’s better to learn the full syntax).

p <- ggplot(plotData, aes(x,y))

This just says: ‘Make a ggplot object. The data is stored in plotData, variable ‘x’ is on the x-axis, and variable ‘y’ is on the y-axis’. That’s it. It has not actually specified what KIND of plot. ggplot has several kinds of plots, called geometries, or geoms, including bar (geom_bar), scatterplot (geom_point), box (geom_boxplot), and others. The intuitive thing about ggplot is that it operates in layers. Next we can say ‘take the ggplot object and put a scatterplot layer on it’.

 p + geom_point()
 

ggplot1

Already that’s better looking. We can adjust the size of the points:

 p + geom_point(size=3)
 

There are a number of other variables to adjust, but I’ll leave that to you.
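To get you started, here is a sketch with a few of those other settings (shape, color, alpha). The data is re-simulated with an arbitrary seed so the block runs on its own, and the particular values are just examples:

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
plotData <- data.frame(x = runif(30, 1, 15))
plotData$y <- 5 + 2 * plotData$x + rnorm(30, 0, 2)

p <- ggplot(plotData, aes(x, y))

# Fixed (non-data) settings go outside aes():
# shape 17 = filled triangle, alpha 0.6 = slightly transparent
g <- p + geom_point(size = 3, shape = 17, color = 'darkblue', alpha = 0.6)
g
```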

Here’s where ggplot gets better. It’s very easy to group the points. In R, it takes some work, you have to specify colors manually based on the grouping factor, etc. In lattice, it’s easier, but it takes a lot of work to get the plot visually pleasing (the default colors in lattice are.. interesting).
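For comparison, this is roughly what the manual route looks like in base graphics: a sketch that maps each category to a hand-picked color and builds the legend by hand. The colors and the seed are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)

# Map each group to a color by hand
groupCols <- c(A = 'red', B = 'blue', C = 'darkgreen')
pointCols <- groupCols[category]  # one color per point, looked up by group name

plot(x, y, pch = 16, col = pointCols)
legend('topleft', legend = names(groupCols), col = groupCols, pch = 16)
```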

In ggplot, all we have to do is add in a ‘color’ argument.

 p + geom_point(aes(color=category), size=3)
 

ggplotScatterColor

The color argument goes inside the aes() call because you’re mapping a variable to an aesthetic of the graph. That’s a much nicer plot.

Adding in a linear regression line is as simple as adding in another geom:

 p + geom_point(aes(color=category), size=3) + geom_smooth(method='lm')
 

ggplotScatterLine

With only a tad bit of effort we have a good looking scatterplot! Granted, it could use a little dusting up, but you can see the difference between the base graphics’ and ggplot2’s default graphs.
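If you want the numbers behind that regression line, geom_smooth(method='lm') with its default formula fits y on x, so an ordinary lm( ) call on the same data should give the matching intercept and slope. A minimal sketch on re-simulated data (arbitrary seed):

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)

fit <- lm(y ~ x)
coef(fit)  # intercept and slope; should land near the simulated 5 and 2
```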

BOXPLOT

Here’s R’s default boxplot:

 boxplot(y ~ category)
 

rboxplot

Again, it’s not terrible. Perfectly worthy of publication. But it’s not going to get the attention of passersby. Let’s do the same thing in ggplot. We’ll make a new object with ‘category’ on the x-axis and add a boxplot geom on top of it.

 p2 <- ggplot(plotData, aes(category, y))
 p2 + geom_boxplot()
 

ggplotbox

Much better! We can again add colors. Note that adding colors automatically introduces a legend, which is redundant in this case, so we tack on an extra command to get rid of it.

 p2 + geom_boxplot(aes(color=category)) + theme(legend.position='none')
 

ggplotOutline

I like it. Let’s switch it up and fill the boxplot with color instead of outlining it. The automatic fill is a little dark, so I’ll introduce some transparency with the ‘alpha’ argument.

 p2 + geom_boxplot(aes(fill=category), alpha=I(0.5)) + theme(legend.position='none')
 

ggplotBoxFill

Now here is a boxplot that is super informative. It tells us the distribution of the data within each category (compare that to a bar chart). We know the high and low values and any outliers. It’s attractive, with some nice colors, so that it’ll grab the interest of people scanning through. Granted, it would still require work to become ready for publication, but it’s a very good start.
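To see the numbers a boxplot summarizes, the per-category quantiles can be computed directly. This sketch re-simulates the data with an arbitrary seed. Note that the box edges and median correspond to the 25%, 50% and 75% rows, while the whiskers only reach the 0% and 100% values when no points are flagged as outliers.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
category <- rep(c('A', 'B', 'C'), each = 10)
x <- runif(30, 1, 15)
y <- 5 + 2 * x + rnorm(30, 0, 2)

# Rows: 0%, 25%, 50%, 75%, 100%; columns: one per category
q <- sapply(split(y, category), quantile, probs = c(0, 0.25, 0.5, 0.75, 1))
round(q, 1)
```

A bar chart of means would collapse each of those columns into a single number.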

As a final note, the thing I think I find most appealing about ggplot is that it works in layers and geoms can be added on top of one another. They can be combined in any way you see fit. Layers are added in the order they are called. For example, we can make a boxplot overlain with a scatterplot (not sure why you’d do this in actuality, but it’s a good example):


p2 + geom_boxplot(aes(fill=category), alpha=I(0.5)) +
 geom_point(aes(color=category), size=3) +
 theme(legend.position='none')

geomBoxColScatt
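You can check the layer order by peeking inside the plot object. This sketch leans on the object's `layers` field, which is an internal detail that could change between ggplot2 versions, so treat it as a curiosity rather than a stable interface. The data is re-simulated with an arbitrary seed.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
plotData <- data.frame(category = rep(c('A', 'B', 'C'), each = 10),
                       x = runif(30, 1, 15))
plotData$y <- 5 + 2 * plotData$x + rnorm(30, 0, 2)

p2 <- ggplot(plotData, aes(category, y))
g <- p2 + geom_boxplot(aes(fill = category), alpha = 0.5) +
  geom_point(aes(color = category), size = 3)

# Class of the geom in each layer, in drawing order (first = bottom)
layerGeoms <- vapply(g$layers, function(l) class(l$geom)[1], character(1))
layerGeoms
```

Swapping the two geom calls would draw the points first and the boxes over them.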

I’ll hopefully keep posting on making graphs visually appealing. And I’ll most likely be making extensive use of ggplot2 to do it.

I’m also thinking of starting up a call for your best and most artistic graphs. Visually appealing science. I’d like to get some posts of graphs that people have made in their own research that can be considered blending art and science to grab the attention of non-scientists. I haven’t figured out the format just yet, but I’m interested in your thoughts.