Why Do the New Orleans Saints Lose? Data Visualization II

I’m going to continue with my ‘making data visually appealing to the masses’ kick. I happen to like graphics and graphing data. I also happen to like American football (for the record, however, I’m a soccer player first, a rugby player second, an Aussie rules player third, and an American football player never). Specifically, being from the area, I am a big New Orleans Saints fan. That said, they weren’t exactly lighting it up this year. In fact, for the first half of the season, they were downright horrible.

I like looking for trends in data, and I like football, so I put the two things together to see if there were potential explanations. I went through ESPN box scores and collected a few cursory statistics for each game (yards allowed, yards gained, pass attempts, rush attempts, etc.). Granted, this is the most shallow analysis of football stats ever, but it’s a great vehicle for data presentation. You can find the data here.

This post will continue using graphics to explore data and using ‘ggplot2’ to make visually appealing graphs. I’ve discovered ways to improve my workflow, mess with colors, and other tricks.

In my last post, I said that you use aes( ) to specify the x and y variables in the original ggplot call. For example:

p <- ggplot(SaintsStats, aes(Result, ydsAllowed))
p + geom_boxplot()

That’s useful if you know exactly what you want to graph beforehand. You can change the variables by resetting aes( ) in the geom_boxplot( ) call. There are times, like this one, during exploratory data analysis that you might want to make many graphs of many different relationships (walking a fine line between EDA and data dredging). If that’s the case, then the original call can be just the dataframe.

p <- ggplot(SaintsStats)

In this case, any geom object MUST have x and y variables specified by aes( ). For example, I’m inclined to believe that the Saints give up more yardage when they lose. I can check this with a boxplot:

p + geom_boxplot(aes(Result, ydsAllowed))

graph1

Doesn’t look to be the case. Let’s focus on the graph properties for a second. I’m down with the colors; I like them for the most part. I have a couple of tweaks I’d like to make. The axis titles and text need to be bigger because they’re hard to read. I’d like the axis text in black. I also want to put a black box around the plot panel. You can change these settings using the theme( ) command as follows:

p + geom_boxplot(aes(Result, ydsAllowed)) +
 theme(axis.text.y=element_text(size=14, color='black'),
 axis.text.x=element_text(size=14, color='black'),
 axis.title.x=element_text(size=16),
 axis.title.y=element_text(size=16),
 panel.background=element_rect(color='black'))

There are MANY elements you can control directly and each element has a large number of things you can modify. For example, text elements can be altered by size, color, font face (bold, italics), angle, etc. I’ll leave it to you to find the specifics. The end result is a much nicer graph.

graph2

I’ve noticed something else. The x- and y-axis titles are too close to the axis text. ‘ydsAllowed’ is almost right on top of the text. I want to scoot it back a little using vjust as follows. The actual value of vjust is something you’ll have to manually play with for your own graphics device and plot size.

p + geom_boxplot(aes(Result, ydsAllowed)) +
 theme(axis.text.y=element_text(size=14, color='black'),
 axis.text.x=element_text(size=14, color='black'),
 axis.title.x=element_text(size=16, vjust=0.2),
 axis.title.y=element_text(size=16, vjust=0.2),
 panel.background=element_rect(color='black'))

graph3

Much better!

But here’s the thing. I’m lazy. I’m also convinced laziness is a great trait for a programmer because it means you look for ways to be efficient. I don’t want to have to type all of that theme( ) nonsense for every graph I make, and I’ll be making a lot because I have a number of relationships to examine. I also want these same theme settings regardless of boxplot, line graph, scatterplot, etc. You can ‘update’ the theme in ggplot to do this:

theme_old <- theme_update(
 axis.text.y=element_text(size=14, color='black'),
 axis.text.x=element_text(size=14, color='black'),
 axis.title.x=element_text(size=16, vjust=0.2),
 axis.title.y=element_text(size=16, vjust=0.2, angle=90),
 panel.background=element_rect(color='black', fill='grey90')
 )

Notice that panel.background now has the ‘fill’ argument. When you update the current theme, each element you specify overwrites the old definition of that element, so the new panel.background replaces the old one entirely. In the version of ggplot2 I’m using, leaving out ‘fill’ gives the theme a blank background (i.e. no fill); newer releases of theme_update( ) may instead keep the settings you don’t specify, so check the documentation for your version. I happen to like the grey, so I stuck it in there. Any graphs you make will now have these default settings, and you can continue to use theme( ) to tweak things for specific graphs. If you want to revert to the default grey theme at any time, type theme_set(theme_grey()).
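Since theme_update( ) hands back the theme that was active before the change, saving it (as theme_old does above) also gives you a way to restore exactly what you had, rather than assuming it was the default grey theme. A minimal sketch, using only a couple of the settings from above:

```r
library(ggplot2)

# theme_update() returns the previously active theme, so keep a copy of it
theme_old <- theme_update(
  axis.title.x = element_text(size = 16),
  axis.title.y = element_text(size = 16, angle = 90)
)

# ... make as many plots as you like with the updated theme ...

# Put the saved theme back when you're done
theme_set(theme_old)
```

This matters mostly if you had already customized the theme earlier in the session; otherwise theme_set(theme_grey()) gets you to the same place.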

Now I can examine a large number of relationships my lazy way, with the theme already set:

p + geom_boxplot(aes(Result, ydsAllowed))
p + geom_boxplot(aes(Result, ydsGained))
p + geom_boxplot(aes(Result, rushAtt))
p + geom_boxplot(aes(Result, rushYds))
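If even that is too much typing, the same boxplots can be produced in a loop. The following is only a sketch: the numbers in the data frame are made up so the code runs on its own (they are not the real Saints stats), and it assumes a ggplot2 version recent enough to support the .data pronoun inside aes( ).

```r
library(ggplot2)

# Made-up placeholder values so the sketch is runnable; NOT real statistics.
# Only the column names match the ones used in this post.
SaintsStats <- data.frame(
  Result     = factor(c('L', 'L', 'W', 'L', 'W', 'W')),
  ydsAllowed = c(450, 420, 380, 440, 400, 390),
  ydsGained  = c(350, 370, 410, 360, 430, 400),
  rushAtt    = c(18, 20, 30, 22, 28, 33),
  rushYds    = c(60, 75, 140, 80, 120, 150)
)

# Columns to plot against Result
stat_cols <- c('ydsAllowed', 'ydsGained', 'rushAtt', 'rushYds')

# Build one boxplot per column; .data[[v]] looks the column up by name
plots <- lapply(stat_cols, function(v) {
  ggplot(SaintsStats) +
    geom_boxplot(aes(Result, .data[[v]])) +
    ylab(v)
})
names(plots) <- stat_cols

# Draw each plot in turn
invisible(lapply(plots, print))
```

Keeping the plots in a named list means you can pull one back up later, e.g. plots$rushAtt, without rebuilding it.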

What’s this? It looks like the Saints rush more when they win!

graph4

That’s a great exploratory graph, but it’s nothing I want to present to the public. It’s kind of bland. I want to make a few changes: 1) plot the results by week, 2) color code the points for wins and losses, 3) change the axis labels, 4) change the legend key to say ‘Loss’ and ‘Win’ rather than ‘L’ and ‘W’, and 5) remove the minor gridlines between weeks because they are meaningless in this case. I also want to report the percent of total plays that are rushing. I do this for a couple of reasons: 1) most people (I think) find percentages easier to grasp than proportions because the numbers are easier to understand (50% vs. 0.5), and 2) I want to remove any possible effect of an increased number of offensive plays in general. Fortunately, we can do this transformation right in the ggplot call. The full code is as follows:

p + geom_line(aes(Week, rushAtt/(passAtt+rushAtt)*100), linetype=2) + # Make the line
 geom_point(aes(Week, rushAtt/(passAtt+rushAtt)*100, fill=Result), shape=21, size=10) + # Make the points
 ylab('Percent of Rushing Play Calls') + # y-axis label
 xlab('Week') + # x-axis label
 scale_x_continuous(breaks=1:15) + # x-axis tick marks
 scale_fill_hue(labels=c('Loss', 'Win')) + # legend key definitions
 theme(
 panel.grid.minor=element_blank() # remove minor gridlines
 )

graph5

Now that’s a graph I’d be proud of. It shows many things: the weekly wins and losses, the percentage of rushing play calls, and it’s easy to see that the Saints win when there is a greater percentage of rushing plays! When the Saints rely too heavily on Brees and become one-dimensional, they lose (but see below).
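The percentage expression is typed twice in that call. One way around the repetition, sketched here with made-up numbers (not the real Saints stats) and a hypothetical column name rushPct, is to compute the percentage once and put the shared x and y in the original ggplot( ) call:

```r
library(ggplot2)

# Made-up placeholder values so the sketch is runnable; NOT real statistics
SaintsStats <- data.frame(
  Week    = 1:6,
  Result  = factor(c('L', 'L', 'W', 'L', 'W', 'W')),
  rushAtt = c(18, 20, 30, 22, 28, 33),
  passAtt = c(45, 42, 30, 40, 35, 32)
)

# rushPct is a hypothetical column name introduced for this sketch
SaintsStats$rushPct <- with(SaintsStats, rushAtt / (passAtt + rushAtt) * 100)

# x and y are set once and inherited by both geoms
g <- ggplot(SaintsStats, aes(Week, rushPct)) +
  geom_line(linetype = 2) +
  geom_point(aes(fill = Result), shape = 21, size = 10) +
  ylab('Percent of Rushing Play Calls')

print(g)
```

The trade-off is that this goes back to fixing the variables in the ggplot( ) call, which is fine once you know which graph you want to polish.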

Notice that, as I said above, I can update my new theme with a second theme( ) call to remove the gridlines, just as I would ordinarily. There’s also one more thing. I’m not crazy about the near-pastel colors. I want bold, eye-catching colors. You can modify the code to set the color palette or define one yourself. I prefer to use preset color palettes, and ggplot2 will accept any named palettes from the RColorBrewer package. Set1 provides bold colors that differ distinctly among levels, and would be good for discrete groups.

p + geom_line(aes(Week, rushAtt/(passAtt+rushAtt)*100), linetype=2) +
 geom_point(aes(Week, rushAtt/(passAtt+rushAtt)*100, fill=Result), shape=21, size=10) +
 ylab('Percent of Rushing Play Calls') +
 xlab('Week') +
 scale_x_continuous(breaks=1:15) +
 scale_fill_brewer(labels=c('Loss', 'Win'), palette='Set1') + # Notice the change from above to scale_fill_brewer
 theme(
 panel.grid.minor=element_blank()
 )

graph6

Voila!
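If none of the preset palettes suit you, defining the colors yourself is done with scale_fill_manual( ). A minimal sketch, with made-up data and color choices that are purely my own picks for illustration:

```r
library(ggplot2)

# Made-up placeholder values so the sketch is runnable; NOT real statistics
df <- data.frame(
  Week    = 1:4,
  Result  = factor(c('L', 'W', 'L', 'W')),
  rushPct = c(30, 45, 33, 48)
)

# Hypothetical color choices; any valid R color names or hex codes work
my_fill <- c(L = 'firebrick', W = 'gold')

g <- ggplot(df, aes(Week, rushPct)) +
  geom_point(aes(fill = Result), shape = 21, size = 10) +
  scale_fill_manual(values = my_fill, labels = c('Loss', 'Win'))

print(g)
```

Naming the elements of the color vector after the factor levels ties each color to a specific result, regardless of the order you list them in.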

Side note: I mentioned a caveat above. I can’t strictly interpret this as ‘the Saints win when they rush more’. It could just as easily be ‘the Saints rush more when they are winning’. This is also a cursory look at their stats that neglects turnovers and other crucial information (aside from the Madden-esque ‘the Saints win by outscoring the other team’ trend).


10 thoughts on “Why Do the New Orleans Saints Lose? Data Visualization II”

  1. Did you get the express written consent from Roger Goodell to use these stats? If not then he will probably suspend your cable service for all of next season so you can’t watch any football and then halfway through the season realize these data are public knowledge and he doesn’t have the right to suspend you.

    • To be honest, I don’t import xls files. I save them as .csv files and import that. I know that there is a way to import xls files via a package, but I don’t know which one because it doesn’t work on Mac, which is my OS.

      • Thank you. When I downloaded the data, it was an xls file, so I thought that was the data you imported.

    • The easiest is to use the RODBC library in R. For this particular file, you’ll need to rename the spreadsheet from [SaintsStats.csv] to [SaintsStats], because the “.” interferes with the sqlFetch command. I also find it easier to set the working directory before using RODBC, because it’s easier to debug. Note that you’ll need to replace all the Windows “\” with “/” for your folders. If you have trouble getting your code to work, make sure all of the names are correct, and make sure the xls file is saved and closed.

      # Load RODBC library.
      library(RODBC)

      # Set the working directory.
      setwd("C:/folder name/folder name")

      # Open a channel to the xls file.
      channel <- odbcConnectExcel("saintsstats.xls")

      # Fetch the data on the sheet [SaintsStats].
      saints_Data <- sqlFetch(channel, "SaintsStats")

      # Close the channel. VERY IMPORTANT!
      odbcClose(channel)

  2. Nathan, I like the idea you’re going for, and I think you can take it a little farther. To make the data more “visually appealing for the masses” you need to offload part of the burden of interpretation. Chronological ordering of points makes sense for control charts, but control charts are essentially univariate. What you have here is two variables: [rush call %] and [win/loss]. To see the relationship, you have to strain pretty hard with the current graph. Perhaps a simple reordering based on [rush call %] would illustrate your hypothesis much more quickly. Heck, if you do it that way, it can even be seen in a table:

    Win/Loss   Run call %
    W          49%
    W          48%
    W          48%
    W          42%
    W          41%
    W          39%
    L          38%
    L          36%
    L          36%
    L          35%
    L          34%
    W          32%
    L          32%
    L          29%
    L          16%

    In my line of work, I find that making data readable and understandable is maybe 20% aesthetics. The rest of it is:
    a) understanding the data [20%]
    b) conceptualizing a visual [5%]
    c) reshaping the data into a usable form for the chosen visual [45%]
    d) scripting up the visual [10%]

    I guess what I’m trying to say is, “Don’t let yourself be constrained by your data!”
