Shifting Baselines and the New Normal: Visual Evidence of Climate Change

I recently read this post by Randall Munroe. Needless to say, I was inspired, both by his message and his technique. The media hypes up the ‘polar vortex’ and shows images of the United States freezing from Miami to Minnesota. Unfortunately, this is all too often seized upon as evidence that climate change (i.e. global warming) isn’t real. The fact is, the extreme temperatures we saw this January used to be normal. They seem extreme now because they just don’t happen that often anymore.

What do I mean? Well, let’s look at Chicago’s historical weather data I downloaded from a NOAA weather station.

[Figure: Chicago temperature records. Brrrr? Well... not really.]

The number of days with a low temperature below 0˚F has fallen dramatically since the 1950s and 1960s (plot on the left). What’s more, the number of days with a HIGH temperature below 0˚F has fallen from around 35 frigid days per year to next to none. The polar vortex seems extreme now, in 2014, but would have been just another January day in 1950s Chicago.

That’s cold, and people in sunny Florida, like me, might not be able to fathom such extreme lows. What I can fathom, however, is extreme heat. Unfortunately that is becoming more and more common in Miami.

[Figure: Miami temperature records]

The number of days with a low temperature above 70˚F is increasing rapidly. It used to be that lows were above 70˚F for only 50% of the year; now they’re above 70˚F for 66% of the year. Similarly, the number of days with a high above 90˚F is climbing steadily (aside from 1951, which may have had an anomalous heat wave). Bad news for snowbirds: they may have to start heading to Cuba or Jamaica to find the pleasant temperatures that popularized Miami as a vacation destination and winter retreat during the mid-20th century.
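For anyone who wants to reproduce this kind of figure, both the Chicago and Miami counts boil down to the same operation: group daily station records by year and count the days beyond a temperature threshold. Here is a minimal pandas sketch, assuming a hypothetical CSV of daily NOAA records with DATE, TMIN, and TMAX columns already in ˚F (real NOAA/GHCN downloads use different file layouts and units, so adjust accordingly):

import pandas as pd

# hypothetical file and column names; adjust to the actual NOAA download
daily = pd.read_csv('chicago_daily.csv', parse_dates=['DATE'])
daily['year'] = daily['DATE'].dt.year

def days_per_year(df, column, threshold, below=True):
    # count the number of days in each year beyond a temperature threshold
    beyond = df[column] < threshold if below else df[column] > threshold
    return df[beyond].groupby('year').size()

# e.g., days with a low below 0 F (Chicago) or a low above 70 F (Miami)
cold_lows = days_per_year(daily, 'TMIN', 0, below=True)
warm_lows = days_per_year(daily, 'TMIN', 70, below=False)

Plotting those yearly counts against year is then a one-liner in pylab.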

Because climate change is slow, our perception of normal shifts with it. Nowadays, the polar vortex is seen as a freak weather event, when it used to be the norm. Blistering hot days in Miami are now the status quo, when they used to be much rarer. Climate change requires a historical perspective; otherwise, we’re in trouble.

PyStan: A Basic Tutorial of Bayesian Data Analysis in Python

I’ve been waiting for PyStan for a little while, and it has arrived. The STAN team has posted great examples of the STAN modeling language and very brief examples of how to run PyStan. However, their examples stop just after model fitting. They do not include the full runthrough of plotting predictions, traceplots of specific parameters, etc. So I thought I’d do a blog post of basic linear regression in PyStan, detailing how to code the model, fit the model, check the model, and plot the results. The whole shebang. So without further ado:

# module import
import pystan
import numpy as np
import pylab as py
import pandas as pd

## data simulation
x = np.arange(1, 100, 5)
y = 2.5 + .5 * x + np.random.randn(20) * 10

# get number of observations
N = len(x)

# plot the data
py.plot(x,y, 'o')
py.show()

# STAN model (this is the most important part)
regress_code = """
data {
 int<lower = 0> N; // number of observations
 real y[N]; // response variable
 real x[N]; // predictor variable
}
parameters {
 real a; // intercept
 real b; // slope
 real<lower=0> sigma; // standard deviation
}
transformed parameters {
 real mu[N]; // fitted values

 for(i in 1:N)
  mu[i] <- a + b*x[i];
}
model {
 y ~ normal(mu, sigma);
}
"""

# make a dictionary containing all data to be passed to STAN
regress_dat = {'x': x,
 'y': y,
 'N': N}

# Fit the model
fit = pystan.stan(model_code=regress_code, data=regress_dat,
 iter=1000, chains=4)

# model summary
print fit

# show a traceplot of ALL parameters. This is a bear if you have many parameters
fit.traceplot()
py.show()

# Instead, show a traceplot for single parameter
fit.plot(['a'])
py.show()

##### PREDICTION ####

# make a dataframe of the posterior draws (pooled across all chains) for each parameter.
# fit.extract() returns a dictionary of arrays, so pull each array out by name
params = pd.DataFrame({'a': fit.extract('a', permuted=True)['a'],
 'b': fit.extract('b', permuted=True)['b']})

# next, make a prediction function. Making a function makes every step following this 10 times easier
def stanPred(p):
 fitted = p['a'] + p['b'] * predX
 return pd.Series({'fitted': fitted})

# make a prediction vector (the values of X for which you want to predict)
predX = np.arange(1, 100)

# get the median parameter estimates
medParam = params.median()
# predict
yhat = stanPred(medParam)

# get the predicted values for each posterior draw. This is super convenient in pandas because
# it is possible to have a single column where each element is an array
chainPreds = params.apply(stanPred, axis = 1)

## PLOTTING

# create a random index to sample 50 of the posterior draws
idx = np.random.choice(len(params), 50)
# plot each chain. chainPreds.iloc[i, 0] gets predicted values from the ith set of parameter estimates
for i in range(len(idx)):
 py.plot(predX, chainPreds.iloc[idx[i], 0], color='lightgrey')

# original data
py.plot(x, y, 'ko')
# fitted values
py.plot(predX, yhat['fitted'], 'k')

# supplementals
py.xlabel('X')
py.ylabel('Y')

py.show()

This yields the following:

[Figure: stan_chains – regression lines from sampled posterior draws (grey) plotted over the data and the fitted line]

Instead of showing chains (which can make a messy, hard to read plot in some cases), we can show a shaded credible interval region:

# make a function that, for every value of predX, collects the predictions from every draw and returns the requested quantile. For example:

def quantileGet(q):
    # make a list to store the quantiles
    quants = []

    # for every predicted value
    for i in range(len(predX)):
        # make a vector to store the predictions from each chain
        val = []

        # next go down the rows and store the values
        for j in range(chainPreds.shape[0]):
            val.append(chainPreds['fitted'][j][i])

        # return the quantile for the predictions.
        quants.append(np.percentile(val, q))

    return quants

# NOTE THAT NUMPY USES PERCENTILES, SO MULTIPLY QUANTILES BY 100
# 2.5% quantile
lower = quantileGet(2.5)
#97.5
upper = quantileGet(97.5)

# plot this
fig = py.figure()
ax = fig.add_subplot(111)

# shade the credible interval
ax.fill_between(predX, lower, upper, facecolor = 'lightgrey', edgecolor = 'none')
# plot the data
ax.plot(x, y, 'ko')
# plot the fitted line
ax.plot(predX, yhat['fitted'], 'k')

# supplementals
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.grid()

py.show()

[Figure: stan_quants – fitted line with shaded 95% credible interval]
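As an aside, the double loop in quantileGet can be collapsed into a single vectorized call by stacking the per-draw predictions into a 2-D array. A quick sketch using the same objects defined above (predMatrix is just a new name for the stacked array):

# stack the fitted vector from each draw into an (n draws x len(predX)) array
predMatrix = np.vstack(chainPreds['fitted'].values)

# percentiles down the columns give the credible band at each value of predX
lower, upper = np.percentile(predMatrix, [2.5, 97.5], axis=0)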

And there you have it! Data analysis in Python just got a whole lot better.

Recommended Setting Changes for Python Graphs

The beautiful thing about matplotlib is that it is infinitely customizable and has great default settings. However, it is worth looking at the matplotlibrc file to change some of the default settings. The file is heavily and wonderfully annotated, so you know exactly what you’re changing as you do it. I played around with it a bit and settled on what I like as my new default settings. I mostly changed text sizes and the widths of the axes (which I felt were too thin).

Here’s what I recommend:

  1. figure.figsize : 6.5, 5.5 – I like a figure size of 6.5 x 5.5. I don’t know why; it’s a holdover from how I made R graphs and I’ve gotten used to it.
  2. legend.numpoints : 1 – gets rid of the double points in pylab’s legend. I have no idea why the default is 2; it always looked funny.
  3. legend.fontsize : 12 – reduces the legend font size. I found that the legend was taking up most of my plot.
  4. legend.handletextpad : 0.2 – reduces the white space between legend markers and legend text.
  5. ytick.major.width : 1 and ytick.minor.width : 1 – increase the thickness of the y-axis tick lines.
  6. xtick.major.width : 1 and xtick.minor.width : 1 – same as above, for the x-axis.
  7. axes.axisbelow : True – puts the grid BELOW the plot elements (rather than on top of them).
  8. axes.linewidth : 1.5 – gives the axes a thicker line. I think this makes the graph pop a little more.

If you save images as PNG or JPG, it’s also worth changing the DPI to something higher than the default. Since I use PDF, this isn’t really an issue.
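If you’d rather not hunt down the matplotlibrc file, the same settings can be applied at the top of a script through rcParams. Here’s a sketch mirroring the list above (the savefig.dpi line is my own addition and only matters for raster formats like PNG or JPG):

import matplotlib as mpl

# the recommended defaults from the list above, applied at runtime
mpl.rcParams['figure.figsize'] = (6.5, 5.5)
mpl.rcParams['legend.numpoints'] = 1
mpl.rcParams['legend.fontsize'] = 12
mpl.rcParams['legend.handletextpad'] = 0.2
mpl.rcParams['ytick.major.width'] = 1
mpl.rcParams['ytick.minor.width'] = 1
mpl.rcParams['xtick.major.width'] = 1
mpl.rcParams['xtick.minor.width'] = 1
mpl.rcParams['axes.axisbelow'] = True
mpl.rcParams['axes.linewidth'] = 1.5
mpl.rcParams['savefig.dpi'] = 300  # only matters when saving PNG/JPG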

Here’s the final result:

[Figure: an example plot with the new default settings]

Now, if only I could get the TeX font not to look ugly as sin and figure out how to make minor ticks turn on by default...

ggplot2 in Python: A major barrier broken

I have been working with Python recently, and I have to say, I love it. There’s a learning curve, of course, which has been frustrating. However, once I got comfortable with it (and I continue to do so), I found that working with dataframes in Python (via pandas) is easy and fast, and that it can do some awesome things in a pretty simple manner. That’s a subject for another time.

One other thing (among many) that I like about Python is matplotlib, the plotting module. It makes great plots in very few lines of code. Its default settings are great, so it doesn’t take a whole lot of tweaking to make a publishable plot. It is also very intuitive.

There are two major barriers that stop me from adopting Python all out. First is the infancy of its stats packages, and since I do mostly data analysis as opposed to modeling, this is a problem. Python has a large number of basic stats modules, including some that allow a formula interface akin to R’s. However, it lacks things like statistical non-linear regression (beyond simple curve fitting) and mixed effects models. Given the rate of development in Python, I doubt that this problem will last much longer. R’s stats functions are much more numerous and under much more rapid development for specialized applications like animal movement, species distribution modeling, econometrics, etc. However, the recent implementation of Bayesian modelling via STAN in Python (as PyStan) by Andrew Gelman and his team has removed the major issue, as I can now fit almost any model (linear, non-linear, multi-level) for which I can write a Bayesian MCMC model (more on this later, because PyStan is awesome but still young; in particular, implementing something like R’s mcmcplots for STAN fits would be amazing).

The second major barrier was trellis plots. R has ggplot2, which makes trellis plotting ludicrously simple. It also allows me to plot summary functions (like means and S.E.) without having to aggregate these values beforehand. This isn’t a problem on a simple plot, but it rapidly becomes cumbersome on a multi-panel plot wherein these summary statistics need to be calculated for each panel. ggplot2 allows me to circumvent this. There is an active port of ggplot2 to Python ongoing, but it too is still young and many functions are incomplete (e.g. boxplots). The good news is that I’ve discovered rpy2, which allows me to call R functions (like ggplot!) from Python. I can do all of my data sorting, stats, etc. in Python, then use rpy2/ggplot to make the plots.

A short example:

# Import the necessary modules
import numpy as np
import pandas as pd
import rpy2.robjects as robj
import rpy2.robjects.pandas2ri # for dataframe conversion
from rpy2.robjects.packages import importr

# First, make some random data
x = np.random.normal(loc = 5, scale = 2, size = 10)
y = x + np.random.normal(loc = 0, scale = 2, size = 10)

# Make these into a pandas dataframe. I do this because
# more often than not, I read in a pandas dataframe, so this
# shows how to use a pandas dataframe to plot in ggplot
testData = pd.DataFrame( {'x':x, 'y':y} )
# it looks just like a dataframe from R
print testData

# Next, you make an R object containing a function that makes the plot.
# The language in the string is pure R, so it can be anything.
# Note that the R environment is blank to start, so ggplot2 has to be
# loaded within the string.
plotFunc = robj.r("""
 library(ggplot2)

function(df){
 p <- ggplot(df, aes(x, y)) +
 geom_point( )

print(p)
 }
""")

# import graphics devices. This is necessary to shut the graph off
# otherwise it just hangs and freezes python
gr = importr('grDevices')

# convert the testData to an R dataframe
robj.pandas2ri.activate()
testData_R = robj.conversion.py2ri(testData)

# run the plot function on the dataframe
plotFunc(testData_R)

# ask for input. This requires you to press enter, otherwise the plot
# window closes immediately
raw_input()

# shut down the window using dev_off()
gr.dev_off()

# you can even save the output once you like it
plotFunc_2 = robj.r("""
 library(ggplot2)

function(df){
 p <- ggplot(df, aes(x, y)) +
 geom_point( ) +
 theme(
 panel.background = element_rect(fill = NA, color = 'black')
 )

ggsave('rpy2_magic.pdf', plot = p, width = 6.5, height = 5.5)
 }
""")

plotFunc_2(testData_R)

rpy2_magic

So there you have it. Python now has the capability to access ggplot2, although it can be a bit cumbersome. Rpy2 is pretty great and allows access to any R function, albeit with a bit of work. Thus, Python can serve as a main platform that calls R functions for statistics and graphics on an as-needed basis. My hope is that the yhat team finishes the port soon, and I’ll bet the statistics packages catch up in the next few years.
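As one more illustration of that last point, here is a minimal sketch of calling an ordinary R statistics function through rpy2, reusing the x and y arrays simulated above (importr exposes R’s t.test as t_test, since dots in R names become underscores):

# call an arbitrary R function from Python via importr
from rpy2.robjects.packages import importr
import rpy2.robjects as robj

# import R's stats package and run t.test on the simulated data
stats = importr('stats')
result = stats.t_test(robj.FloatVector(x), robj.FloatVector(y))
print result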