Ecological Dynamic Programming Optimization in Python

It’s been a while since I’ve posted anything. I’ll use the excuse that I’ve been busy, but mostly I just forget. Regardless, I’m learning how to do Dynamic Programming Optimization (DPO), which sounds more complex than it is. The reason for this is that DPO allows us to simulate the behavior of individuals who make decisions based on current patch quality. DPO is an exciting tool that forms the foundation of individual-based models, which allow us to assess community and ecosystem dynamics as emergent properties of individual decisions based on well-grounded, physiological principles (hence my interest). The underlying idea, as I understand it, is that individuals assess their future reproductive output prior to making a decision (I’m interested in optimal foraging, so I’ll put this in the context of foraging). We can model how individuals make decisions over the course of a lifetime, and track each individual, which then allows us to make quantitative statements about populations, communities, or ecosystems.

This all sounds complicated, and it can be difficult. So it’s easiest to jump right in with an example. Here’s the code, along with a basic explanation of what’s going on, for the very first toy example found in “Dynamic State Variable Models in Ecology” by Clark and Mangel. In the book, they give the computer code in True BASIC but, in all honesty, no one I know uses that. I could use R, but we all know my feelings on that. So here’s their toy model programmed and annotated in Python.

Background

Suppose we’re interested in the behavior of fish and how they choose to distribute themselves among three different patches. Two patches are foraging patches and one is a reproduction patch. The first thing we need to do is make an objective fitness function F. In this example, F relates body size to reproductive output. It can be thought of as what the organism believes about its final fitness given a particular body size. Let’s make it increasing, but asymptotic, with body mass:

F = \frac{A(x-x_c)}{x-x_c+x_0}

Here, A is sort of a saturation constant beyond which fitness increases minimally with body size, x_c is the body size at which mortality occurs (0), and x_0 is the initial body size. Let’s set x_c=0, A=60, and x_0=0.25*x_{max}, where x_{max}=30 is the individual’s maximum possible body size. You have to set a ceiling on body size, otherwise organisms grow without bounds.
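Just to see the shape, here’s a quick numerical sketch (my own, not from the book) evaluating F at a few body masses using these parameter values:

A = 60.0
x_crit = 0.0
x_max = 30.0
x_0 = 0.25 * x_max   # 7.5

def terminal_fitness(x):
    # increasing but asymptotic: F approaches A as body mass gets large
    return A * (x - x_crit) / (x - x_crit + x_0)

for mass in [0, 7.5, 15, 30]:
    print("F(%s) = %.1f" % (mass, terminal_fitness(mass)))
# F(0) = 0.0, F(7.5) = 30.0, F(15) = 40.0, F(30) = 48.0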

OK so that’s what the organism “believes” about its fitness at the end of a season given a specific body mass. The question is: How does the organism forage optimally to maximize fitness at the end of its lifetime?

The obvious answer would be to simulate a foraging decision at every time step moving forward, and then have it decide again at the new time step, etc. This is computationally expensive, so to circumvent this we work backwards from the end of the season. This saves time because there are far fewer paths to a known destination than there are to an unknown one (essentially).

So we imagine that, at the final time step t_f, the organism’s fitness is given by F for any given body mass.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

n_bodysize = 31 # 31 body sizes, from 0-30
n_time = 20 # 20 time steps, 1-20
n_patch = 3 # 3 patches (2 foraging, 1 reproduction)

# make a container for body size (0-30) and time (1-20)
F = np.zeros((n_bodysize, n_time))

# make the function F
x_crit = 0
x_max = 30
acap = 60
x_0 = 0.25*x_max 
t_max = 20

# calculate organism fitness at the final time point
for x in range(x_crit, x_max+1):
    F[x,-1] = acap*(x-x_crit)/(x-x_crit+x_0)

Now that we know fitness at the final time point, we can iterate through backwards (called backwards simulation) to decide what the optimal strategy is to achieve each body mass. To determine that, we need to know the fitness in each patch. Let’s start with the two foraging patches, Patch 1 and Patch 2. We need to know four things about each patch: (1) the mortality rate in each patch (m), (2) the probability of finding food in each patch (p), (3) the metabolic cost of visiting a patch (a), and (4) the gain in body mass if food is successfully found (y). For these two patches, let:

Patch 1: m=0.01, p=0.2, a=1, y=2

Patch 2: m=0.05, p=0.5, a=1, y=4

Right away we can see that Patch 2 is high-risk, high reward compared to Patch 1. In each patch, we calculate the next body size given that an animal does (x‘) or does not (x”) find food:

x' = x-a_i+y_i

x'' = x-a_i

Those are simple equations. The new body size is the current body size minus the metabolic cost of foraging in the patch plus, if food is found, the energy gain from foraging. Great. Now we can calculate the expected fitness of each patch as the weighted average of F(x') and F(x''), weighted by the probability of finding food, all times the probability of surviving the patch. For these two patches, we make a fitness function (V):

V_i = (1 - m_i)\left[p_i F_{t+1}(x') + (1-p_i) F_{t+1}(x'')\right]

The reproductive patch is different. There is no foraging in the reproductive patch. Instead, if the organism is above a critical mass x_{rep}, it devotes all excess energy to reproduction, up to a limit (c=4). If the organism is below the reproductive threshold and still visits the reproductive patch, it simply loses mass (unless it dies).

OK, this is all kind of complicated, so let’s step through it. We know what fitness is at the final time step because of F. So let’s step back one time step. At this penultimate time step, we go through every body mass and calculate fitness for each patch. Let’s do an example. If x=15, then we need to know fitness in Patch 1, Patch 2, and Patch 3. For Patches 1 and 2, we need to know the new body mass if foraging is successful and if it is unsuccessful.

x' = min(15-a_1+y_1, x_{max})

x' = 15-1+2 = 16

x'' = max(15-a_1, x_{c})

x'' = 15-1 = 14

The min and max functions here just make sure our organism doesn’t grow without limit and dies if metabolic cost exceeds body mass. So these are the two potential outcomes of foraging in Patch 1 given x=15. The expected fitness of these two body masses is given as F(16) and F(14). Plug all these values into equation V for Patch 1 to get the expected fitness of Patch 1 at a body size of 15. We then take the maximum over all three patches, save whichever patch gives that maximum as the optimal foraging decision, and save the maximum itself as the fitness for body size 15 at that time step. At body size 15, fitness for Patch 1 is 39, for Patch 2 is about 38.7, and for Patch 3 is 37.6, so the individual will forage in Patch 1, and the fitness for body size 15 at this time step is 39.
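As a quick sanity check, here’s a small sketch (mine, not from the book) that reproduces those three numbers, using the Patch 3 values that appear in the full parameter block below (m=0.02, a=1, c=4):

A, x_crit, x_max, x_0 = 60.0, 0.0, 30.0, 7.5
x_rep, x = 4, 15

def F_end(x):
    return A * (x - x_crit) / (x - x_crit + x_0)

# Patch 1: m=0.01, p=0.2, a=1, y=2
V1 = (1 - 0.01) * (0.2 * F_end(min(x - 1 + 2, x_max)) + 0.8 * F_end(max(x - 1, x_crit)))
# Patch 2: m=0.05, p=0.5, a=1, y=4
V2 = (1 - 0.05) * (0.5 * F_end(min(x - 1 + 4, x_max)) + 0.5 * F_end(max(x - 1, x_crit)))
# Patch 3 (reproduction): m=0.02, a=1, c=4
rep = min(x - x_rep, 4)
V3 = rep + (1 - 0.02) * F_end(max(x - 1 - rep, x_crit))

print("V1 = %.1f, V2 = %.1f, V3 = %.1f" % (V1, V2, V3))
# V1 = 39.0, V2 = 38.7, V3 = 37.6 -> Patch 1 is optimal at x = 15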

Repeat this procedure for every possible body size, and you’ll get the fitness for every body size at the second to last time step as well as the optimal foraging patch for every body size at that time.

Then, step backwards in time. Repeat this whole procedure, except now the value for each body mass doesn’t come from the equation F, but comes from the fitness we just calculated for each body mass. So for example, if an organism’s foraging decisions at this time step lead it to a body mass of 15, then F is now 39. Again, repeat this for every body mass, and then step back, etc.

Here’s the full Python code for how this is done:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

n_bodysize = 31 # 31 body sizes, 0 - 30
n_time = 20 # 20 time steps, 1 - 20
n_patch = 3 # number of patches

#%% CONTAINERS
# make a container for body size (0-30) and time (1-20)
F = np.zeros((n_bodysize, n_time))
# make a container for three patches, each with body size (0-30) and time (1-20)
Vi = np.zeros((n_patch, n_bodysize, n_time))
# make a container for the optimal decision
d = np.zeros((n_bodysize, n_time))

# make a container for the mortality rates in each patch
m = np.zeros(n_patch)
# make a container for the probability of finding food in each patch
p = np.zeros(n_patch)
# make a container for the metabolic cost
a = np.zeros(n_patch)
# make a container for food gain in each patch
y = np.zeros(n_patch)
# make a container for reproductive output in each patch
c = np.zeros(n_patch)

#%% CONDITIONS
x_crit = 0
x_max = 30
t_max = 20
x_rep = 4

#%% PARAMETERS
m[0] = 0.01; m[1] = 0.05; m[2] = 0.02
p[0] = 0.2; p[1] = 0.5; p[2] = 0
a[0] = 1; a[1] = 1; a[2] = 1
y[0] = 2; y[1] = 4; y[2] = 0
c[0] = 0; c[1] = 0; c[2] = 4

#%% END CONDITION
acap = 60
x_0 = 0.25*x_max

# Calculate Fitness for every body mass at the final time step
for x in range(x_crit, x_max+1):
    F[x,-1] = acap*(x-x_crit)/(x-x_crit+x_0) # maximum reproductive output for each body size at the final time

#%% SOLVER
for t in range(0, t_max-1)[::-1]: # for every time step, working backward in time
    for x in range(x_crit+1, x_max+1): # iterate over every body size
        for patch in range(0, 3): # for every patch
            if patch in [0, 1]: # if in a foraging patch
                xp = x - a[patch] + y[patch] # the updated body size if food is found
                xp = min(xp, x_max) # constraint on max size
                xpp = int(x - a[patch]) # updated body size if no food
                Vi[patch,x,t] = (1 - m[patch])*(p[patch]*F[int(xp),t+1] + (1-p[patch])*F[xpp,t+1]) # expected fitness for foraging in that patch
            else: # the reproduction patch
                if x < x_rep: # below the reproductive threshold: just pay the metabolic cost
                    xp = max(x - a[patch], x_crit)
                    Vi[patch,x,t] = (1 - m[patch])*F[int(xp),t+1]
                else:
                    fitness_increment = min(x - x_rep, c[patch]) # resources devoted to reproduction
                    xp = max(x - a[patch] - fitness_increment, x_crit) # new body size is body size minus metabolism minus reproduction
                    Vi[patch,x,t] = fitness_increment + (1 - m[patch])*F[int(xp),t+1]
        vmax = max(Vi[0,x,t], Vi[1,x,t], Vi[2,x,t]) # best expected fitness across patches
        if vmax == Vi[0,x,t]:
            d[x,t] = 1
        elif vmax == Vi[1,x,t]:
            d[x,t] = 2
        elif vmax == Vi[2,x,t]:
            d[x,t] = 3
        F[x,t] = vmax # the expected fitness at this time step for this body mass

This model doesn’t really track individual behavior. What it does is provide an optimal decision for every time and every body mass. So we know what, say, an individual should do if it finds itself small towards the end of its life, or if it finds itself large at the beginning:

T, X = np.meshgrid(range(1, t_max+1), range(x_crit, x_max+1))
df = pd.DataFrame({'X': X.ravel(), 'T': T.ravel(), 'd': d.ravel()})
df = df.pivot(index='X', columns='T', values='d')
sns.heatmap(df, vmin=1, vmax=3, annot=True, cbar_kws={'ticks': [1, 2, 3]})
plt.gca().invert_yaxis()
plt.show()

[Figure: heatmap of the optimal patch (1, 2, or 3) for every body mass and time step]
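The decision matrix can also be fed forward: start an individual at some mass, look up its optimal patch at each time step, and let food and mortality play out stochastically. Here’s a minimal sketch of that idea (my own addition, not from the book), reusing the arrays defined in the code above:

# forward simulation of a single individual following the optimal policy (a sketch)
np.random.seed(1)

x = 10              # starting body mass
lifetime_rep = 0.0  # accumulated reproductive output
for t in range(0, t_max-1):
    patch = int(d[x, t]) - 1               # optimal patch (0-indexed) for this mass and time
    if np.random.rand() < m[patch]:        # unlucky: died in the patch
        print("died in patch %d at t = %d" % (patch + 1, t))
        break
    if patch == 2:                         # reproduction patch
        rep = min(x - x_rep, c[patch]) if x >= x_rep else 0
        lifetime_rep += rep
        x = int(max(x - a[patch] - rep, x_crit))
    else:                                  # foraging patch
        gain = y[patch] if np.random.rand() < p[patch] else 0
        x = int(min(max(x - a[patch] + gain, x_crit), x_max))
    if x <= x_crit:                        # starved
        break

print("final mass: %d, total reproduction: %.1f" % (x, lifetime_rep))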

And that’s that!! To be honest, I still haven’t wrapped my head around this fully. I wrote this blog post in part to make myself think harder about what was going on, rather than just regurgitating code from the book.

My Ideal Python Setup for Statistical Computing

I’m moving more and more towards Python only (if I’m not there already). So I’ve spent a good deal of time getting the ideal Python IDE setup going. One of the biggest reasons I was slow to move away from R is that R has the excellent RStudio IDE. Python has Spyder, which is comparable but seems sluggish compared to RStudio. I’ve tried PyCharm, which works well, but I had issues with its interactive interpreter running my STAN models.

A friend pointed me towards Sublime Text 3, and I have to say that it’s everything I wanted. The text editor is slick, fast, and has lots of great functions. But more than that, the add-ons are really what make Sublime shine.

Vital Add Ons:

  • Side Bar Enhancements: This extends the side-bar project organizer, allowing you to add folders and files, delete things, copy paths, etc. A must have.
  • SublimeREPL: Adds interactive interpreters for an enormous number of languages, both R and Python included. Impossible to work without.
  • Anaconda: An AMAZING package that extends Sublime by offering live Python linting to make sure my code isn’t screwed up, PEP8 formatters for those of you who like such things, and built-in documentation and code retrieval for those times you’ve forgotten how a function works. Another must have.
  • SublimeGIT: For working with github straight from Sublime. Great if you’re doing any sort of module building.
  • Origami: A new way to split layouts and organize your screen. Not essential, but helpful.
  • Bracket Highlighter: Helpful for seeing just what set of parentheses I’m working in.

Sublime and all of these packages are also incredibly customizable; you can make them work and look however you want. I’ve spent a few days customizing my setup and I think it’s fairly solid. Here are my preferences:

For the main Sublime settings, I modified the scrolling map, turned off autocomplete (which I find annoying but can still access with Ctrl+Space), adjusted the caret so I could actually see it, changed the font, and a few other odds and ends.

{
"always_show_minimap_viewport": true,
"auto_complete": false,
"bold_folder_labels": true,
"caret_style": "phase",
"color_scheme": "Packages/Theme - Flatland/Flatland Dark.tmTheme",
"draw_minimap_border": true,
"fade_fold_buttons": false,
"font_face": "DejaVu Sans Mono",
"font_size": 14,
"highlight_line": true,
"highlight_modified_tabs": true,
"ignored_packages":
[
"Vintage"
],
"line_padding_bottom": 1,
"line_padding_top": 1,
"preview_on_click": false,
"spell_check": true,
"wide_caret": true
}

For Bracket Highlighter, I changed the style of the highlight:

{
"high_visibility_enabled_by_default": true,
"high_visibility_style": "thin_underline",
"high_visibility_color": "__default__",
}

For Side-Bar Enhancements, I’ve modified the ‘Open With’ options. For Anaconda, I changed a few small things and turned off PEP8 linting, which I hate. I don’t hate linting or PEP8, but I don’t have much use for PEP8 linting constantly telling me that I put a space somewhere inappropriate.

{
"complete_parameters": true,
"complete_all_parameters": false,
"anaconda_linter_mark_style": "outline",
"pep8": false,
"anaconda_gutter_theme": "basic",
"anaconda_linter_delay": 0.5,
}

I also installed the Flatland Theme to make it pretty. Here is the end result, also showing the Anaconda documentation viewer that I find so awesome:

[Screenshot: the finished Sublime setup, with the Anaconda documentation viewer open]

I also now use Sublime for all of my R, knitr, and LaTeX work as well. In all, it’s a pretty phenomenal editor that can do everything I need it to and combines at least four separate applications into one (TextWrangler, Spyder, RStudio, TeXShop). Now, someday I’ll be able to afford the $70 to turn off that reminder that I haven’t paid (and $15 for LaTeXing).

UPDATE

I forgot to mention snippets. You can create snippets in Sublime that are shortcuts for longer code. For example, I heavily customize my graphs in the same way every time. Instead of typing all the code, I can now just type tplt followed by a tab and I automatically get:


f, ax = plt.subplots()
ax.plot()
#ax.set_ylim([ , ])
#ax.set_xlim([ , ])
ax.set_ylabel("ylab")
ax.set_xlabel("xlab")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_position(('outward', 10))
#ax.spines['bottom'].set_bounds()
ax.spines['left'].set_position(('outward', 10))
#ax.spines['left'].set_bounds()
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
plt.savefig(,bbox_inches = 'tight')
plt.show()

Great if you rewrite the same code many times.

Cleaning Data and Graphing in R and Python

Python has some pretty awesome data-manipulation and graphing capabilities. If you’re a heavy R user who dabbles in Python like me, you might wonder what the equivalent commands are in Python for dataframe manipulation. Additionally, I was curious to see how many lines of code it took me to do the same task (load, clean, and graph data) in both R and Python. I’d like to head off the arguments about efficiency and which language is better: neither my R nor my Python code is the super-efficient, optimal way to do this. They are, however, how I do things, so to me that’s what matters. I’m also not trying to advocate one language over the other (programmers can be a sensitive bunch); I just wanted to post an example showing how to do equivalent tasks in each language.

First, R

# load library
library(ggplot2)

# read data
JapBeet_NoChoice <- read.csv("~/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")
# drop incomplete data
feeding <- subset(JapBeet_NoChoice, Consumption!='NA')
# refactor and clean
feeding$Food_Type <- factor(feeding$Food_Type)
feeding$Temperature[which(feeding$Temperature==33)] <- 35

# subset
plants <- c('Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina')
subDat <- feeding[feeding$Food_Type %in% plants, ]

# make a standard error function for plotting
seFunc <- function(x){
 se <- sd(x) / sqrt(sum(!is.na(x)))
 lims <- c(mean(x) + se, mean(x) - se)
 names(lims) <- c('ymin', 'ymax')
 return(lims)
}

# ggplot!
ggplot(subDat, aes(Temperature, Herb_RGR, fill = Food_Type)) +
 stat_summary(geom = 'errorbar', fun.data = 'seFunc', width = 0, aes(color = Food_Type), show_guide = F) +
 stat_summary(geom = 'point', fun.y = 'mean', size = 3, shape = 21) +
 ylab('Mass Change (g)') +
 xlab(expression('Temperature '*degree*C)) +
 scale_fill_discrete(name = 'Plant Species') +
 theme(
 axis.text = element_text(color = 'black', size = 12),
 axis.title = element_text(size = 14),
 axis.ticks = element_line(color = 'black'),
 legend.key = element_blank(),
 legend.title = element_text(size = 12),
 panel.background = element_rect(color = 'black', fill = NA)
 )
[Figure: the ggplot2 version of the graph. Snazzy!]

Next, Python!

# module import
import numpy as np
import pandas as pd
import pylab as py

# read data
JapBeet_NoChoice = pd.read_csv("/Users/Nate/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")

# clean up
feeding = JapBeet_NoChoice.dropna(subset = ['Consumption'])
feeding['Temperature'].replace(33, 35, inplace = True)

# subset out the correct plants
keep = ['Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina']
feeding2 = feeding[feeding['Food_Type'].isin(keep)]

# calculate means and SEs
group = feeding2.groupby(['Food_Type', 'Temperature'], as_index = False)
sum_stats = group['Herb_RGR'].agg({'mean' : np.mean, 'SE' : lambda x: x.std() / np.sqrt(x.count())})

# PLOT
for i in range(5):
    py.errorbar(sum_stats[sum_stats['Food_Type'] == keep[i]]['Temperature'],
                sum_stats[sum_stats['Food_Type'] == keep[i]]['mean'],
                yerr = sum_stats[sum_stats['Food_Type'] == keep[i]]['SE'],
                fmt = 'o', ms = 10, capsize = 0, mew = 1, alpha = 0.75,
                label = keep[i])

py.xlabel(u'Temperature (\u00B0C)')
py.ylabel('Mass Change')
py.xlim([18, 37])
py.xticks([20, 25, 30, 35])
py.legend(loc = 'upper left', prop = {'size':10}, fancybox = True, markerscale = 0.7)
py.show()
[Figure: the matplotlib version of the graph. Snazzy 2!]

So, roughly the same number of lines (excluding the importing of modules and libraries), although a bit more efficient in Python (barely). For what it’s worth, I showed these two graphs to a friend and asked him which he liked more; he chose Python immediately. Personally, I like them both. It’s hard for me to pick one over the other. I think they’re both great. The curious can see my much older, waaayyy less efficient, much more hideous version of this graph in my paper, but I warn you, it isn’t pretty. And the code was a nightmare (it was pre-ggplot2 for me, so it was made with R’s base plotting commands, which are a beast for this kind of graph).

PyStan: An Intermediate Tutorial of Bayesian Data Analysis in Python

Last week I gave a brief description of Bayesian data analysis using Python and PyStan. The example was pretty basic, and if you’re like me, you hate when you look up examples to figure out what to do and they’re either so simple (like last week) as to be nearly unhelpful or so advanced they’re unintelligible. So suit up, it’s code time.

[Image: "It's code time" (and beware of computer raptors)]

Here is my attempt to provide code for a simple Bayesian hierarchical model in PyStan, again all the way from the beginning to the end. The model is of a caterpillar (Epimecis hortaria) feeding on three host plants across four temperatures, where each host plant is a random factor. The model is exponential (not linear), but that’s a relatively small change.
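Written out, the hierarchical model encoded in the Stan block below is:

y_i \sim \mathrm{Normal}\left(\exp(a_{plant[i]} + b_{plant[i]} \cdot Temp_i),\ \sigma_{plant[i]}\right)

a_j \sim \mathrm{Normal}(\mu_a, \sigma_a), \qquad b_j \sim \mathrm{Normal}(\mu_b, \sigma_b)

so each host plant gets its own intercept and slope, those plant-level parameters are drawn from shared distributions (that’s the hierarchical part), and the exp() is what makes it exponential rather than linear.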

import pandas as pd
import pystan
import pylab as py
import numpy as np
import scipy.stats as st

herbResp = pd.read_csv( '/Users/Nate/Documents/FIU/Research/Temp_HerbResponses/Data/Thermal_Curves/Feeding_Assays.csv' )

herb2 = herbResp[herbResp['Comments'] != 'DEAD']
herb2 = herb2.dropna(subset = ['Herb_RGR'])

# pull out Epimecis Data
epimecis = herb2[herb2['Herbivore_Scientific'] == 'Epimecis hortaria']
epimecis = epimecis[epimecis['MassCorr_Consumption_Daily'] > 0]

# set temperature of 33 to 35
epimecis['Temperature'].replace(33, 35, inplace = True)

y = epimecis['MassCorr_Consumption_Daily']
Temp = (epimecis['Temperature'] - epimecis['Temperature'].mean()) / epimecis['Temperature'].std()
N = len(y)

Plant = pd.factorize(epimecis['Food_Type'])[0] + 1
# STAN model
epimecisMOD = """
data{
    int N;
    int Plant[N];
    vector[N] y;
    vector[N] Temp;
}

parameters{
    vector[3] a;
    vector[3] b;
    real<lower = 0> sigma[3];

    real mu_a;
    real mu_b;
    real<lower = 0> sigma_a;
    real<lower = 0> sigma_b;
}

model{
    vector[N] mu;

    for(i in 1:N){
        mu[i] <- exp(a[Plant[i]] + b[Plant[i]] * Temp[i]);
        y[i] ~ normal(mu[i], sigma[Plant[i]]);
    }

    a ~ normal(mu_a, sigma_a);
    b ~ normal(mu_b, sigma_b);
}
"""

dat = {'y':y,
 'Temp':Temp,
 'N':N,
 'Plant':Plant}

fit = pystan.stan(model_code = epimecisMOD, data = dat, chains = 4, iter = 4000, thin = 10)

fit.plot('a')
fit.plot('b')

print fit

# get parameter estimates
a = fit.extract('a', permuted = True)
b = fit.extract('b', permuted = True)

aChains = pd.DataFrame( np.array(a['a']), columns = ['Lindera', 'Liriodendron', 'Sassafras'] )
bChains = pd.DataFrame( np.array(b['b']), columns = ['Lindera', 'Liriodendron', 'Sassafras'] )

bChains.hist(bins = 50, color = 'k', alpha = 0.5)
py.show()

This gives the histogram of parameter estimates for ‘b’:

[Figure: histograms of the posterior draws of b for each plant]

Next, plot the output (in a basic plot with no chains or CIs):

# PLOTTING ##
# get the median parameter estimates
aMed = aChains.median()
bMed = bChains.median()

# get the mean and stdev of temperature for conversion
mu = epimecis['Temperature'].mean()
std = epimecis['Temperature'].std()

# get a prediction vector and convert it back to original temperature
xPred = np.linspace(min(Temp), max(Temp), 100)
xTemp = xPred*std + mu

# make the predictions for sassafras and lindera
sassPred = np.exp(aMed[2] + bMed[2]*xPred)
linPred = np.exp(aMed[0] + bMed[0]*xPred)

# make a grouped object and then calculate the mean and SE for each plant-temperature combination
grouped = epimecis.groupby(['Food_Type', 'Temperature'], as_index = False)
summary = grouped['MassCorr_Consumption_Daily'].agg({'mean' : np.mean,
 'sem' : lambda x: x.std() / np.sqrt(x.count()) })
plants = ['Lindera benzoin', 'Liriodendron tulipifera', 'Sassafras albidum']

fig = py.figure(figsize = (6.5, 5.5))

for i in range(3):
    py.errorbar(summary[summary['Food_Type'] == plants[i]]['Temperature'],
        summary[summary['Food_Type'] == plants[i]]['mean'],
        yerr = summary[summary['Food_Type'] == plants[i]]['sem'],
        fmt = 'o', ms = 10, capsize = 0, mew = 1,
        label = plants[i])

py.plot(xTemp, linPred, 'b')
py.plot(xTemp, sassPred, 'r')
py.title('Consumption Rates of Epimecis hortaria')
py.ylabel('Mass-Corrected Daily Consumption (g)')
py.xlabel(u"Temperature (\u00B0C)")
py.xticks([20, 25, 30, 35])
py.xlim([18, 37])
py.legend(loc='upper left', prop = {'size':12},
 fancybox=True, shadow=True)
py.savefig('epimecis.pdf', transparent = True, bbox_inches = 'tight')

py.show()

[Figure: data and fitted consumption curves]

And there you have it. A simple hierarchical model from start to an almost-finish (almost because I’d probably add CIs in a real plot).

Stay tuned for a full on analysis from a paper of mine (eventually). Including CI’s, overall effects, group-level effects, and group-level predictors (something that isn’t really available in the current STAN examples). Don’t hide it. I know you’re excited.

PyStan: A Basic Tutorial of Bayesian Data Analysis in Python

I’ve been waiting for PyStan for a little while, and it has arrived. The STAN team has posted great examples of the STAN modeling language and very brief examples of how to run PyStan. However, their examples are brief and stop just after model fitting. They do not include the full run-through of plotting predictions, traceplots of specific parameters, etc. So I thought I’d do a blog post of basic linear regression in PyStan, detailing how to code the model, fit the model, check the model, and plot the results. The whole shebang. So without further ado:

# module import
import pystan
import numpy as np
import pylab as py
import pandas as pd

## data simulation
x = np.arange(1, 100, 5)
y = 2.5 + .5 * x + np.random.randn(20) * 10

# get number of observations
N = len(x)

# plot the data
py.plot(x,y, 'o')
py.show()

# STAN model (this is the most important part)
regress_code = """
data {
 int<lower = 0> N; // number of observations
 real y[N]; // response variable
 real x[N]; // predictor variable
}
parameters {
 real a; // intercept
 real b; // slope
 real<lower=0> sigma; // standard deviation
}
transformed parameters {
 real mu[N]; // fitted values

for(i in 1:N)
 mu[i] <- a + b*x[i];
}
model {
 y ~ normal(mu, sigma);
}
"""

# make a dictionary containing all data to be passed to STAN
regress_dat = {'x': x,
 'y': y,
 'N': N}

# Fit the model
fit = pystan.stan(model_code=regress_code, data=regress_dat,
 iter=1000, chains=4)

# model summary
print fit

# show a traceplot of ALL parameters. This is a bear if you have many parameters
fit.traceplot()
py.show()

# Instead, show a traceplot for single parameter
fit.plot(['a'])
py.show()

##### PREDICTION ####

# make a dataframe of parameter estimates for all chains
params = pd.DataFrame({'a': fit.extract('a', permuted=True)['a'], 'b': fit.extract('b', permuted=True)['b']})

# next, make a prediction function. Making a function makes every step following this 10 times easier
def stanPred(p):
 fitted = p[0] + p[1] * predX
 return pd.Series({'fitted': fitted})

# make a prediction vector (the values of X for which you want to predict)
predX = np.arange(1, 100)

# get the median parameter estimates
medParam = params.median()
# predict
yhat = stanPred(medParam)

# get the predicted values for each chain. This is super convenient in pandas because
# it is possible to have a single column where each element is a list
chainPreds = params.apply(stanPred, axis = 1)

## PLOTTING

# create a random index for sampling 50 posterior draws
idx = np.random.choice(len(chainPreds), 50)
# plot each chain. chainPreds.iloc[i, 0] gets predicted values from the ith set of parameter estimates
for i in range(len(idx)):
 py.plot(predX, chainPreds.iloc[idx[i], 0], color='lightgrey')

# original data
py.plot(x, y, 'ko')
# fitted values
py.plot(predX, yhat['fitted'], 'k')

# supplementals
py.xlabel('X')
py.ylabel('Y')

py.show()

This yields the following:

[Figure: simulated data, fitted regression line, and 50 sampled posterior regression lines in grey]

Instead of showing chains (which can make a messy, hard to read plot in some cases), we can show a shaded credible interval region:

# make a function that iterates over every predicted value in every chain and returns the desired quantile. For example:

def quantileGet(q):
    # make a list to store the quantiles
    quants = []

    # for every predicted value
    for i in range(len(predX)):
        # make a vector to store the predictions from each chain
        val = []

        # next go down the rows and store the values
        for j in range(chainPreds.shape[0]):
            val.append(chainPreds['fitted'][j][i])

        # return the quantile for the predictions.
        quants.append(np.percentile(val, q))

    return quants

# NOTE THAT NUMPY DOES PERCENTILES, SO MULTIPLY THE QUANTILE BY 100
# 2.5% quantile
lower = quantileGet(2.5)
#97.5
upper = quantileGet(97.5)

# plot this
fig = py.figure()
ax = fig.add_subplot(111)

# shade the credible interval
ax.fill_between(predX, lower, upper, facecolor = 'lightgrey', edgecolor = 'none')
# plot the data
ax.plot(x, y, 'ko')
# plot the fitted line
ax.plot(predX, yhat['fitted'], 'k')

# supplementals
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.grid()

py.show()

[Figure: data and fitted line with shaded 95% credible interval region]
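As an aside, if the double loop bothers you, numpy can compute the same quantiles in a couple of lines; this is just a sketch assuming chainPreds is the DataFrame built above:

# stack the per-draw prediction vectors into a (n_draws x len(predX)) array
predMat = np.vstack(chainPreds['fitted'].values)

# percentiles down the columns give the same lower/upper bounds as quantileGet
lower = np.percentile(predMat, 2.5, axis=0)
upper = np.percentile(predMat, 97.5, axis=0)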

And there you have it! Data analysis in Python just got a whole lot better.