Multivariate Techniques in Python: EcoPy Alpha Launch!

I’m announcing the alpha launch of EcoPy: Ecological Data Analysis in Python. EcoPy is a Python module containing a number of techniques (PCA, CA, CCorA, nMDS, MDS, RDA, etc.) for exploring complex multivariate data. For those of you familiar with R, think of it as the Python equivalent of the ‘vegan’ package.
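To give a flavor of what these techniques do, here is a minimal PCA sketch in plain NumPy (this is not EcoPy’s API, which may differ; the data are made up). The core of a PCA is just a singular value decomposition of the column-centered data matrix:

```python
import numpy as np

# toy data: 6 sites x 3 species abundances (hypothetical numbers)
X = np.array([[4., 1., 0.],
              [5., 2., 1.],
              [1., 6., 2.],
              [0., 5., 3.],
              [2., 2., 7.],
              [1., 1., 8.]])

# center each column, then take the SVD to get principal axes and site scores
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                        # site scores on each principal axis
var_explained = s**2 / np.sum(s**2)   # proportion of variance per axis
print(var_explained)
```

Packages like vegan (and EcoPy) wrap this kind of computation with ecology-specific scalings and plotting.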

However, I’m not done! This is the alpha launch, which means you should exercise caution before using this package for research. I’ve stress-tested a couple of simple examples to ensure I get equivalent output in numerous programs, but I haven’t tested it out with real, messy data yet. There might be broken bits or quirks I don’t know about. For the moment, be sure to verify your results with other software.

That said, I need help! My coding might be sloppy, or you might find errors that I didn’t, or you might suggest improvements to the interface. If you want to contribute, either by helping with coding or stress-testing on real-world examples, let me know. The module is available on github and full documentation is available at readthedocs.org.

A Split-Plot Three-Way ANOVA with STAN and Python

Now that STAN has arrived and has been ported over to Python, I’ve moved all of my data analyses over to Python. At first, it was kind of a pain. Python can fit ANOVAs and linear models in an R-like interface, but the statsmodels module is still under development and has no support for multi-level models (like split-plots or nested designs). However, with STAN now implemented in Python, it’s possible to code these models yourself and run whatever analyses you want in Python using Bayesian methods and principles (which is even better!).

Here’s an example of a split-plot three-way ANOVA-style analysis programmed in STAN and implemented in Python. Since these data haven’t been published yet, I won’t link to them or discuss them in any way; this post is mainly to help those who might be looking to do something similar. The experimental design has two within-plot factors, chamber temperature and induced, both of which have two levels that I represent as 0 and 1. For example, chamber temperature is either 0 (25˚) or 1 (30˚). Likewise, induced is either 0 (Not Induced) or 1 (Induced). The whole-plot factor is either 0 (Ambient) or 1 (Warmed). This is pretty simple since no factor has more than two levels: just one regression coefficient is needed to identify each factor.
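As a concrete illustration (this is not the original data-prep code; the arrays are made up), the eight variance groups the model uses can be built from the three binary factors with simple integer arithmetic:

```python
import numpy as np

# hypothetical dummy-coded factors for eight observations,
# one per treatment combination
induced      = np.array([0, 1, 0, 1, 0, 1, 0, 1])
temp         = np.array([0, 0, 1, 1, 0, 0, 1, 1])
chamber_temp = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# each of the 2 x 2 x 2 treatment combinations gets its own
# variance group, numbered 1-8 (Stan arrays are 1-indexed)
var_group = 1 + induced + 2*temp + 4*chamber_temp
print(var_group)  # [1 2 3 4 5 6 7 8]
```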

In a split plot design, the within-plot factors and all interactions involving the within-plot factors occur at the lower level, while the whole-plot factor occurs at the top level. Here it is in STAN. Within-plot regression coefficients were drawn from a multivariate normal distribution and each whole-plot : within-plot treatment combination was given its own variance, rather than assuming constant variance among groups.

splitPlot = """
data{
    int<lower = 0> N;     // number of observations
    int<lower = 0> J;     // number of plots
    real y[N];     // response (Herbivore RGR)
    int plot[N];     // plot identifier
    int induced[N];     // dummy coding for netting treatment (0 = Not Induced, 1 = Induced)
    int temp[N];     // dummy coding for plot temperature (0 = Ambient, 1 = Warmed)
    int chamber_temp[N];     // dummy coding for the feeding assay temperature (0 = 25, 1 = 30)
    int var_group[N];     // dummy coding for variance group
    int plot_temp[J];     // dummy coding for plot temperature at the plot level (0 = Ambient, 1 = Warmed)
    cov_matrix[6] prior_cov;     // prior for covariance matrix of regression coef
    vector[6] prior_mu;     // prior for mean of regression coefficients (0)
}
parameters{
    real B0[J];     // plot means
    vector[6] B;     // Coefficients
    real G1;     // plant temperature effects
    real mu;     // overall mean
    real <lower = 0, upper = 10> sd_y[8];     // sd for each temperature-chamber-induced group
    real <lower = 0, upper = 10> sd_b0;     // sd of plot-level means
}
transformed parameters{
    vector[N] yhat;
    vector[J] B0hat;
    vector[N] sd_temp;

    for(n in 1:N){
        yhat[n] <- B0[plot[n]] + B[1]*induced[n] + B[2]*chamber_temp[n] + B[3]*induced[n]*temp[n] + B[4]*induced[n]*chamber_temp[n] + B[5]*temp[n]*chamber_temp[n] + B[6]*induced[n]*temp[n]*chamber_temp[n];
        sd_temp[n] <- sd_y[var_group[n]];
    }

    for(j in 1:J){
        B0hat[j] <- mu + G1*plot_temp[j];
    }
}
model{
    y ~ normal(yhat, sd_temp);
    B0 ~ normal(B0hat, sd_b0);

    // PRIOR
    mu ~ normal(0, 4);
    G1 ~ normal(0, 4);
    B ~ multi_normal(prior_mu, prior_cov);
}
generated quantities{
    real induced_ambient25;
    real induced_warmed25;
    real netted_ambient25;
    real netted_warmed25;
    real induced_ambient30;
    real induced_warmed30;
    real netted_ambient30;
    real netted_warmed30;

    induced_ambient25 <- mu + B[1];
    netted_ambient25 <- mu;
    induced_warmed25 <- mu + G1 + B[1] + B[3];
    netted_warmed25 <- mu + G1;

    induced_ambient30 <- mu + B[1] + B[2] + B[4];
    induced_warmed30 <- mu + G1 + B[1] + B[2] + B[3] + B[4] + B[5] + B[6];
    netted_ambient30 <- mu + B[2];
    netted_warmed30 <- mu + G1 + B[2] + B[5];
}
"""

Above is the preferred linear-model-based specification. It basically says that the predicted values (y-hat) are a function of the plot mean and any within-plot factors, and the plot means themselves are a function of an intercept and the whole-plot factor. The mixing is beautiful (100,000 samples, thinning every 20) and quick (just under 3 minutes):

[Figure: posterior trace plots showing well-mixed chains]

You also get good coefficient estimates that are easy to interpret. I won’t show them here because the data aren’t published, but each regression coefficient is directly interpretable as a main effect or interaction (this would get more complicated if one of my factors had >2 levels, but still doable).
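To see why dummy-coded coefficients are directly interpretable this way, here is a toy illustration (made-up numbers, and only two factors for brevity) of how the coefficients map onto cell means:

```python
# made-up coefficients for a 2 x 2 dummy-coded model:
# mean = mu + b1*A + b2*B + b3*A*B, with A, B in {0, 1}
mu, b1, b2, b3 = 2.0, 0.5, -1.0, 0.25

cell_means = {}
for A in (0, 1):
    for B in (0, 1):
        cell_means[(A, B)] = mu + b1*A + b2*B + b3*A*B

# b1 is exactly the effect of A when B = 0...
print(cell_means[(1, 0)] - cell_means[(0, 0)])  # 0.5
# ...and b3 is how much the A effect changes when B = 1 (the interaction)
print((cell_means[(1, 1)] - cell_means[(0, 1)]) - b1)  # 0.25
```

The `generated quantities` block in the model above is doing the reverse of this mapping: summing the relevant coefficients to recover each cell mean.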

There is an alternative, cell-means formulation where you just estimate the cell means directly.


splitPlot = """
data{
    int<lower = 0> N; // number of observations
    int<lower = 0> J; // number of plots
    real y[N]; // response (Herbivore RGR)
    int plot[N]; // plot identifier
    int induced[N]; // index coding for netting treatment (1 = Not Induced, 2 = Induced; Stan arrays are 1-indexed)
    int temp[N]; // index coding for plot temperature (1 = Ambient, 2 = Warmed)
    int chamber_temp[N]; // index coding for the feeding assay temperature (1 = 25, 2 = 30)
    int var_group[N]; // dummy coding for variance group
    int plot_temp[J]; // index coding for plot temperature at the plot level (1 = Ambient, 2 = Warmed)
}
parameters{
    real B0[J]; // plot means
    real B1[2]; // netting effects
    real B3[2]; // chamber temperature effects
    real B2[2,2,2]; // induced x plot temperature x chamber temperature cells
    real G1[2]; // plant temperature effects
    real mu; // overall mean
    real <lower = 0, upper = 10> sd_y[8]; // sd for each treatment-combination group
    real <lower = 0, upper = 10> sd_b0;
}
transformed parameters{
    vector[N] yhat;
    vector[J] B0hat;
    vector[N] sd_temp;

    for(n in 1:N){
        yhat[n] <- B0[plot[n]] + B1[induced[n]] + B3[chamber_temp[n]] + B2[induced[n], temp[n], chamber_temp[n]];
        sd_temp[n] <- sd_y[var_group[n]];
    }

    for(j in 1:J){
        B0hat[j] <- mu + G1[plot_temp[j]];
    }
}
model{
    y ~ normal(yhat, sd_temp);
    B0 ~ normal(B0hat, sd_b0);

    // PRIOR
    mu ~ normal(0, 4);
    G1 ~ normal(0, 4);
    B1 ~ normal(0, 4);
    B3 ~ normal(0, 4);
}
generated quantities{
    real grand_mean;
    real A1[2];
    real A2[2];
    matrix[2,2] A3;
    real netted_ambient25;
    real netted_warm25;
    real induced_ambient25;
    real induced_warm25;
    real netted_ambient30;
    real netted_warm30;
    real induced_ambient30;
    real induced_warm30;

    netted_ambient25 <- mu + B1[1] + G1[1] + B2[1,1,1] + B3[1];
    netted_warm25 <- mu + B1[1] + G1[2] + B2[1,2,1] + B3[1];
    induced_ambient25 <- mu + B1[2] + G1[1] + B2[2,1,1] + B3[1];
    induced_warm25 <- mu + B1[2] + G1[2] + B2[2,2,1] + B3[1];

    netted_ambient30 <- mu + B1[1] + G1[1] + B2[1,1,2] + B3[2];
    netted_warm30 <- mu + B1[1] + G1[2] + B2[1,2,2] + B3[2];
    induced_ambient30 <- mu + B1[2] + G1[1] + B2[2,1,2] + B3[2];
    induced_warm30 <- mu + B1[2] + G1[2] + B2[2,2,2] + B3[2];
}
"""

I don’t like this specification as much because the parameters don’t mix well (although the generated quantities themselves are fine). To get good parameter estimates, you need to somehow impose sum-to-zero constraints, either within the STAN model (which I don’t know how to do) or after the fact by manipulating the raw posterior parameter estimates (which I didn’t care enough to figure out). This formulation is also MUCH slower, taking over twice as long as the first model. I suspect it has to do with the 3D array for the interaction term, but I’m not positive.
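For completeness, the after-the-fact route amounts to recentering each posterior draw so a factor’s effects sum to zero, folding the removed mean into the intercept. A minimal sketch on simulated draws (not the real posterior; the numbers are made up):

```python
import numpy as np

# fake posterior draws: 1000 samples of a two-level effect B1
rng = np.random.RandomState(42)
B1 = rng.normal(loc=[1.0, 3.0], scale=0.5, size=(1000, 2))

# impose the sum-to-zero constraint draw by draw: subtract each draw's
# mean across levels; that mean would be added to the intercept draws
B1_mean = B1.mean(axis=1, keepdims=True)
B1_centered = B1 - B1_mean            # constrained effects, sum to zero
# mu_adjusted = mu_draws + B1_mean.ravel()   # mu_draws: intercept draws

print(np.allclose(B1_centered.sum(axis=1), 0))  # True
```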

That said, both models give great predictions of the cell means, of which I will only show one tiny bit (black is observed means, red is modeled means):

[Figure: observed (black) and modeled (red) cell means]

Phylogenies in R and Python

One of the reasons I switched to Python from R is that Python’s phylogenetic capabilities are very well developed (though R is catching up). I’m moving into phylogenetic community ecology, which requires a lot of tree manipulation and calculation of metrics, but not much actual tree construction. Python excels at these things thanks to an excellent module called ETE2; R has some good packages as well, including ape and picante.

I can’t compare and contrast all of the features of R’s and Python’s phylogenetic capabilities, but since I like making pretty pictures, I thought I’d demonstrate how to plot in both languages. Making a basic plot is pretty simple in both. More complex plots are… well, more complex. I find ETE2 more full-featured, but it has a pretty steep learning curve. Once you get the hang of it, though, there is nothing you can’t do. More or less.

R’s phylogenetic plotting capabilities are good, but limited when it comes to displaying quantitative data alongside the tree. For example, it’s relatively easy to make a phylogeny where native and introduced species have different colors:


require(picante)

# species cover
fullSpData <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/master_sp_data.csv")
# phylogeny
SERCphylo <- read.tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')
# traits
plantTraits <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv")

# Put an underscore in the species names to match with the phylogeny
plantTraits$species <- gsub(' ', '_', plantTraits$species)

#Isolate complete cases of traits
traits <- subset(plantTraits, select = c('species', 'woody', 'introduced', 'SLA', 'seedMass', 'toughness'))
traits <- traits[complete.cases(traits), ]

# Make a phylogeny of species for which traits are present
drops <- SERCphylo$tip.label[!(SERCphylo$tip.label %in% traits$species)]
cleanPhylo <- drop.tip(SERCphylo, drops)

# merge the species with the traits, in the order that they appear in the phylogeny
plotTips <- data.frame('species' = cleanPhylo$tip.label)
plotCols <- merge(plotTips, traits[,c(1,3,4,6)], sort=F)
# make a black/red container
tCols <- c('black', 'red')
# plot the phylogeny, coloring the label black for natives, red for introduced
pT <- plot(cleanPhylo,
           show.tip.label = T,
           cex = 1,
           no.margin = T,
           tip.color = tCols[plotCols$introduced + 1],
           label.offset = 2)
# put a circle at the tip of each leaf
tiplabels(cex = 0.1, pie = plotCols$introduced, piecol = c('red', 'black'))

[Figure: basic R phylogeny, with introduced species in red]

It’s also relatively easy to display trait data alongside it, using two other packages, but then you lose the ability to color species differently and, in all honesty, to customize the phylogeny in any way.


require(adephylo)
require(phylobase)
sercDat <- phylo4d(cleanPhylo, plotCols)
table.phylo4d(sercDat)

[Figure: table.phylo4d plot of the phylogeny with trait data]

Python, on the other hand, can do this all in the ETE2 module. The learning curve is a bit steeper, but in all honesty, once you get it down it’s easy and flexible. For example, here’s how to make the first graph above:


import ete2 as ete
import pandas as pd

# load data
traits = pd.read_csv('/Users/Nate/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv')
SERCphylo = ete.Tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')

#### TRAIT CLEANUP ####
# put an underscore in trait species
traits['species'] = traits['species'].map(lambda x: x.replace(' ', '_'))
# pull out the relevant traits and only keep complete cases
traits = traits[['species', 'introduced', 'woody', 'SLA', 'seedMass', 'toughness']]
traits = traits.dropna()

# next, prune down the traits data
traitsPrune = traits[traits['species'].isin(SERCphylo.get_leaf_names())]

# prune the phylogeny so only species with traits are kept
SERCphylo.prune(traitsPrune['species'], preserve_branch_length = True)

# basic phylogenetic plot
SERCphylo.show()

You can use dictionaries to make a couple of guides that retain the trait info for each species:


# guide for color
cols = [['black', 'red'][x] for x in traitsPrune['introduced']]
colorGuide = dict(zip(traitsPrune['species'], cols))
# weights (scaled to 1)
slaGuide = dict(zip(traitsPrune['species'], traitsPrune['SLA']/traitsPrune['SLA'].max()))
toughGuide = dict(zip(traitsPrune['species'], traitsPrune['toughness']/traitsPrune['toughness'].max()))
seedGuide = dict(zip(traitsPrune['species'], traitsPrune['seedMass']/traitsPrune['seedMass'].max()))

Next, you can use node styles to set the basic tree appearance. For example, ETE2 uses thin lines and puts a circle at every node (i.e. split) by default. We can use the traverse function, which just goes through every single node, and set every node to the same style:


# set the base style of the phylogeny with thick lines
for n in SERCphylo.traverse():
    style = ete.NodeStyle()
    style['hz_line_width'] = 2
    style['vt_line_width'] = 2
    style['size'] = 0
    n.set_style(style)

This code just says “go through every node, make a default style, but change the line width to 2 and the circle size to 0”. The result is that every node has thicker lines and we’ve removed the circle.

We can go through only the final nodes (the leaves) and tell it to strip out the underscore of the species name, paste in on the end of the branch in italic font, and make the font the color specified in the dictionary above (red if introduced, black if native)


def mylayout(node):
    # If node is a leaf, split the name and paste it back together to remove the underscore
    if node.is_leaf():
        temp = node.name.split('_')
        sp = temp[0] + ' ' + temp[1]
        temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')
        ete.faces.add_face_to_node(temp2, node, column = 0)

Then, use the treestyle to make a couple of stylistic changes, telling it to apply the layout function, add in some extra spacing between the tips so the phylogeny is readable, and save


ts = ete.TreeStyle()
ts.mode = 'r'
ts.show_leaf_name = False
ts.layout_fn = mylayout
ts.branch_vertical_margin = 4
#ts.force_topology = True
ts.show_scale = False

SERCphylo.render("Python_base.png", w = 1500, units="px", tree_style = ts)

[Figure: Python_base, the base ETE2 phylogeny]

It took a bit more work than R to get this far, but now comes the awesome part. We’ve already got a function telling Python to paste a red species name at the end of the branches. We can add more features, like, say, a circle scaled by a trait value, simply by adding them to the function. Most of the work is already done. We change the function to:


def mylayout(node):
    # If node is a leaf, split the name and paste it back together to remove the underscore
    if node.is_leaf():
        # species name
        temp = node.name.split('_')
        sp = temp[0] + ' ' + temp[1]
        temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')
        ete.faces.add_face_to_node(temp2, node, column = 0)
        # make a circle for SLA, weighted by SLA values
        sla = ete.CircleFace(radius = slaGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
        sla.margin_left = 10
        sla.hz_align = 1
        ete.faces.add_face_to_node(sla, node, column = 0, position = 'aligned')
        # same with toughness
        toughness = ete.CircleFace(radius = toughGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
        toughness.margin_left = 40
        toughness.hz_align = 1
        ete.faces.add_face_to_node(toughness, node, column = 1, position = 'aligned')

The confusing part is that you first have to make a ‘face’ (ete.CircleFace), giving it a radius proportional to the species’ trait value and a color based on its introduced status. Then, we use the margin property (sla.margin_left) to give it some space away from the other objects, and the align property to center it (sla.hz_align = 1). The final call just tells ETE2 to actually add the ‘face’, which column to put it in, and where to put it (see the ETE2 tutorial for a guide). ‘Aligned’ offsets it from the branch tip so that all circles line up in the same spot (rather than sitting directly at the end of the branch, which would vary); column tells it where to go once it’s in the aligned position. So now there’s a phylogeny with quantitative trait data, still colored properly. And this is a simple example; the graphs can get much better, depending on what you want to do.

[Figure: Python_traits, the ETE2 phylogeny with trait circles]

It took me several hours to get this far, because the syntax is pretty hard to wrap your head around at first. But once you get it, it opens up all kinds of possibilities.


Cleaning Data and Graphing in R and Python

Python has some pretty awesome data-manipulation and graphing capabilities. If you’re a heavy R user who dabbles in Python like me, you might wonder what the equivalent commands are in Python for dataframe manipulation. I was also curious how many lines of code it would take me to do the same task (load, clean, and graph data) in each language. A caveat before the efficiency arguments start: neither my R nor my Python code is the super-efficient, optimal way to program; it is, however, how I do things, and to me that’s what matters. And I’m not trying to advocate one language over the other (programmers can be a sensitive bunch), I just want to show how to do equivalent tasks in each.

First, R

# read Data
JapBeet_NoChoice <- read.csv("~/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")
# drop incomplete data
feeding <- subset(JapBeet_NoChoice, !is.na(Consumption))
# refactor and clean
feeding$Food_Type <- factor(feeding$Food_Type)
feeding$Temperature[which(feeding$Temperature==33)] <- 35

# subset
plants <- c('Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina')
subDat <- feeding[feeding$Food_Type %in% plants, ]

# make a standard error function for plotting
seFunc <- function(x){
 se <- sd(x) / sqrt(sum(!is.na(x)))
 lims <- c(mean(x) + se, mean(x) - se)
 names(lims) <- c('ymin', 'ymax')
 return(lims)
}

# ggplot!
ggplot(subDat, aes(Temperature, Herb_RGR, fill = Food_Type)) +
 stat_summary(geom = 'errorbar', fun.data = 'seFunc', width = 0, aes(color = Food_Type), show_guide = F) +
 stat_summary(geom = 'point', fun.y = 'mean', size = 3, shape = 21) +
 ylab('Mass Change (g)') +
 xlab(expression('Temperature '*degree*C)) +
 scale_fill_discrete(name = 'Plant Species') +
 theme(
 axis.text = element_text(color = 'black', size = 12),
 axis.title = element_text(size = 14),
 axis.ticks = element_line(color = 'black'),
 legend.key = element_blank(),
 legend.title = element_text(size = 12),
 panel.background = element_rect(color = 'black', fill = NA)
 )
[Figure: the resulting ggplot. Snazzy!]

Next, Python!

# imports
import pandas as pd
import numpy as np
import pylab as py

# read data
JapBeet_NoChoice = pd.read_csv("/Users/Nate/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")

# clean up
feeding = JapBeet_NoChoice.dropna(subset = ['Consumption'])
feeding['Temperature'].replace(33, 35, inplace = True)

# subset out the correct plants
keep = ['Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina']
feeding2 = feeding[feeding['Food_Type'].isin(keep)]

# calculate means and SEs
group = feeding2.groupby(['Food_Type', 'Temperature'], as_index = False)
sum_stats = group['Herb_RGR'].agg({'mean' : np.mean, 'SE' : lambda x: x.std() / np.sqrt(x.count())})

# PLOT
for i in range(5):
    py.errorbar(sum_stats[sum_stats['Food_Type'] == keep[i]]['Temperature'],
                sum_stats[sum_stats['Food_Type'] == keep[i]]['mean'],
                yerr = sum_stats[sum_stats['Food_Type'] == keep[i]]['SE'],
                fmt = 'o', ms = 10, capsize = 0, mew = 1, alpha = 0.75,
                label = keep[i])

py.xlabel(u'Temperature (\u00B0C)')
py.ylabel('Mass Change')
py.xlim([18, 37])
py.xticks([20, 25, 30, 35])
py.legend(loc = 'upper left', prop = {'size':10}, fancybox = True, markerscale = 0.7)
py.show()
[Figure: the matplotlib version. Snazzy 2!]

So, roughly the same number of lines (excluding imports of modules and libraries), although Python is slightly more compact. For what it’s worth, I showed these two graphs to a friend and asked which he liked more; he chose Python immediately. Personally, I like them both and find it hard to pick one over the other. The curious can see my much older, waaayyy less efficient, much more hideous version of this graph in my paper, but I warn you: it isn’t pretty, and the code was a nightmare (it was pre-ggplot2 for me, so it was made with R’s base plotting commands, which are a beast for this kind of graph).

PyStan: An Intermediate Tutorial of Bayesian Data Analysis in Python

Last week I gave a brief description of Bayesian data analysis using Python and PyStan. The example was pretty basic, and if you’re like me, you hate when you look up examples to figure out what to do and they’re either so simple (like last week) as to be nearly unhelpful or so advanced they’re unintelligible. So suit up, it’s code time.

[Image: it's code time. And beware of computer raptors.]

Here is my attempt to provide code for a simple Bayesian hierarchical model in PyStan, again all the way from beginning to end. The model is of a caterpillar (Epimecis hortaria) feeding on three host plants across four temperatures, with host plant as a random factor. The model is exponential (not linear), but that’s a relatively small change.

import pandas as pd
import pystan
import pylab as py
import numpy as np
import scipy.stats as st

herbResp = pd.read_csv( '/Users/Nate/Documents/FIU/Research/Temp_HerbResponses/Data/Thermal_Curves/Feeding_Assays.csv' )

herb2 = herbResp[herbResp['Comments'] != 'DEAD']
herb2 = herb2.dropna(subset = ['Herb_RGR'])

# pull out Epimecis Data
epimecis = herb2[herb2['Herbivore_Scientific'] == 'Epimecis hortaria']
epimecis = epimecis[epimecis['MassCorr_Consumption_Daily'] > 0]

# set temperature of 33 to 35
epimecis['Temperature'].replace(33, 35, inplace = True)

y = epimecis['MassCorr_Consumption_Daily']
Temp = (epimecis['Temperature'] - epimecis['Temperature'].mean()) / epimecis['Temperature'].std()
N = len(y)

Plant = pd.factorize(epimecis['Food_Type'])[0] + 1
# STAN MOD
epimecisMOD = """
data{
    int N;
    int Plant[N];
    vector[N] y;
    vector[N] Temp;
}

parameters{
    vector[3] a;
    vector[3] b;
    real<lower = 0> sigma[3];
    real mu_a;
    real mu_b;
    real<lower = 0> sigma_a;
    real<lower = 0> sigma_b;
}

model{
    vector[N] mu;

    for(i in 1:N){
        mu[i] <- exp(a[Plant[i]] + b[Plant[i]] * Temp[i]);
        y[i] ~ normal(mu[i], sigma[Plant[i]]);
    }

    a ~ normal(mu_a, sigma_a);
    b ~ normal(mu_b, sigma_b);
}
"""

dat = {'y': y,
       'Temp': Temp,
       'N': N,
       'Plant': Plant}

fit = pystan.stan(model_code = epimecisMOD, data = dat, chains = 4, iter = 4000, thin = 10)

fit.plot('a')
fit.plot('b')

print fit

# get parameter estimates
a = fit.extract('a', permuted = True)
b = fit.extract('b', permuted = True)

aChains = pd.DataFrame( np.array(a['a']), columns = ['Lindera', 'Liriodendron', 'Sassafras'] )
bChains = pd.DataFrame( np.array(b['b']), columns = ['Lindera', 'Liriodendron', 'Sassafras'] )

bChains.hist(bins = 50, color = 'k', alpha = 0.5)
py.show()

This gives the histogram of parameter estimates for ‘b’:

[Figure: histograms of posterior draws for b]

Next, plot the output (in a basic plot, with no chains or CIs):

# PLOTTING ##
# get the median parameter estimates
aMed = aChains.median()
bMed = bChains.median()

# get the mean and stdev of temperature for conversion
mu = epimecis['Temperature'].mean()
std = epimecis['Temperature'].std()

# get a prediction vector and convert it back to original temperature
xPred = np.linspace(min(Temp), max(Temp), 100)
xTemp = xPred*std + mu

# make the predictions for sassafras and lindera
sassPred = np.exp(aMed[2] + bMed[2]*xPred)
linPred = np.exp(aMed[0] + bMed[0]*xPred)

# make a grouped object and then calculate the mean and SE for each plant-temperature combination
grouped = epimecis.groupby(['Food_Type', 'Temperature'], as_index = False)
summary = grouped['MassCorr_Consumption_Daily'].agg({'mean' : np.mean,
                                                     'sem' : lambda x: x.std() / np.sqrt(x.count())})
plants = ['Lindera benzoin', 'Liriodendron tulipifera', 'Sassafras albidum']

fig = py.figure(figsize = (6.5, 5.5))

for i in range(3):
    py.errorbar(summary[summary['Food_Type'] == plants[i]]['Temperature'],
        summary[summary['Food_Type'] == plants[i]]['mean'],
        yerr = summary[summary['Food_Type'] == plants[i]]['sem'],
        fmt = 'o', ms = 10, capsize = 0, mew = 1,
        label = plants[i])

py.plot(xTemp, linPred, 'b')
py.plot(xTemp, sassPred, 'r')
py.title('Consumption Rates of Epimecis hortaria')
py.ylabel('Mass-Corrected Daily Consumption (g)')
py.xlabel(u"Temperature (\u00B0C)")
py.xticks([20, 25, 30, 35])
py.xlim([18, 37])
py.legend(loc='upper left', prop = {'size':12},
 fancybox=True, shadow=True)
py.savefig('epimecis.pdf', transparent = True, bbox_inches = 'tight')

py.show()

[Figure: fitted consumption curves over the observed means]

And there you have it: a simple hierarchical model from start to an almost-finish (almost because I’d probably add CIs in a real plot).

Stay tuned for a full-on analysis from a paper of mine (eventually), including CIs, overall effects, group-level effects, and group-level predictors (something that isn’t really available in the current STAN examples). Don’t hide it. I know you’re excited.