In Defense of Matplotlib

I’ve been doing some reading, and I’ve discovered that a lot of people don’t like matplotlib. Specifically, it seems that the default settings are a big turn off, and I agree. They are pretty hideous. There are a lot of ongoing projects that attempt to rectify matplotlib, or reinvent Python plotting altogether, including Plotly, CairoPlot, Veusz, prettyplotlibSeaborn (which appears to mimic R’s ggplot2), and ggplot itself (which is working to port over R’s ggplot). Some of these are complete language overhauls (Plotly, CairoPlot, Veusz, ggplot) and others are built on matplotlib (Seaborn, prettyplotlib). Either way, there’s a lot of effort being devoted to replacing or redesigning matplotlib. I understand some of it. The matplotlib language is difficult and it’s default settings are horrendous. It takes a lot of tweaking to get to something workable. That being said, matplotlib is so infinitely customizable so that it is capable of making some pretty awesome graphs.

Yes, it takes a bit of work, but because matplotlib is so infinitely customizable, you can make matplotlib graphs look absolutely fantastic. Here are some of my favorites that I’ve made:

Figure2

mtdOrigin_panel

miami_pretty_70

np_Regression_withVar2

 

similarity_Small_Yearly

Figure4

I’m proud of the panel plots, in particular. Using a for() loop and some general programming, I can make that panel/lattice plot in about 28 lines of code. It takes me roughly the same number of lines to make both panel plots presented above.

So, although it takes some work, I really see nothing wrong with matplotlib. It works very well, it’s mature, it is more flexible than some of the other modules, and can make some graphs that look pretty outstanding.

That said, I’m still excited for ggplot to be finished. The ability to calculate statistics within the plotting framework (as in the stat_summary() function) and the ease of lattice plots have always appealed to me. Plus, the grammer of graphics language makes a lot of sense and is more intuitive than matplotlib.

 

 

Phylogenies in R and Python

One of the reasons I switched to Python from R is because Python’s phylogenetic capabilities are very well developed, but R is catching up. I’m moving into phylogenetic community ecology, which requires a lot of tree manipulation and calculation of metrics and not so much actual tree construction. Python is excellent at these things and has an excellent module called ETE2. R has a few excellent packages as well, including ape and picante.

I can’t compare and contrast all of the features of R and Python’s phylogenetic capabilities. But since I like making pretty pictures, I thought I’d demonstrate how to plot in both R and Python. I’ll say that making a basic plot is pretty simple in both languages. More complex plots are.. well, more complex. I find that the language of ETE2 is more full featured and better, but it had a pretty steep learning curve. Once you get the hang of it, though, there is nothing you can’t do. More or less.

R’s phylogenetic plotting capabilities are good, but limited when it comes to displaying quantitative data along side it. For example, it’s relatively easy to make a phylogeny where native and introduced species have different colors:


require(picante)

SERCphylo <- read.tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')

# species cover
fullSpData <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/master_sp_data.csv")
# phylogeny
SERCphylo <- read.tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')
# traits
plantTraits <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv")

# Put an underscore in the species names to match with the phylogeny
plantTraits$species <- gsub(' ', '_', plantTraits$species)

#Isolate complete cases of traits
traits <- subset(plantTraits, select = c('species', 'woody', 'introduced', 'SLA', 'seedMass', 'toughness'))
traits <- traits[complete.cases(traits), ]

# Make a phylogeny of species for which traits are present
drops <- SERCphylo$tip.label[!(SERCphylo$tip.label %in% traits$species)]
cleanPhylo <- drop.tip(SERCphylo, drops)

# merge the species with the traits, in the order that they appear in the phylogeny
plotTips <- data.frame('species' = cleanPhylo$tip.label)
plotCols <- merge(plotTips, traits[,c(1,3,4,6)], sort=F)
# make a black/red container
tCols <- c('black', 'red')
# plot the phylogeny, coloring the label black for natives, red for introduced
pT <- plot(cleanPhylo,
show.tip.label = T,
cex = 1,
no.margin = T,
tip.color = tCols[plotCols$introduced + 1],
label.offset = 2)
# put a circle at the tip of each leaf
tiplabels(cex = 0.1, pie = plotCols$introduced, piecol = c('red', 'black'))

Basic R phylogeny

Basic R phylogeny

It’s also relatively easy to display trait data alongside it, using another two other packages, but then you lose the ability to color species differently and, in all honesty, to customize the phylogeny in any way.


require(adephylo)
require(phylobase)
sercDat <- phylo4d(cleanPhylo, plotCols)
table.phylo4d(sercDat)

rTraits

 

 

Python, on the other hand, can do this all in the ETE2 module. The learning curve is a bit steeper, but in all honesty, once you get it down it’s easy and flexible. For example, here’s how to make the first graph above:


import ete2 as ete
import pandas as pd

# load data
traits = pd.read_csv('/Users/Nate/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv')
SERCphylo = ete.Tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')

#### TRAIT CLEANUP ####
# put an underscore in trait species
traits['species'] = traits['species'].map(lambda x: x.replace(' ', '_'))
# pull out the relevant traits and only keep complete cases
traits = traits[['species', 'introduced', 'woody', 'SLA', 'seedMass', 'toughness']]
traits = traits.dropna()

# next, prune down the traits data
traitsPrune = traits[traits['species'].isin(SERCphylo.get_leaf_names())]

# prune the phylogeny so only species with traits are kept
SERCphylo.prune(traitsPrune['species'], preserve_branch_length = True)

# basic phylogenetic plot
SERCphylo.show()

You can use dictionaries to make a couple of guides that retain the trait info for each species


# guide for color
cols = [['black', 'red'][x] for x in traitsPrune['introduced']]
colorGuide = dict(zip(traitsPrune['species'], cols))
# weights (scaled to 1)
slaGuide = dict(zip(traitsPrune['species'], traitsPrune['SLA']/traitsPrune['SLA'].max()))
toughGuide = dict(zip(traitsPrune['species'], traitsPrune['toughness']/traitsPrune['toughness'].max()))
seedGuide = dict(zip(traitsPrune['species'], traitsPrune['seedMass']/traitsPrune['seedMass'].max()))

Next, you can use node styles to set the basic tree appearance. For example, ETE2 uses thin lines and puts a circle at every node (i.e. split) by default. We can use the traverse function, which just goes through every single node, and set every node to the same style:


# set the base style of the phylogeny with thick lines
for n in SERCphylo.traverse():
style = ete.NodeStyle()
style['hz_line_width'] = 2
style['vt_line_width'] = 2
style['size'] = 0
n.set_style(style)

This code just says “go through every node, make a default style, but change the line width to 2 and the circle size to 0″. The result is that every node has thicker lines and we’ve removed the circle.

We can go through only the final nodes (the leaves) and tell it to strip out the underscore of the species name, paste in on the end of the branch in italic font, and make the font the color specified in the dictionary above (red if introduced, black if native)


def mylayout(node):
# If node is a leaf, split the name and paste it back together to remove the underscore
if node.is_leaf():
temp = node.name.split('_')
sp = temp[0] + ' ' + temp[1]
temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')

Then, use the treestyle to make a couple of stylistic changes, telling it to apply the layout function, add in some extra spacing between the tips so the phylogeny is readable, and save


ts = ete.TreeStyle()
ts.mode = 'r'
ts.show_leaf_name = False
ts.layout_fn = mylayout
ts.branch_vertical_margin = 4
#ts.force_topology = True
ts.show_scale = False

SERCphylo.render("Python_base.png", w = 1500, units="px", tree_style = ts)

Python_baseIt took a bit more work than R to get this far, but now is the awesome part. We’ve already got a function telling Python to paste a red species name at the end of the branches. We can add in more features, like.. say.. a circle that’s scaled by a trait value by simply adding that to the function. Most of the work is already done. We change the function to:


def mylayout(node):
# If node is a leaf, split the name and paste it back together to remove the underscore
if node.is_leaf():
# species name
temp = node.name.split('_')
sp = temp[0] + ' ' + temp[1]
temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')
ete.faces.add_face_to_node(temp2, node, column=0)
# make a circle for SLA, weighted by SLA values
sla = ete.CircleFace(radius = slaGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
sla.margin_left = 10
sla.hz_align = 1
ete.faces.add_face_to_node(sla, node, column = 0, position = 'aligned')
# same with toughness
toughness = ete.CircleFace(radius = toughGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
toughness.margin_left = 40
toughness.hz_align = 1
ete.faces.add_face_to_node(toughness, node, column = 1, position = 'aligned')

The confusing part is that you first have to make a ‘face’ (ete.CircleFace), giving it a radius proportional to the species trait value and color based on its introduced status. Then, we use the margin property (sla.margin_left) to give it some space away from the other objects. Next, use the align property to make it centered (sla.hz_align = 1). The final call is just telling it to actually add the ‘face’, which column to put it in, and where to put it (see the ETE2 tutorial for a guide). Aligned tells it to put it offset from the branch tip so that all circles are in the same spot (rather than being directly at the end of the branch, which could vary). Column just tells it where to put it, once it’s in the aligned position. So now there’s a phylogeny with quantitative trait data, still colored properly. And this is a simple example. The graphs can get much better, depending on what you want to do.

Python_traits

Took me several hours to get this far, because the language is pretty hard to wrap your head around at first. But once you get it, it sets off all kinds of possibilities.

 

 

PyStan: A Second Intermediate Tutorial of Bayesian Analysis in Python

I promised a while ago that I’d give a more advanced tutorial of using PySTAN and Python to fit a Bayesian hierarchical model. Well, I’ve been waiting for a while because the paper was in review and then in print. Now, it’s out and I’m super excited! My first pure Python paper, using Python for all data manipulation, analysis, and plotting.

The question was whether temperature affects herbivory by insects in any predictable way. I gathered as many insect species as I could and fed them whatever they ate at multiple temperatures. Check the article for more detail, but the idea was to fit a curve to all 21 herbivore-plant pairs as well as to estimate the overall effect of temperature. We also suspected (incorrectly as it turns out) that plant nutritional quality might be a good predictor of the shape of these curves, so we included that as a group-level predictor.

Anyway, here’s the code, complete with STAN model, posterior manipulations, and some plotting. First, here’s the actual STAN model. NOTE: a lot of data manipulation and whatnot is missing. The point is not to show that but to describe how to fit a STAN model and work with the output. Anyone who wants the full code and data to work with can find it on my website or in Dryad (see the article for a link).


stanMod = """
data{
int<lower = 1> N;
int<lower = 1> J;
vector[N] Temp;
vector[N] y;
int<lower = 1> Curve[N];
vector[J] pctN;
vector[J] pctP;
vector[J] pctH20;
matrix[3, 3] R;
}

parameters{
vector[3] beta[J];
real mu_a;
real mu_b;
real mu_c;
real g_a1;
real g_a2;
real g_a3;
real g_b1;
real g_b2;
real g_b3;
real g_c1;
real g_c2;
real g_c3;
real<lower = 0> sigma[J];
cov_matrix[3] Tau;
}

transformed parameters{
vector[N] y_hat;
vector[N] sd_y;
vector[3] beta_hat[J];
// First, get the predicted value as an exponential curve
// Also make a dummy variable for SD so it can be vectorized
for (n in 1:N){
y_hat[n] <- exp( beta[Curve[n], 1] + beta[Curve[n], 2]*Temp[n] + beta[Curve[n], 3]*pow(Temp[n], 2) );
sd_y[n] <- sigma[Curve[n]];
}
// Next, for each group-level coefficient, include the group-level predictors
for (j in 1:J){
beta_hat[j, 1] <- mu_a + g_a1*pctN[j] + g_a2*pctP[j] + g_a3*pctH20[j];
beta_hat[j, 2] <- mu_b + g_b1*pctN[j] + g_b2*pctP[j] + g_b3*pctH20[j];
beta_hat[j, 3] <- mu_c + g_c1*pctN[j] + g_c2*pctP[j] + g_c3*pctH20[j];
}
}

model{
y ~ normal(y_hat, sd_y);
for (j in 1:J){
beta[j] ~ multi_normal_prec(beta_hat[j], Tau);
}
// PRIORS
mu_a ~ normal(0, 1);
mu_b ~ normal(0, 1);
mu_c ~ normal(0, 1);
g_a1 ~ normal(0, 1);
g_a2 ~ normal(0, 1);
g_a3 ~ normal(0, 1);
g_b1 ~ normal(0, 1);
g_b2 ~ normal(0, 1);
g_b3 ~ normal(0, 1);
g_c1 ~ normal(0, 1);
g_c2 ~ normal(0, 1);
g_c3 ~ normal(0, 1);
sigma ~ uniform(0, 100);
Tau ~ wishart(4, R);
}
"""

# fit the model!
fit = pystan.stan(model_code=stanMod, data=dat,
 iter=10000, chains=4, thin = 20)

Not so bad, was it? It’s actually pretty straightforward.

After the model has been run, we work with the output. We can check traceplots of various parameters:

fit.plot(['mu_a', 'mu_b', 'mu_c'])
fit.plot(['g_a1', 'g_a2', 'g_a3'])
fit.plot(['g_b1', 'g_b2', 'g_b3'])
fit.plot(['g_c1', 'g_c2', 'g_c3'])
py.show()

As a brief example, we can extract the overall coefficients and plot them:


mus = fit.extract(['mu_a', 'mu_b', 'mu_c'])
mus = pd.DataFrame({'Intercept' : mus['mu_a'], 'Linear' : mus['mu_b'], 'Quadratic' : mus['mu_c']})

py.plot(mus.median(), range(3), 'ko', ms = 10)
py.hlines(range(3), mus.quantile(0.025), mus.quantile(0.975), 'k')
py.hlines(range(3), mus.quantile(0.1), mus.quantile(0.9), 'k', linewidth = 3)
py.axvline(0, linestyle = 'dashed', color = 'k')
py.xlabel('Median Coefficient Estimate (80 and 95% CI)')
py.yticks(range(3), ['Intercept', 'Exponential', 'Gaussian'])
py.ylim([-0.5, 2.5])
py.title('Overall Coefficients')
py.gca().invert_yaxis()
py.show()

The resulting plot:

Figure3

We can also make a prediction line with confidence intervals:

#first, define a prediction function
def predFunc(x, v = 1):
 yhat = np.exp( x[0] + x[1]*xPred + v*x[2]*xPred**2 )
 return pd.Series({'yhat' : yhat})

# next, define a function to return the quantiles at each predicted value
def quantGet(data , q):
 quant = []
 for i in range(len(xPred)):
 val = []
 for j in range(len(data)):
 val.append( data[j][i] )
 quant.append( np.percentile(val, q) )
 return quant

# make a vector of temperatures to predict (and convert to the real temperature scale)
xPred = np.linspace(feeding_Final['Temp_Scale'].min(), feeding_Final['Temp_Scale'].max(), 100)
realTemp = xPred * feeding_Final['Temperature'].std() + feeding_Final['Temperature'].mean()

# make predictions for every chain (in overall effects)
ovPred = mus.apply(predFunc, axis = 1)

# get lower and upper quantiles
ovLower = quantGet(ovPred['yhat'], 2.5)
ovLower80 = quantGet(ovPred['yhat'], 10)
ovUpper80 = quantGet(ovPred['yhat'], 90)
ovUpper = quantGet(ovPred['yhat'], 97.5)

# get median predictions
ovPred = predFunc(mus.median())

Then, just plot the median (ovPred) and the quantiles against temperature (realTemp). With just a little effort, you can wind up with something that looks pretty good:

Figure2

I apologize for only posting part of the code, but the full script is really long. This should serve as a pretty good start for anyone looking to use Python as their Bayesian platform of choice. Anyone interested can get the data and full script from my article or website and give it a try! It’s all publicly available.

Car Talk Puzzler Solution: Man Growing A Lawn

I love car talk. My radio station in Miami plays reruns every Saturday (which I just learned were reruns, since I just learned they no longer produce new shows). Probably my favorite part of Car Talk is the puzzler, which is usually some math-based word problem or some word-based math problem. Last weekend’s puzzler was a great one. The short version goes like this:

“A man has a dirt yard. On June 1, he goes to a garden store and tries to figure out how he can get a nice, lush lawn in time for his July 4 party. After discussing his options, the store clerk suggests a new technique: a plug of grass that doubles every day. The clerk did the calculations and figured out that, if the man were to buy one plug, his lawn would be covered by June 30 (in 30 days). The man thinks that he would be cutting it too close, so he buys two plugs. How many days does it take for his yard to get covered?”

As a biologist/ecologist, I recognized this immediately. This is the problem of exponential population growth! You can write down a really simple equation. Suppose that today is day 1 (t1) and tomorrow is day 2 (t2). The population of tomorrow is twice that of today:

latex-image-1

The lawn size on day 3 (t3) is twice that of the lawn size on day 2:

latex-image-2

and so on and so on until you get to t(30). However, writing all of these calculations is time consuming and kind of a pain. So you can consolidate. If t(2) is twice that of t(1), and we know that t(1) is twice that of t(0), we can substitute in:

latex-image-3

which gives a general equation:

latex-image-4

where the population on the ith day is simply 2 raised to (i – 1) multiplied by the initial population size. So we can quickly calculate the lawn size on day 30:

latex-image-5

That’s the first bit of information we need. But the man started with two plugs (to be safe). We need to know how many days it takes to reach that same size:

latex-image-6

To solve for (i-1), we can use the awesomeness of (natural) logarithms:

latex-image-7

The Car Talk guys used nice numbers that led to nice integer solutions. The neat part is that, by starting with two plugs “for insurance” and spending twice as much, the man only saved himself a single day. The proof of that is in this plot:

figure_1

The two plug line (blue) reaches the final size of the one plug line (green) only one day before

This puzzler was easy for me because I recognized the word problem immediately as the exponential growth of populations, something I’m fairly familiar with.