Multivariate Techniques in Python: EcoPy Alpha Launch!

I’m announcing the alpha launch of EcoPy: Ecological Data Analysis in Python. EcoPy is a Python module that contains a number of  techniques (PCA, CA, CCorA, nMDS, MDS, RDA, etc.) for exploring complex multivariate data. For those of you familiar with R, think of this as the Python equivalent to the ‘vegan‘ package.

However, I’m not done! This is the alpha launch, which means you should exercise caution before using this package for research. I’ve stress-tested a couple of simple examples to ensure I get equivalent output in numerous programs, but I haven’t tested it out with real, messy data yet. There might be broken bits or quirks I don’t know about. For the moment, be sure to verify your results with other software.

That said, I need help! My coding might be sloppy, or you might find errors that I didn’t, or you might suggest improvements to the interface. If you want to contribute, either by helping with coding or stress-testing on real-world examples, let me know. The module is available on github and full documentation is available at readthedocs.org.

An intuitive explanation for the ‘double-zeroes’ problem with Euclidean distances

First, some background. Given a multivariate dataset with a large number of descriptor variables (i.e. columns in the matrix), ecologists (and others) often try to distill all of the descriptors into a single metric describing the relatedness of the objects in the matrix (i.e. rows). They usually do this by calculating one of many ‘distance’, ‘similarity’, or ‘dissimilarity’ metrics, all of which have various properties. Commonly in ecology, this is done for site x species matrices, where ecologists attempt to describe how sites are related to one another based on community composition. By far the most common is Euclidean distance. It follows from the Pythagorean theorem. Suppose we have two sites, or rows, called ‘1’ and ‘2’, because I’m feeling creative. Then site 1 is a vector \mathbf{x_1} with one entry per species, same for site ‘2’ \mathbf{x_2}. The euclidean distance is the sum of squared differences between the two sites:

\sqrt{ \sum_1^n (x_{1n} - x_{2n})^2 }

or in vector notation:

\sqrt{ (\mathbf{x_1} - \mathbf{x_2})'(\mathbf{x_1}-\mathbf{x_2}) }

We square so far?

The common criticism of Euclidean distances is that it ‘counts double zeros’, so that species absent from both sites actually lead to sites being more similar than otherwise. A number of other metrics, like the chord distance, don’t have this problem. The chord distance is the Euclidean distance of normalized vectors. Define \mathbf{n_1} as the normalized vector of Site 1 \mathbf{x_1} and the same for Site 2.

\mathbf{n_1} = \frac{\mathbf{x_1}}{\sqrt{\mathbf{x_1'x_1}}}

and so on for Site 2. Then, the chord distance is identical to the Euclidean distance above:

\sqrt{ (\mathbf{n_1} - \mathbf{n_2})'(\mathbf{n_1} - \mathbf{n_2}) }

The question I’ve always had is this.. how can the Euclidean distance count double zeroes while the chord distance, which is Euclidean does not? The answer is that neither of them do. You can add as many double zeroes to either vector and the distance does not change. For example, imagine two sites with three species \mathbf{x_1} = [0, 4, 8] and \mathbf{x_2} = [0, 1, 1]. The Euclidean distance for these two sites is 7.6158. The chord distance for these sites is 0.3203. Now, let’s tack on 5 zeroes to each site (5 double zeroes). Amazingly, both the Euclidean and chord distances are unchanged. This is because the zeros cancel out (0-0)^2 = 0, so they contribute nothing to the distance. This is the same rationale that Legendre and Legendre give in Numerical Ecology for why double zeroes do not contribute to chi-square metrics, yet the same applies for Euclidean distances.

So what’s the deal with Euclidean distance and double zeroes? Obviously the zeroes cancel, just as in other metrics. The issue comes up when you use Euclidean distances on raw abundances and attempt to make inference about species composition, which leads to the so-called paradox of Euclidean distances. Let’s take the example matrix:

\begin{bmatrix} 0 & 4 & 8 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}

Sites 1 and 2 share two species in common, while Site 3 is all by its one-sies. If you calculate the Euclidean distances between these sites, you get:

\begin{bmatrix} 0 & 7.62 & 9 \\ 7.62 & 0 & 1.73 \\ 9 & 1.73 & 0 \end{bmatrix}

Sites 2 and 3 are more similar than Sites 1 and 2, even though Site 3 shares no species in common!  Let’s try it on the chord distances. Doing that, we get:

\begin{bmatrix} 0 & 0.32 & 1.41 \\ 0.32 & 0 & 1.41 \\ 1.41 & 1.41 & 0 \end{bmatrix}

That’s better. Now Site 3 is equally distant from both Sites 1 and 2 since it shares no species in common with either of them. So what the hell? This is why it’s termed a paradox. But if I’ve learned anything by watching the iTunes U lecture of Harvard Stats 110 (Thanks Joel!), it’s that anything called a paradox just means you haven’t thought about it long enough. Here’s a hint: the answer isn’t that Euclidean distance counts double zeroes while Chord does not, as shown above. Especially since Chord is Euclidean, it uses the exact same equation.

The answer is actually much simpler, and non-mathy. Euclidean distances on raw abundance values place a premium on differences in the number of individuals, not species. So it’s actually getting it right. Sites 2 and 3 have 2 and 1 individuals total, respectively. When you take the difference, you’re basically counting up the number of individuals the sites do not share. In that case, it happens to be that Sites 2 and 3 only have three individuals that differ between them. Sites 1 and 3 have 13 individuals that differ between them, and Sites 1 and 2 have 10 individuals that differ between them. So by this math, Sites 2 and 3 actually should be really similar.

Chord distances (and \chi^2 distances, and others) standardize the data, taking differences in total abundances out of the equation. Instead, it compares how individuals are distributed across species. Since all of Sites 3 is in the first species, and Sites 1 and 2 distributed their individuals in the second and third species, obviously Sites 1 and 2 will be more similar. This is why McCune and Grace even say that Euclidean distances on relativized species abundances is OK. If you want to compare species composition using Euclidean distances, you need to first take differences in abundances out of the question. All of the other ‘non-zero-counting’ distances more or less do the same thing.

If your question is how sites vary in both abundance AND species composition, then Euclidean distance is probably OK. Just don’t use PCA on species abundances. Ever.

By the way, the iTunes U Harvard Stats 110 series is awesome, and Joel Blitzstein is a great lecturer. Totally worth the time to watch all the lectures. And its free.

Python code for the above is here:


import numpy as np

x1 = np.array([0, 4, 8])
x2 = np.array([0, 1, 1])
Euc_D = np.sqrt( (x1-x2).dot(x1-x2) )

n1 = x1/np.sqrt( x1.dot(x1) )
n2 = n2/np.sqrt( x2.dot(x2) )
Chord_D = np.sqrt( (n1-n2).dot(n1-n2) )

x1_2 = np.append(x1, np.zeros(5))
x2_2 = np.append(x2, np.zeros(5))
Euc_D2 = np.sqrt( (x1_2 - x2_2).dot(x1_2 - x2_2) )

n1_2 = x1_2 / np.sqrt(x1_2.dot(x1_2))
n2_2 = x2_2 / np.sqrt(x2_2.dot(x2_2) )
Chord_D2 = np.sqrt((n1_2 - n2_2).dot(n1_2 - n2_2))

x3 = np.array([1, 0, 0])
Sites = np.array([x1, x2, x3])
Euc_M = np.zeros([3, 3])
for i in xrange(3):
    for j in xrange(3):
        Euc_M[i,j] = np.sqrt((Sites[i,:] - Sites[j,:]).dot( Sites[i,:] - Sites[j,:] ) )

Chord_Sites = np.apply_along_axis(lambda x: x/np.sqrt(x.dot(x)), 1, Sites )
for i in xrange(3):
    for j in xrange(3):
        Chord_M[i,j] = np.sqrt( (Chord_Sites[i,:] - Chord_Sites[j,:]).dot( Chord_Sites[i,:] - Chord_Sites[j,:] ) )

Phylogenies in R and Python

One of the reasons I switched to Python from R is because Python’s phylogenetic capabilities are very well developed, but R is catching up. I’m moving into phylogenetic community ecology, which requires a lot of tree manipulation and calculation of metrics and not so much actual tree construction. Python is excellent at these things and has an excellent module called ETE2. R has a few excellent packages as well, including ape and picante.

I can’t compare and contrast all of the features of R and Python’s phylogenetic capabilities. But since I like making pretty pictures, I thought I’d demonstrate how to plot in both R and Python. I’ll say that making a basic plot is pretty simple in both languages. More complex plots are.. well, more complex. I find that the language of ETE2 is more full featured and better, but it had a pretty steep learning curve. Once you get the hang of it, though, there is nothing you can’t do. More or less.

R’s phylogenetic plotting capabilities are good, but limited when it comes to displaying quantitative data along side it. For example, it’s relatively easy to make a phylogeny where native and introduced species have different colors:


require(picante)

SERCphylo <- read.tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')

# species cover
fullSpData <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/master_sp_data.csv")
# phylogeny
SERCphylo <- read.tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')
# traits
plantTraits <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv")

# Put an underscore in the species names to match with the phylogeny
plantTraits$species <- gsub(' ', '_', plantTraits$species)

#Isolate complete cases of traits
traits <- subset(plantTraits, select = c('species', 'woody', 'introduced', 'SLA', 'seedMass', 'toughness'))
traits <- traits[complete.cases(traits), ]

# Make a phylogeny of species for which traits are present
drops <- SERCphylo$tip.label[!(SERCphylo$tip.label %in% traits$species)]
cleanPhylo <- drop.tip(SERCphylo, drops)

# merge the species with the traits, in the order that they appear in the phylogeny
plotTips <- data.frame('species' = cleanPhylo$tip.label)
plotCols <- merge(plotTips, traits[,c(1,3,4,6)], sort=F)
# make a black/red container
tCols <- c('black', 'red')
# plot the phylogeny, coloring the label black for natives, red for introduced
pT <- plot(cleanPhylo,
show.tip.label = T,
cex = 1,
no.margin = T,
tip.color = tCols[plotCols$introduced + 1],
label.offset = 2)
# put a circle at the tip of each leaf
tiplabels(cex = 0.1, pie = plotCols$introduced, piecol = c('red', 'black'))

Basic R phylogeny

Basic R phylogeny

It’s also relatively easy to display trait data alongside it, using another two other packages, but then you lose the ability to color species differently and, in all honesty, to customize the phylogeny in any way.


require(adephylo)
require(phylobase)
sercDat <- phylo4d(cleanPhylo, plotCols)
table.phylo4d(sercDat)

rTraits

 

 

Python, on the other hand, can do this all in the ETE2 module. The learning curve is a bit steeper, but in all honesty, once you get it down it’s easy and flexible. For example, here’s how to make the first graph above:


import ete2 as ete
import pandas as pd

# load data
traits = pd.read_csv('/Users/Nate/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv')
SERCphylo = ete.Tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')

#### TRAIT CLEANUP ####
# put an underscore in trait species
traits['species'] = traits['species'].map(lambda x: x.replace(' ', '_'))
# pull out the relevant traits and only keep complete cases
traits = traits[['species', 'introduced', 'woody', 'SLA', 'seedMass', 'toughness']]
traits = traits.dropna()

# next, prune down the traits data
traitsPrune = traits[traits['species'].isin(SERCphylo.get_leaf_names())]

# prune the phylogeny so only species with traits are kept
SERCphylo.prune(traitsPrune['species'], preserve_branch_length = True)

# basic phylogenetic plot
SERCphylo.show()

You can use dictionaries to make a couple of guides that retain the trait info for each species


# guide for color
cols = [['black', 'red'][x] for x in traitsPrune['introduced']]
colorGuide = dict(zip(traitsPrune['species'], cols))
# weights (scaled to 1)
slaGuide = dict(zip(traitsPrune['species'], traitsPrune['SLA']/traitsPrune['SLA'].max()))
toughGuide = dict(zip(traitsPrune['species'], traitsPrune['toughness']/traitsPrune['toughness'].max()))
seedGuide = dict(zip(traitsPrune['species'], traitsPrune['seedMass']/traitsPrune['seedMass'].max()))

Next, you can use node styles to set the basic tree appearance. For example, ETE2 uses thin lines and puts a circle at every node (i.e. split) by default. We can use the traverse function, which just goes through every single node, and set every node to the same style:


# set the base style of the phylogeny with thick lines
for n in SERCphylo.traverse():
style = ete.NodeStyle()
style['hz_line_width'] = 2
style['vt_line_width'] = 2
style['size'] = 0
n.set_style(style)

This code just says “go through every node, make a default style, but change the line width to 2 and the circle size to 0”. The result is that every node has thicker lines and we’ve removed the circle.

We can go through only the final nodes (the leaves) and tell it to strip out the underscore of the species name, paste in on the end of the branch in italic font, and make the font the color specified in the dictionary above (red if introduced, black if native)


def mylayout(node):
# If node is a leaf, split the name and paste it back together to remove the underscore
if node.is_leaf():
temp = node.name.split('_')
sp = temp[0] + ' ' + temp[1]
temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')

Then, use the treestyle to make a couple of stylistic changes, telling it to apply the layout function, add in some extra spacing between the tips so the phylogeny is readable, and save


ts = ete.TreeStyle()
ts.mode = 'r'
ts.show_leaf_name = False
ts.layout_fn = mylayout
ts.branch_vertical_margin = 4
#ts.force_topology = True
ts.show_scale = False

SERCphylo.render("Python_base.png", w = 1500, units="px", tree_style = ts)

Python_baseIt took a bit more work than R to get this far, but now is the awesome part. We’ve already got a function telling Python to paste a red species name at the end of the branches. We can add in more features, like.. say.. a circle that’s scaled by a trait value by simply adding that to the function. Most of the work is already done. We change the function to:


def mylayout(node):
# If node is a leaf, split the name and paste it back together to remove the underscore
if node.is_leaf():
# species name
temp = node.name.split('_')
sp = temp[0] + ' ' + temp[1]
temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')
ete.faces.add_face_to_node(temp2, node, column=0)
# make a circle for SLA, weighted by SLA values
sla = ete.CircleFace(radius = slaGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
sla.margin_left = 10
sla.hz_align = 1
ete.faces.add_face_to_node(sla, node, column = 0, position = 'aligned')
# same with toughness
toughness = ete.CircleFace(radius = toughGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
toughness.margin_left = 40
toughness.hz_align = 1
ete.faces.add_face_to_node(toughness, node, column = 1, position = 'aligned')

The confusing part is that you first have to make a ‘face’ (ete.CircleFace), giving it a radius proportional to the species trait value and color based on its introduced status. Then, we use the margin property (sla.margin_left) to give it some space away from the other objects. Next, use the align property to make it centered (sla.hz_align = 1). The final call is just telling it to actually add the ‘face’, which column to put it in, and where to put it (see the ETE2 tutorial for a guide). Aligned tells it to put it offset from the branch tip so that all circles are in the same spot (rather than being directly at the end of the branch, which could vary). Column just tells it where to put it, once it’s in the aligned position. So now there’s a phylogeny with quantitative trait data, still colored properly. And this is a simple example. The graphs can get much better, depending on what you want to do.

Python_traits

Took me several hours to get this far, because the language is pretty hard to wrap your head around at first. But once you get it, it sets off all kinds of possibilities.

 

 

Cleaning Data and Graphing in R and Python

Python has some pretty awesome data-manipulation and graphing capabilities. If you’re a heavy R-user who dabbles in Python like me, you might wonder what the equivalent commands are in Python for dataframe manipulation. Additionally, I was curious to see how many lines of code it took me to do that same task (load, clean, and graph data) in both R and Python. (I’d like to stop the arguments about efficiency and which language is better than which here, because neither my R nor Python code are the super-efficient, optimal programming methods. They are, however, how I do things. So to me, that’s what matters. Also, I’m not trying to advocate one language over the other (programmers can be a sensitive bunch), I just wanted to post an example showing how to do equivalent tasks in each language).

First, R

# read Data
JapBeet_NoChoice <- read.csv("~/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")
# drop incomplete data
feeding <- subset(JapBeet_NoChoice, Consumption!='NA')
# refactor and clean
feeding$Food_Type <- factor(feeding$Food_Type)
feeding$Temperature[which(feeding$Temperature==33)] <- 35

# subset
plants <- c('Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina')
subDat <- feeding[feeding$Food_Type %in% plants, ]

# make a standard error function for plotting
seFunc <- function(x){
 se <- sd(x) / sqrt(sum(!is.na(x)))
 lims <- c(mean(x) + se, mean(x) - se)
 names(lims) <- c('ymin', 'ymax')
 return(lims)
}

# ggplot!
ggplot(subDat, aes(Temperature, Herb_RGR, fill = Food_Type)) +
 stat_summary(geom = 'errorbar', fun.data = 'seFunc', width = 0, aes(color = Food_Type), show_guide = F) +
 stat_summary(geom = 'point', fun.y = 'mean', size = 3, shape = 21) +
 ylab('Mass Change (g)') +
 xlab(expression('Temperature '*degree*C)) +
 scale_fill_discrete(name = 'Plant Species') +
 theme(
 axis.text = element_text(color = 'black', size = 12),
 axis.title = element_text(size = 14),
 axis.ticks = element_line(color = 'black'),
 legend.key = element_blank(),
 legend.title = element_text(size = 12),
 panel.background = element_rect(color = 'black', fill = NA)
 )
Snazzy!

Snazzy!

Next, Python!

# read data
JapBeet_NoChoice = pd.read_csv("/Users/Nate/Documents/FIU/Research/JapBeetle_Temp_Herbivory/Data/No_Choice_Assays/JapBeet_NoChoice.csv")

# clean up
feeding = JapBeet_NoChoice.dropna(subset = ['Consumption'])
feeding['Temperature'].replace(33, 35, inplace = True)

# subset out the correct plants
keep = ['Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina']
feeding2 = feeding[feeding['Food_Type'].isin(keep)]

# calculate means and SEs
group = feeding2.groupby(['Food_Type', 'Temperature'], as_index = False)
sum_stats = group['Herb_RGR'].agg({'mean' : np.mean, 'SE' : lambda x: x.std() / np.sqrt(x.count())})

# PLOT
for i in range(5):
    py.errorbar(sum_stats[sum_stats['Food_Type'] == keep[i]]['Temperature'],
                sum_stats[sum_stats['Food_Type'] == keep[i]]['mean'],
                yerr = sum_stats[sum_stats['Food_Type'] == keep[i]]['SE'],
                fmt = 'o', ms = 10, capsize = 0, mew = 1, alpha = 0.75,
                label = keep[i])

py.xlabel(u'Temperature (\u00B0C)')
py.ylabel('Mass Change')
py.xlim([18, 37])
py.xticks([20, 25, 30, 35])
py.legend(loc = 'upper left', prop = {'size':10}, fancybox = True, markerscale = 0.7)
py.show()
Snazzy 2!

Snazzy 2!

So, roughly the same number of lines (excluding importing of modules and libraries) although a bit more efficient in Python (barely). For what it’s worth, I showed these two graphs to a friend and asked him which he liked more, he chose Python immediately. Personally, I like them both. It’s hard for me to pick one over the other. I think they’re both great. The curious can see much my older, waaayyy less efficient, much more hideous version of this graph in my paper, but I warn you.. it isn’t pretty. And the code was a nightmare (it was pre-ggplot2 for me, so it was made with R’s base plotting commands which are a beast for this kind of graph).

ggplot2 in Python: A major barrier broken

I have been working with Python recently and I have to say, I love it. There’s a learning curve, of course, which has been frustrating. However, once I got comfortable with it (and continue to do so), I found that working with dataframes in Python (via pandas) is easy, fast, and can do some awesome things pretty simple manner. That’s a subject for another time.

One other thing (among many) that I like about Python is matplotlib, the plotting module. It makes great plots in very few lines of code. It’s default settings are great, so it doesn’t take a whole lot of tweaking to make a publishable plot. It is also very intuitive.

There are two major barriers that stop me from adopting Python all out. First is the infancy of its stats packages, and since I do mostly data analysis as opposed to modeling this is a problem. Python has a large number of basic stats modules, including some that allow a formula interface akin to R. However, it lacks things like statistical non-linear regression (beyond simple curve fitting) and mixed effects models. Given the rate of development in Python, I doubt that this problem will last much longer. R’s stats functions are much more numerous and under much more rapid development for specialized applications like animal movement, species distribution modeling, econometrics, etc. However, the recent implementation of Bayesian modelling via STAN into Python (as Pystan) by Andrew Gelman and his team has removed the major issue, as I can now do almost any test (linear, non-linear, multi-level) that I can write a Bayesian MCMC model (more on this later, because Pystan is awesome but still young. In particular, implementing mcmcplots for stan fits would be amazing).

The second major barrier was trellis plots. R has ggplot2, which makes trellis plotting ludicrously simple. It also allows me to plot summary functions (like means and S.E.) without having to actually aggregate these values out. This isn’t a problem on a simple plot but it rapidly becomes cumbersome on a multi-panel plot wherein these summary statistics need to be calculated for each panel. ggplot2 allows me to circumvent this. There is an active port of ggplot2 to Python ongoing, but it too is still young and many functions are incomplete (i.e. boxplots). The good news is that I’ve discovered rpy2, which allows me to call R functions (like ggplot!) in Python. I can do all of my data sorting, stats, etc. in Python, then use rpy2/ggplot to make the plots.

A short example:

# Import the necessary modules
import numpy as np
import pandas as pd
import rpy2.robjects as robj
import rpy2.robjects.pandas2ri # for dataframe conversion
from rpy2.robjects.packages import importr

# First, make some random data
x = np.random.normal(loc = 5, scale = 2, size = 10)
y = x + np.random.normal(loc = 0, scale = 2, size = 10)

# Make these into a pandas dataframe. I do this because
# more often than not, I read in a pandas dataframe, so this
# shows how to use a pandas dataframe to plot in ggplot
testData = pd.DataFrame( {'x':x, 'y':y} )
# it looks just like a dataframe from R
print testData

# Next, you make an robject containing function that makes the plot.
# the language in the function is pure R, so it can be anything
# note that the R environment is blank to start, so ggplot2 has to be
# loaded
plotFunc = robj.r("""
 library(ggplot2)

function(df){
 p <- ggplot(df, aes(x, y)) +
 geom_point( )

print(p)
 }
""")

# import graphics devices. This is necessary to shut the graph off
# otherwise it just hangs and freezes python
gr = importr('grDevices')

# convert the testData to an R dataframe
robj.pandas2ri.activate()
testData_R = robj.conversion.py2ri(testData)

# run the plot function on the dataframe
plotFunc(testData_R)

# ask for input. This requires you to press enter, otherwise the plot
# window closes immediately
raw_input()

# shut down the window using dev_off()
gr.dev_off()

# you can even save the output once you like it
plotFunc_2 = robj.r("""
 library(ggplot2)

function(df){
 p <- ggplot(df, aes(x, y)) +
 geom_point( ) +
 theme(
 panel.background = element_rect(fill = NA, color = 'black')
 )

ggsave('rpy2_magic.pdf', plot = p, width = 6.5, height = 5.5)
 }
""")

plotFunc_2(testData_R)

rpy2_magic

So there you have it. Python now has the capability to access ggplot2, although it can be a bit cumbersome. Rpy2 is pretty great, and allows access to any function, albeit with a bit of work. Thus, Python can serve as a main platform which can access R functions for statistics and graphics on an as-needed basis. My hope is that the yhat team finished the port soon, and I’ll bet the statistics packages catch up in the next few years.