# Multivariate Techniques in Python: EcoPy Alpha Launch!

I’m announcing the alpha launch of EcoPy: Ecological Data Analysis in Python. EcoPy is a Python module that contains a number of  techniques (PCA, CA, CCorA, nMDS, MDS, RDA, etc.) for exploring complex multivariate data. For those of you familiar with R, think of this as the Python equivalent to the ‘vegan‘ package.

However, I’m not done! This is the alpha launch, which means you should exercise caution before using this package for research. I’ve stress-tested a couple of simple examples to ensure I get equivalent output in numerous programs, but I haven’t tested it out with real, messy data yet. There might be broken bits or quirks I don’t know about. For the moment, be sure to verify your results with other software.

That said, I need help! My coding might be sloppy, or you might find errors that I didn’t, or you might suggest improvements to the interface. If you want to contribute, either by helping with coding or stress-testing on real-world examples, let me know. The module is available on github and full documentation is available at readthedocs.org.

# An intuitive explanation for the ‘double-zeroes’ problem with Euclidean distances

First, some background. Given a multivariate dataset with a large number of descriptor variables (i.e. columns in the matrix), ecologists (and others) often try to distill all of the descriptors into a single metric describing the relatedness of the objects in the matrix (i.e. rows). They usually do this by calculating one of many ‘distance’, ‘similarity’, or ‘dissimilarity’ metrics, all of which have various properties. Commonly in ecology, this is done for site x species matrices, where ecologists attempt to describe how sites are related to one another based on community composition. By far the most common is Euclidean distance. It follows from the Pythagorean theorem. Suppose we have two sites, or rows, called ‘1’ and ‘2’, because I’m feeling creative. Then site 1 is a vector $\mathbf{x_1}$ with one entry per species, same for site ‘2’ $\mathbf{x_2}$. The euclidean distance is the sum of squared differences between the two sites:

$\sqrt{ \sum_1^n (x_{1n} - x_{2n})^2 }$

or in vector notation:

$\sqrt{ (\mathbf{x_1} - \mathbf{x_2})'(\mathbf{x_1}-\mathbf{x_2}) }$

We square so far?

The common criticism of Euclidean distances is that it ‘counts double zeros’, so that species absent from both sites actually lead to sites being more similar than otherwise. A number of other metrics, like the chord distance, don’t have this problem. The chord distance is the Euclidean distance of normalized vectors. Define $\mathbf{n_1}$ as the normalized vector of Site 1 $\mathbf{x_1}$ and the same for Site 2.

$\mathbf{n_1} = \frac{\mathbf{x_1}}{\sqrt{\mathbf{x_1'x_1}}}$

and so on for Site 2. Then, the chord distance is identical to the Euclidean distance above:

$\sqrt{ (\mathbf{n_1} - \mathbf{n_2})'(\mathbf{n_1} - \mathbf{n_2}) }$

The question I’ve always had is this.. how can the Euclidean distance count double zeroes while the chord distance, which is Euclidean does not? The answer is that neither of them do. You can add as many double zeroes to either vector and the distance does not change. For example, imagine two sites with three species $\mathbf{x_1} = [0, 4, 8]$ and $\mathbf{x_2} = [0, 1, 1]$. The Euclidean distance for these two sites is 7.6158. The chord distance for these sites is 0.3203. Now, let’s tack on 5 zeroes to each site (5 double zeroes). Amazingly, both the Euclidean and chord distances are unchanged. This is because the zeros cancel out $(0-0)^2 = 0$, so they contribute nothing to the distance. This is the same rationale that Legendre and Legendre give in Numerical Ecology for why double zeroes do not contribute to chi-square metrics, yet the same applies for Euclidean distances.

So what’s the deal with Euclidean distance and double zeroes? Obviously the zeroes cancel, just as in other metrics. The issue comes up when you use Euclidean distances on raw abundances and attempt to make inference about species composition, which leads to the so-called paradox of Euclidean distances. Let’s take the example matrix:

$\begin{bmatrix} 0 & 4 & 8 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}$

Sites 1 and 2 share two species in common, while Site 3 is all by its one-sies. If you calculate the Euclidean distances between these sites, you get:

$\begin{bmatrix} 0 & 7.62 & 9 \\ 7.62 & 0 & 1.73 \\ 9 & 1.73 & 0 \end{bmatrix}$

Sites 2 and 3 are more similar than Sites 1 and 2, even though Site 3 shares no species in common!  Let’s try it on the chord distances. Doing that, we get:

$\begin{bmatrix} 0 & 0.32 & 1.41 \\ 0.32 & 0 & 1.41 \\ 1.41 & 1.41 & 0 \end{bmatrix}$

That’s better. Now Site 3 is equally distant from both Sites 1 and 2 since it shares no species in common with either of them. So what the hell? This is why it’s termed a paradox. But if I’ve learned anything by watching the iTunes U lecture of Harvard Stats 110 (Thanks Joel!), it’s that anything called a paradox just means you haven’t thought about it long enough. Here’s a hint: the answer isn’t that Euclidean distance counts double zeroes while Chord does not, as shown above. Especially since Chord is Euclidean, it uses the exact same equation.

The answer is actually much simpler, and non-mathy. Euclidean distances on raw abundance values place a premium on differences in the number of individuals, not species. So it’s actually getting it right. Sites 2 and 3 have 2 and 1 individuals total, respectively. When you take the difference, you’re basically counting up the number of individuals the sites do not share. In that case, it happens to be that Sites 2 and 3 only have three individuals that differ between them. Sites 1 and 3 have 13 individuals that differ between them, and Sites 1 and 2 have 10 individuals that differ between them. So by this math, Sites 2 and 3 actually should be really similar.

Chord distances (and $\chi^2$ distances, and others) standardize the data, taking differences in total abundances out of the equation. Instead, it compares how individuals are distributed across species. Since all of Sites 3 is in the first species, and Sites 1 and 2 distributed their individuals in the second and third species, obviously Sites 1 and 2 will be more similar. This is why McCune and Grace even say that Euclidean distances on relativized species abundances is OK. If you want to compare species composition using Euclidean distances, you need to first take differences in abundances out of the question. All of the other ‘non-zero-counting’ distances more or less do the same thing.

If your question is how sites vary in both abundance AND species composition, then Euclidean distance is probably OK. Just don’t use PCA on species abundances. Ever.

By the way, the iTunes U Harvard Stats 110 series is awesome, and Joel Blitzstein is a great lecturer. Totally worth the time to watch all the lectures. And its free.

Python code for the above is here:


import numpy as np

x1 = np.array([0, 4, 8])
x2 = np.array([0, 1, 1])
Euc_D = np.sqrt( (x1-x2).dot(x1-x2) )

n1 = x1/np.sqrt( x1.dot(x1) )
n2 = n2/np.sqrt( x2.dot(x2) )
Chord_D = np.sqrt( (n1-n2).dot(n1-n2) )

x1_2 = np.append(x1, np.zeros(5))
x2_2 = np.append(x2, np.zeros(5))
Euc_D2 = np.sqrt( (x1_2 - x2_2).dot(x1_2 - x2_2) )

n1_2 = x1_2 / np.sqrt(x1_2.dot(x1_2))
n2_2 = x2_2 / np.sqrt(x2_2.dot(x2_2) )
Chord_D2 = np.sqrt((n1_2 - n2_2).dot(n1_2 - n2_2))

x3 = np.array([1, 0, 0])
Sites = np.array([x1, x2, x3])
Euc_M = np.zeros([3, 3])
for i in xrange(3):
for j in xrange(3):
Euc_M[i,j] = np.sqrt((Sites[i,:] - Sites[j,:]).dot( Sites[i,:] - Sites[j,:] ) )

Chord_Sites = np.apply_along_axis(lambda x: x/np.sqrt(x.dot(x)), 1, Sites )
for i in xrange(3):
for j in xrange(3):
Chord_M[i,j] = np.sqrt( (Chord_Sites[i,:] - Chord_Sites[j,:]).dot( Chord_Sites[i,:] - Chord_Sites[j,:] ) )


# My Ideal Python Setup for Statistical Computing

I’m moving more and more towards Python only (if I’m not there already). So I’ve spent a good deal of time getting the ideal Python IDE setup going. One of the biggest reasons I was slow to move away from R is that R has the excellent RStudio IDE. Python has Spyder, which is comparable, but seems sluggish compared to RStudio. I’ve tried PyCharm, which works well, but I had issues with their interactive interpreter running my STAN models.

A friend pointed me towards SublimeText 3, and I have to say that it’s everything I wanted. The text editor is slick, fast, and has lots of great functions. But more than that, the add-ons are really what make Sublime shine

• Side Bar Enhancements: This extends the side-bar project organizer, allowing you to add folders and files, delete things, copy paths, etc. A must have.
• SublimeREPL: Adds interactive interpreters for an enormous number of languages, both R and Python included. Impossible to work without.
•  Anaconda: An AMAZING package that extends Sublime by offering live Python linting to make sure my code isn’t screwed up, PEP8 formatters for those of you who like such things, and built in documentation and code retrieval, for those times you’ve forgotten how the function works. Another must have.
• SublimeGIT: For working with github straight from Sublime. Great if you’re doing any sort of module building.
• Origami: A new way to split layouts and organize your screen. Not essential, but helpful
• Bracket Highlighter: Helpful for seeing just what set of parentheses I’m working in.

Sublime and all of these packages are also incredibly customizable, you can make them work and look however you want. I’ve spent a few days customizing my setup and I think its fairly solid. Here are my preferences:

For the main Sublime, I modified the scrolling map, turned off autocomplete (which I find annoying but can still access with Ctrl+space, adjusted the carat so I could actually see it, changed the font, and a few other odds and ends.

{
"always_show_minimap_viewport": true,
"auto_complete": false,
"bold_folder_labels": true,
"caret_style": &amp;quot;phase&amp;quot;,
"color_scheme": "Packages/Theme - Flatland/Flatland Dark.tmTheme",
"draw_minimap_border": true,
"font_face": "Deja San Mono",
"font_size": 14,
"highlight_line": true,
"highlight_modified_tabs": true,
"ignored_packages":
[
"Vintage";
],
"preview_on_click": false,
"spell_check": true,
"wide_caret": true,
}


For Bracket Highlighter, I changed the style of the highlight:

{
"high_visibility_enabled_by_default": true,
"high_visibility_style": "thin_underline",
"high_visibility_color": "__default__",
}


For Side-Bar Enhancements, I’ve modified the ‘Open With’ options. For Anaconda, I changed a few small things and turned off PEP8 linting, which I hate. I don’t hate linting nor PEP8, but I don’t have much use for PEP8 linting constantly telling me that I put a space somewhere inappropriate.

{
"complete_parameters": true,
"complete_all_parameters": false,
"anaconda_linter_mark_style": "outline",
"pep8": false,
"anaconda_gutter_theme": "basic",
"anaconda_linter_delay": 0.5,
}


I also installed the Flatland Theme to make it pretty. Here is the end result, also showing the Anaconda documentation viewer that I find so awesome:

I also now use Sublime for all of my R, knitr, and LaTeX work as well. In all, it’s a pretty phenomenal editor that can do everything I need it to and combines at least four separate applications into one (TextWrangler, Spyder, RStudio, TexShop). Now, some day I’ll be able to afford the $70 to turn off that reminder that I haven’t paid (and$15 for LaTeXing).

UPDATE

I forgot to mention snippets. You can create snippets in Sublime that are shortcuts for longer code. For example, I heavily customize my graphs in the same way every time. Instead of typing all the code, I can now just type tplt followed by a tab and I automatically get:


f, ax = plt.subplots()
ax.plot()
#ax.set_ylim([ , ])
#ax.set_xlim([ , ])
ax.set_ylabel(&amp;quot;ylab&amp;quot;)
ax.set_xlabel(&amp;quot;xlab&amp;quot;)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_position(('outward', 10))
#ax.spines['bottom'].set_bounds()
ax.spines['left'].set_position(('outward', 10))
#ax.spines['left'].set_bounds()
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
plt.savefig(,bbox_inches = 'tight')
plt.show()



Great if you rewrite the same code many times.

# A Split-Plot Three-Way ANOVA with STAN and Python

Now that STAN has arrived and has been ported over to Python, I’ve moved all of my data analyses over to Python. At first, it was kind of a pain. Python can do ANOVAs and linear models in an R-like interface, but the stats models module is still under development and there was no support for multi-level models (like split-plots or nested designs). However, with STAN now implemented in Python, it’s possible to code these models yourself and run whatever analyses you want in Python using Bayesian methods and principles (which is even better!).

Here’s an example of a split-plot three-way ANOVA-style analysis programmed in STAN and implemented in Python. Since these data haven’t been published yet, I won’t link to them or discuss them in any way. This post is mainly to help those who might be looking to do something similar. The experimental design has two within-plot factors, chamber temperature and induced, both of which have two levels that I represent as 0 and 1. For example, chamber temperature is either 0 (25˚) or 1 (30˚). Likewise, induced is either 0 (Not Induced) or 1 (Induced). The whole-plot factor is either 0 (Ambient) or 1 (Warmed). This is pretty simply since no factor has more than two levels. There is just one regression coefficient needed to identify each factor.

In a split plot design, the within-plot factors and all interactions involving the within-plot factors occur at the lower level, while the whole-plot factor occurs at the top level. Here it is in STAN. Within-plot regression coefficients were drawn from a multivariate normal distribution and each whole-plot : within-plot treatment combination was given its own variance, rather than assuming constant variance among groups.

splitPlot = """
data{
int<lower = 0> N;     // number of observations
int<lower = 0> J;     // number of plots
real y[N];     // response (Herbivore RGR)
int plot[N];     // plot identifier
int induced[N];     // dummy coding for netting treatment (0 = Not Induced, 1 = Induced)
int temp[N];     // dummy coding for plot temperature (0 = Ambient, 1 = Induced)
int chamber_temp[N];     // dummy coding for the feeding assay temperature (0 = 25, 1 = 30)
int var_group[N];     // dummy coding for variance group
int plot_temp[J];     // dummy coding for plot temperature at the plot level (1 = Ambient, 2 = Induced)
cov_matrix[6] prior_cov;     // prior for covariance matrix of regression coef
vector[6] prior_mu;     // prior for mean of regression coefficients (0)
}
parameters{
real B0[J];     // plot means
vector[6] B;     // Coefficients
real G1;     // plant temperature effects
real mu;     // overall mean
real <lower = 0, upper = 10> sd_y[8];     // sd for each temperature-chamber-induced group
real <lower = 0, upper = 10> sd_b0;     // sd of plot-level means
}
transformed parameters{
vector[N] yhat;
vector[J] B0hat;
vector[N] sd_temp;

for(n in 1:N){
yhat[n] <- B0[plot[n]] + B[1]*induced[n] + B[2]*chamber_temp[n] + B[3]*induced[n]*temp[n] + B[4]*induced[n]*chamber_temp[n] + B[5]*temp[n]*chamber_temp[n] + B[6]*induced[n]*temp[n]*chamber_temp[n];
sd_temp[n] <- sd_y[var_group[n]];
}

for(j in 1:J){
B0hat[j] <- mu + G1*plot_temp[j];
}
}
model{
y ~ normal(yhat, sd_temp);
B0 ~ normal(B0hat, sd_b0);

// PRIOR
mu ~ normal(0, 4);
G1 ~ normal(0, 4);
B ~ multi_normal(prior_mu, prior_cov);
}
generated quantities{
real induced_ambient25;
real induced_warmed25;
real netted_ambient25;
real netted_warmed25;
real induced_ambient30;
real induced_warmed30;
real netted_ambient30;
real netted_warmed30;

induced_ambient25 <- mu + B[1];
netted_ambient25 <- mu;
induced_warmed25 <- mu + G1 + B[1] + B[3];
netted_warmed25 <- mu + G1;

induced_ambient30 <- mu + B[1] + B[2] + B[4];
induced_warmed30 <- mu + G1 + B[1] + B[2] + B[3] + B[4] + B[5] + B[6];
netted_ambient30 <- mu + B[2];
netted_warmed30 <- mu + G1 + B[2] + B[5];
}
"""


Above is the preferred linear-model based specification. It basically says that the predicted values (y-hat) are a function of the plot mean and any within-plot factors. The plot means themselves are a function of an intercept and the whole-plot factors. The mixing is beautiful (100,000 samples, thinning every 20) and quick (just under 3 mins):

You also get good coefficient estimates that are easy to interpret. I won’t show them here because the data aren’t published, but each regression coefficient is directly interpretable as a main effect or interaction (this would get more complicated if one of my factors had >2 levels, but still doable).

There is an alternative, cell-means formulation where you just estimate the cell means directly.


splitPlot = """
data{
int<lower = 0> N; // number of observations
int<lower = 0> J; // number of plots
real y[N]; // response (Herbivore RGR)
int plot[N]; // plot identifier
int induced[N]; // dummy coding for netting treatment (0 = Not Induced, 1 = Induced)
int temp[N]; // dummy coding for plot temperature (0 = Ambient, 1 = Induced)
int chamber_temp[N]; // dummy coding for the feeding assay temperature (0 = 25, 1 = 30)
int var_group[N]; // dummy coding for variance group
int plot_temp[J]; // dummy coding for plot temperature at the plot level (1 = Ambient, 2 = Induced)
}
parameters{
real B0[J]; // plot means
real B1[2]; // netting effects
real B3[2];
real B2[2,2,2];
real G1[2]; // plant temperature effects
real mu; // overall mean
real <lower = 0, upper = 10> sd_y[8]; // common standard deviation
real <lower = 0, upper = 10> sd_b0;
}
transformed parameters{
vector[N] yhat;
vector[J] B0hat;
vector[N] sd_temp;

for(n in 1:N){
yhat[n] <- B0[plot[n]] + B1[induced[n]] + B3[chamber_temp[n]] + B2[induced[n], temp[n], chamber_temp[n]];
sd_temp[n] <- sd_y[var_group[n]];
}

for(j in 1:J){
B0hat[j] <- mu + G1[plot_temp[j]];
}
}
model{
y ~ normal(yhat, sd_temp);
B0 ~ normal(B0hat, sd_b0);

// PRIOR
mu ~ normal(0, 4);
G1 ~ normal(0, 4);
B1 ~ normal(0, 4);
B3 ~ normal(0, 4);
}
generated quantities{
real grand_mean;
real A1[2];
real A2[2];
matrix[2,2] A3;
real netted_ambient25;
real netted_warm25;
real induced_ambient25;
real induced_warm25;
real netted_ambient30;
real netted_warm30;
real induced_ambient30;
real induced_warm30;

netted_ambient25 <- mu + B1[1] + G1[1] + B2[1,1,1] + B3[1];
netted_warm25 <- mu + B1[1] + G1[2] + B2[1,2,1] + B3[1];
induced_ambient25 <- mu + B1[2] + G1[1] + B2[2,1,1] + B3[1];
induced_warm25 <- mu + B1[2] + G1[2] + B2[2,2,1] + B3[1];

netted_ambient30 <- mu + B1[1] + G1[1] + B2[1,1,2] + B3[2];
netted_warm30 <- mu + B1[1] + G1[2] + B2[1,2,2] + B3[2];
induced_ambient30 <- mu + B1[2] + G1[1] + B2[2,1,2] + B3[2];
induced_warm30 <- mu + B1[2] + G1[2] + B2[2,2,2] + B3[2];
}
"""


I don’t like this specification as much because the parameters don’t mix well (although the generated quantities themselves are fine). To get good parameter estimates, you need to somehow impose sum-to-zero constraints, either within the STAN model (which I don’t know how to do) or after the fact by manipulating the raw posterior parameter estimates (which I didn’t care enough to figure out). This formula is also MUCH slower, taking over twice as long as the first model. I suspect is has to do with the 3D array for the interaction term, but I’m not positive.

That said, both models give great predictions of the cell means, of which I will only show one tiny bit (black is observed means, red is modeled means):

# In Defense of Matplotlib

I’ve been doing some reading, and I’ve discovered that a lot of people don’t like matplotlib. Specifically, it seems that the default settings are a big turn off, and I agree. They are pretty hideous. There are a lot of ongoing projects that attempt to rectify matplotlib, or reinvent Python plotting altogether, including Plotly, CairoPlot, Veusz, prettyplotlibSeaborn (which appears to mimic R’s ggplot2), and ggplot itself (which is working to port over R’s ggplot). Some of these are complete language overhauls (Plotly, CairoPlot, Veusz, ggplot) and others are built on matplotlib (Seaborn, prettyplotlib). Either way, there’s a lot of effort being devoted to replacing or redesigning matplotlib. I understand some of it. The matplotlib language is difficult and it’s default settings are horrendous. It takes a lot of tweaking to get to something workable. That being said, matplotlib is so infinitely customizable so that it is capable of making some pretty awesome graphs.

Yes, it takes a bit of work, but because matplotlib is so infinitely customizable, you can make matplotlib graphs look absolutely fantastic. Here are some of my favorites that I’ve made:

I’m proud of the panel plots, in particular. Using a for() loop and some general programming, I can make that panel/lattice plot in about 28 lines of code. It takes me roughly the same number of lines to make both panel plots presented above.

So, although it takes some work, I really see nothing wrong with matplotlib. It works very well, it’s mature, it is more flexible than some of the other modules, and can make some graphs that look pretty outstanding.

That said, I’m still excited for ggplot to be finished. The ability to calculate statistics within the plotting framework (as in the stat_summary() function) and the ease of lattice plots have always appealed to me. Plus, the grammer of graphics language makes a lot of sense and is more intuitive than matplotlib.