An intuitive explanation for the ‘double-zeroes’ problem with Euclidean distances

First, some background. Given a multivariate dataset with a large number of descriptor variables (i.e. columns in the matrix), ecologists (and others) often try to distill all of the descriptors into a single metric describing the relatedness of the objects in the matrix (i.e. rows). They usually do this by calculating one of many ‘distance’, ‘similarity’, or ‘dissimilarity’ metrics, all of which have various properties. Commonly in ecology, this is done for site x species matrices, where ecologists attempt to describe how sites are related to one another based on community composition. By far the most common metric is Euclidean distance, which follows from the Pythagorean theorem. Suppose we have two sites, or rows, called ‘1’ and ‘2’, because I’m feeling creative. Then Site 1 is a vector \mathbf{x_1} with one entry for each of the n species, and likewise Site 2 is a vector \mathbf{x_2}. The Euclidean distance is the square root of the sum of squared differences between the two sites:

\sqrt{ \sum_{j=1}^{n} (x_{1j} - x_{2j})^2 }

or in vector notation:

\sqrt{ (\mathbf{x_1} - \mathbf{x_2})'(\mathbf{x_1}-\mathbf{x_2}) }

Are we square so far?

The common criticism of Euclidean distance is that it ‘counts double zeroes’, so that species absent from both sites actually make the sites look more similar than they otherwise would. A number of other metrics, like the chord distance, are said not to have this problem. The chord distance is the Euclidean distance between normalized vectors. Define \mathbf{n_1} as the normalized (unit-length) version of Site 1’s vector \mathbf{x_1}:

\mathbf{n_1} = \frac{\mathbf{x_1}}{\sqrt{\mathbf{x_1'x_1}}}

and so on for Site 2. Then, the chord distance has exactly the same form as the Euclidean distance above, just applied to the normalized vectors:

\sqrt{ (\mathbf{n_1} - \mathbf{n_2})'(\mathbf{n_1} - \mathbf{n_2}) }

The question I’ve always had is this: how can the Euclidean distance count double zeroes while the chord distance, which is Euclidean, does not? The answer is that neither of them does. You can add as many double zeroes as you like to both vectors and the distance does not change. For example, imagine two sites with three species, \mathbf{x_1} = [0, 4, 8] and \mathbf{x_2} = [0, 1, 1]. The Euclidean distance for these two sites is 7.6158. The chord distance for these sites is 0.3204. Now, let’s tack on 5 zeroes to each site (5 double zeroes). Amazingly, both the Euclidean and chord distances are unchanged. This is because the zeroes cancel out, (0-0)^2 = 0, so they contribute nothing to the distance. This is the same rationale that Legendre and Legendre give in Numerical Ecology for why double zeroes do not contribute to chi-square metrics, yet the same applies to Euclidean distances.

So what’s the deal with Euclidean distance and double zeroes? Obviously the zeroes cancel, just as in other metrics. The issue comes up when you use Euclidean distances on raw abundances and attempt to make inference about species composition, which leads to the so-called paradox of Euclidean distances. Let’s take the example matrix:

\begin{bmatrix} 0 & 4 & 8 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}

Sites 1 and 2 share two species in common, while Site 3 is all by its one-sies. If you calculate the Euclidean distances between these sites, you get:

\begin{bmatrix} 0 & 7.62 & 9 \\ 7.62 & 0 & 1.73 \\ 9 & 1.73 & 0 \end{bmatrix}

Sites 2 and 3 come out as more similar than Sites 1 and 2, even though Site 3 shares no species with either of the other sites! Let’s try it with the chord distances. Doing that, we get:

\begin{bmatrix} 0 & 0.32 & 1.41 \\ 0.32 & 0 & 1.41 \\ 1.41 & 1.41 & 0 \end{bmatrix}

That’s better. Now Site 3 is equally distant from both Sites 1 and 2, since it shares no species in common with either of them. So what the hell? This is why it’s termed a paradox. But if I’ve learned anything by watching the iTunes U lectures of Harvard Stat 110 (Thanks Joe!), it’s that anything called a paradox just means you haven’t thought about it long enough. Here’s a hint: the answer isn’t that Euclidean distance counts double zeroes while chord does not, as shown above. After all, chord is Euclidean; it uses the exact same equation.

The answer is actually much simpler, and non-mathy. Euclidean distances on raw abundance values place a premium on differences in the number of individuals, not species. So it’s actually getting it right. Sites 2 and 3 have 2 and 1 individuals total, respectively. When you take the difference, you’re basically counting up the number of individuals the sites do not share. In that case, it happens to be that Sites 2 and 3 only have three individuals that differ between them. Sites 1 and 3 have 13 individuals that differ between them, and Sites 1 and 2 have 10 individuals that differ between them. So by this math, Sites 2 and 3 actually should be really similar.
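The head-counting above is just the sum of absolute differences across species (often called the Manhattan or city-block distance, a name this post doesn’t otherwise use). Here is a minimal sketch that reproduces the 3, 13 and 10 figures; the variable names are mine, chosen only for illustration:

```python
import numpy as np

# Rows are Sites 1-3, columns are species (the example matrix from above)
sites = np.array([[0, 4, 8],
                  [0, 1, 1],
                  [1, 0, 0]])

# Total number of individuals at each site
totals = sites.sum(axis=1)

# Pairwise sum of absolute differences: the individuals that "differ"
# between two sites. Broadcasting builds a (3, 3, 3) array of
# per-species differences, which is then summed over species.
diff_counts = np.abs(sites[:, None, :] - sites[None, :, :]).sum(axis=2)

print(totals)       # individuals at Sites 1, 2 and 3
print(diff_counts)  # entry [i, j] compares Site i+1 with Site j+1
```

The ordering matches the Euclidean matrix: Sites 2 and 3 are the closest pair, and Sites 1 and 3 are the farthest apart.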

Chord distances (and \chi^2 distances, and others) standardize the data, taking differences in total abundances out of the equation. Instead, they compare how individuals are distributed across species. Since all of Site 3’s individuals are in the first species, while Sites 1 and 2 distribute their individuals across the second and third species, Sites 1 and 2 will obviously be more similar to each other. This is why McCune and Grace even say that Euclidean distances on relativized species abundances are OK. If you want to compare species composition using Euclidean distances, you need to first take differences in abundances out of the question. All of the other ‘non-zero-counting’ distances more or less do the same thing.
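To make the ‘taking abundances out of the equation’ point concrete, here is a short derivation plus a numerical check. Writing \theta for the angle between the two site vectors (my notation, not something defined elsewhere in this post), the squared chord distance expands to:

(\mathbf{n_1} - \mathbf{n_2})'(\mathbf{n_1} - \mathbf{n_2}) = \mathbf{n_1'n_1} + \mathbf{n_2'n_2} - 2\mathbf{n_1'n_2} = 2 - 2\cos\theta

so the chord distance depends only on the direction of each vector, not on its length. Sites with no species in common are orthogonal (\cos\theta = 0), which gives \sqrt{2} \approx 1.41, the value Site 3 gets in the chord matrix. The sketch below checks this by doubling every abundance at Site 1; the helper names `euclidean` and `chord` are ones I made up for the illustration.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two abundance vectors."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.sqrt(d.dot(d))

def chord(a, b):
    """Euclidean distance after scaling each vector to unit length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return euclidean(a / np.sqrt(a.dot(a)), b / np.sqrt(b.dot(b)))

x1 = np.array([0, 4, 8])
x2 = np.array([0, 1, 1])
x3 = np.array([1, 0, 0])

# Same species proportions as Site 1, but twice as many individuals
x1_doubled = 2 * x1

print(euclidean(x1, x2), euclidean(x1_doubled, x2))  # Euclidean distance grows
print(chord(x1, x2), chord(x1_doubled, x2))          # chord distance is unchanged
print(chord(x1, x3), np.sqrt(2))                     # no shared species: sqrt(2)
```

Whether the jump in Euclidean distance is a bug or a feature depends on whether differences in total abundance matter for the question being asked.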

If your question is how sites vary in both abundance AND species composition, then Euclidean distance is probably OK. Just don’t use PCA on species abundances. Ever.

By the way, the iTunes U Harvard Stat 110 series is awesome, and Joe Blitzstein is a great lecturer. Totally worth the time to watch all the lectures. And it’s free.

Python code for the above is here:


import numpy as np

# Two sites, three species
x1 = np.array([0, 4, 8])
x2 = np.array([0, 1, 1])

# Euclidean distance on raw abundances
Euc_D = np.sqrt( (x1-x2).dot(x1-x2) )

# Chord distance: scale each site to unit length, then take the Euclidean distance
n1 = x1/np.sqrt( x1.dot(x1) )
n2 = x2/np.sqrt( x2.dot(x2) )
Chord_D = np.sqrt( (n1-n2).dot(n1-n2) )

# Tack 5 double zeroes onto each site; Euc_D2 should equal Euc_D
x1_2 = np.append(x1, np.zeros(5))
x2_2 = np.append(x2, np.zeros(5))
Euc_D2 = np.sqrt( (x1_2 - x2_2).dot(x1_2 - x2_2) )

# Same check for the chord distance; Chord_D2 should equal Chord_D
n1_2 = x1_2 / np.sqrt(x1_2.dot(x1_2))
n2_2 = x2_2 / np.sqrt(x2_2.dot(x2_2) )
Chord_D2 = np.sqrt((n1_2 - n2_2).dot(n1_2 - n2_2))

# Three-site example: pairwise Euclidean distance matrix
x3 = np.array([1, 0, 0])
Sites = np.array([x1, x2, x3])
Euc_M = np.zeros([3, 3])
for i in range(3):
    for j in range(3):
        Euc_M[i,j] = np.sqrt((Sites[i,:] - Sites[j,:]).dot( Sites[i,:] - Sites[j,:] ) )

# Pairwise chord distance matrix, computed on the row-normalized sites
Chord_Sites = np.apply_along_axis(lambda x: x/np.sqrt(x.dot(x)), 1, Sites )
Chord_M = np.zeros([3, 3])
for i in range(3):
    for j in range(3):
        Chord_M[i,j] = np.sqrt( (Chord_Sites[i,:] - Chord_Sites[j,:]).dot( Chord_Sites[i,:] - Chord_Sites[j,:] ) )

8 thoughts on “An intuitive explanation for the ‘double-zeroes’ problem with Euclidean distances”

  1. Pingback: Species abundances, raw abundances, and species composition | Hypergeometric

  2. Thank you very much for this post, which finally answered the question we had in our lab! But why not use PCA on species abundances?

    • Well there are a couple of reasons:

      1) PCA assumes linear relationships between its variables, and species abundances are almost never linearly related. So PCA may not accurately characterize similarity between sites. This is the biggest problem with PCA and abundances.

      2) As described here, most of the time we are interested in how community composition varies among sites, independent of site abundance. So two sites with identical species composition, but just more individuals at one of them, could be depicted as being quite different when actually they are the same. To counter this, you could use relative abundances, but this doesn’t free you from issue #1.

      3) PCA assigns as much weight to double absences (i.e. species absent from both sites) as double presences (i.e. species present at both sites). But a species can be absent from both sites for many reasons (maybe the sites are at opposite ends of an environmental gradient, or the species just didn’t disperse there). Double absences tell you relatively little about site similarity, whereas double presences tell you a great deal (i.e. both sites are suitable for the species). So a metric like Bray-Curtis, which ignores double absences, followed by PCoA or nMDS would be better.

      Hope this helps!

      • Yes it does!
        If I may go further, I am not sure about the distinction between PCoA and nMDS. I bet that the difference comes down to the linear relationships between variables assumed in PCoA, doesn’t it?

      • Thanks for your helpful info!
        a) double-presences only carry the same information for Euclidean distance if they have the same abundance value; otherwise a double-presence even leads to a larger Euclidean distance than a double-zero, right?
        b) just to be sure: double-absence = double-zero (?)
        c) the double-zero problem is not the same as the species abundance paradox, e.g. in the example from Legendre & L., there could be a 1 in the top-left cell, leaving us with an example with no double-zeros, but we would still have the species abundance paradox, right? Seems to me that L. & L. sometimes use “double-zero” synonymously, e.g. after the example you give (p. 300 in my 3rd paperback edition of L.&L.). You made this distinction clear!
        d) deduced from this, it would not be enough to have no/few double-zeros in a species abundance table to use PCA or RDA, right?
        e) Legendre, P. & Gallagher, E.D. (2001) Ecologically meaningful transformations for ordination of species data. Oecologia 129, 271–280. suggest some transformations that may/should/can make species abundance data amenable to PCA, RDA etc.

      • Good comments! My biggest issue is this: the claim that Euclidean distances aren’t appropriate for abundance data because they count “double zeros” (which are the same as double absences, you are correct) is kind of misleading, because a number of accepted metrics treat “double zeros” in exactly the same way, as I explained in the post. For example, if you calculate Euclidean distance on Hellinger-transformed data (i.e. the Hellinger distance, a close cousin of the chord distance), that’s considered OK, even though the underlying math is identical. This just doesn’t make sense.

        My take home message is that your point a) is correct, if I’m reading it correctly. Double-presences for Euclidean distance can be inflated by differences in total abundances between sites. You can imagine two sites that have the exact same relative abundance of two species, except one site has twice the number of animals. The Euclidean distance will say that the two sites are quite different. The chord distance (or Euclidean distance on relative abundances, which is also OK) will say they are identical. Which is correct? That depends on your question and whether you find differences in abundances meaningful.

  3. Nope! PCoA works on any distance metric. If you do PCoA on Euclidean distances, you get the same exact result as PCA. But you can do PCoA on Bray-Curtis, Canberra, or whatever distance metric you choose. Same with nMDS.

    The difference is that PCoA works via eigendecomposition, and therefore can only fully represent distances that can be embedded in Euclidean space. If you use a semimetric dissimilarity, like Bray-Curtis, there is no guarantee you’ll get a faithful representation of the points (you can end up with negative eigenvalues). nMDS only tries to preserve the rank order of the dissimilarities, so it can be used with any dissimilarity measure.
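The claim that PCoA on Euclidean distances reproduces PCA can be checked numerically. This is a rough sketch using the three example sites from the post and plain NumPy, with classical PCoA done by double-centering the squared distance matrix. The variable names are mine, and the axes are compared by absolute value because each ordination axis is only defined up to a sign flip.

```python
import numpy as np

sites = np.array([[0, 4, 8],
                  [0, 1, 1],
                  [1, 0, 0]], dtype=float)
n = sites.shape[0]

# --- PCA: center the columns, then take site scores from the SVD ---
centered = sites - sites.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pca_scores = (U * S)[:, :2]            # scores on the first two axes

# --- PCoA on Euclidean distances ---
diff = sites[:, None, :] - sites[None, :, :]
D2 = (diff ** 2).sum(axis=2)           # squared Euclidean distances
J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
B = -0.5 * J.dot(D2).dot(J)            # double-centered matrix
eigvals, eigvecs = np.linalg.eigh(B)   # eigh returns ascending eigenvalues
top = np.argsort(eigvals)[::-1][:2]    # indices of the two largest
pcoa_scores = eigvecs[:, top] * np.sqrt(eigvals[top])

# Same coordinates, up to the sign of each axis
print(np.allclose(np.abs(pca_scores), np.abs(pcoa_scores)))
```

Swapping `D2` for a squared Bray-Curtis matrix would be the PCoA-on-another-metric case, where the equivalence with PCA no longer holds.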
