Ecological Dynamic Programming Optimization in Python

It’s been a while since I’ve posted anything. I’ll use the excuse that I’ve been busy, but mostly I just forget. Regardless, I’m learning how to do Dynamic Programming Optimization (DPO), which sounds more complex than it is. DPO allows us to simulate the behavior of individuals who make decisions based on current patch quality. It’s an exciting tool that forms the foundation of individual-based models, which let us assess community and ecosystem dynamics as emergent properties of individual decisions grounded in physiological principles (hence my interest). The underlying idea, as I understand it, is that individuals assess their future reproductive output prior to making a decision (I’m interested in optimal foraging, so I’ll put this in that context). We can model how individuals make decisions over the course of a lifetime, and track each individual, which then allows us to make quantitative statements about populations, communities, or ecosystems.

This all sounds complicated, and it can be difficult, so it’s easiest to jump right in with an example. Here’s the code, along with a basic explanation of what’s going on, for the very first toy example found in “Dynamic State Variable Models in Ecology” by Clark and Mangel. In the book, they give the computer code in True BASIC, but, in all honesty, no one I know uses that. I could use R, but we all know my feelings on that. So here’s their toy model programmed and annotated in Python.

Background

Suppose we’re interested in the behavior of fish and how they choose to distribute themselves among three different patches: two foraging patches and one reproduction patch. The first thing we need to do is define a terminal fitness function F. In this example, F relates body size to reproductive output. It can be thought of as what the organism “believes” about its final fitness given a particular body size. Let’s make it increasing, but asymptotic, with body mass:

F = \frac{A(x-x_c)}{x-x_c+x_0}

Here, A is the asymptotic maximum fitness (beyond a certain size, fitness increases only minimally), x_c is the critical body size at which mortality occurs, and x_0 controls how quickly fitness saturates (when x-x_c=x_0, fitness is A/2). Let’s set x_c=0, A=60, and x_0=0.25x_{max}, where x_{max}=30 is the individual’s maximum possible body size. You have to set a ceiling on body size, otherwise organisms grow without bounds.
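As a quick sanity check with these numbers, an individual of mass x=15 expects a terminal fitness of

F = \frac{60(15-0)}{15-0+7.5} = 40

which is already two-thirds of the asymptotic maximum A.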

OK, so that’s what the organism “believes” about its fitness at the end of a season given a specific body mass. The question is: how does the organism forage optimally to maximize fitness at the end of its lifetime?

The obvious answer would be to simulate a foraging decision at every time step moving forward, then have the organism decide again at the new time step, and so on. This is computationally expensive, so to circumvent it we work backwards from the end of the season. This saves time because there are far fewer paths to a known destination than there are to an unknown one (essentially).

So we imagine that, at the final time step t_f, the organism’s fitness is given by F for any given body mass.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

n_bodysize = 31 # 31 body sizes, from 0-30
n_time = 20 # 20 time steps, 1-20
n_patch = 3 # 3 patches

# make a container for body size (0-30) and time (1-20)
F = np.zeros((n_bodysize, n_time))

# make the function F
x_crit = 0
x_max = 30
acap = 60
x_0 = 0.25*x_max
t_max = 20

# calculate organism fitness at the final time point
for x in range(x_crit, x_max+1):
    F[x,-1] = acap*(x-x_crit)/(x-x_crit+x_0)

Now that we know fitness at the final time point, we can iterate backwards in time (called backward iteration) to find the optimal strategy for each body mass at every earlier time step. To determine that, we need to know the expected fitness in each patch. Let’s start with the two foraging patches, Patch 1 and Patch 2. We need to know four things about each patch: (1) the mortality rate in each patch (m), (2) the probability of finding food in each patch (p), (3) the metabolic cost of visiting a patch (a), and (4) the gain in body mass if food is successfully found (y). For these two patches, let:

Patch 1: m=0.01, p=0.2, a=1, y=2

Patch 2: m=0.05, p=0.5, a=1, y=4

Right away we can see that Patch 2 is high-risk, high-reward compared to Patch 1. In each patch, we calculate the next body size given that an animal does (x‘) or does not (x”) find food:

x' = x-a_i+y_i

x'' = x-a_i

Those are simple equations: the next body size is the current body size minus the metabolic cost of foraging in the patch plus, if successful, the energy gain from foraging. Great. Now we can calculate the expected fitness of each patch as the weighted average of F_{t+1}(x') and F_{t+1}(x''), weighted by the probability of finding food, all times the probability of actually surviving. For these two patches, we get a fitness function (V):

V_i = (1 - m_i)\left[p_i F_{t+1}(x') + (1-p_i) F_{t+1}(x'')\right]

The reproductive patch is different: no foraging occurs there. Instead, if the organism is above a critical mass x_{rep}, it devotes all excess energy, up to a limit (c=4), to reproduction. If the organism is below the reproductive threshold and still visits the reproductive patch, it simply loses mass (unless it dies).
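Written in the same notation as V above (and matching the solver code below), the value of the reproductive patch (Patch 3) for x \geq x_{rep} is

V_3 = \min(x-x_{rep}, c) + (1-m_3)F_{t+1}(x - a_3 - \min(x-x_{rep}, c))

while for x < x_{rep} it is just V_3 = (1-m_3)F_{t+1}(x-a_3), with the new body size never allowed to fall below x_c.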

OK, this is all kind of complicated, so let’s step through it. We know what fitness is at the final time step because of F. So let’s step back one time step. At this penultimate time step, we go through every body mass and calculate the expected fitness in each patch. Let’s do an example. If x=15, then we need to know fitness in Patch 1, Patch 2, and Patch 3. For Patches 1 and 2, we need to know the new body mass if the individual successfully finds food and if it does not.

x' = \min(15-a_1+y_1, x_{max})

x' = 15-1+2 = 16

x'' = \max(15-a_1, x_c)

x'' = 15-1 = 14

The min and max functions here just make sure our organism doesn’t grow without limit and dies if metabolic cost exceeds body mass. So these are now the two potential outcomes of foraging in Patch 1 given x=15. The expected fitnesses of these two body masses are F(16) and F(14). Plug all these values into the equation for V for Patch 1 to get the expected fitness of foraging in Patch 1 at a body size of 15. We then take the maximum over all three patches, save the corresponding patch as the optimal foraging decision, and save that maximum as the fitness for body size 15 at that time step. So at body size 15, V for Patch 1 is 39.0, for Patch 2 is 38.7, and for Patch 3 is 37.6, so the individual will forage in Patch 1 and the fitness for body size 15 at this time step is 39.0.
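If you want to check that arithmetic yourself, here’s a minimal sketch using only the parameter values already given above:

# quick check of the worked example (x = 15, penultimate time step)
acap, x_0 = 60.0, 7.5                      # A, and x_0 = 0.25*x_max
F = lambda x: acap*x/(x + x_0)             # terminal fitness, with x_crit = 0

V1 = (1 - 0.01)*(0.2*F(16) + 0.8*F(14))    # Patch 1: ~39.0
V2 = (1 - 0.05)*(0.5*F(18) + 0.5*F(14))    # Patch 2: ~38.7
V3 = 4 + (1 - 0.02)*F(15 - 1 - 4)          # Patch 3: ~37.6
print(round(V1, 1), round(V2, 1), round(V3, 1))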

Repeat this procedure for every possible body size, and you’ll get the fitness for every body size at the second to last time step as well as the optimal foraging patch for every body size at that time.

Then, step backwards in time. Repeat this whole procedure, except now the value for each body mass doesn’t come from the equation F, but from the fitness we just calculated for each body mass. So, for example, if an organism’s foraging decisions at this time step lead it to a body mass of 15, its expected future fitness is now 39.0. Again, repeat this for every body mass, then step back, and so on.
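In symbols, the whole backward recursion is just

F(x,t) = \max_i V_i(x,t)

applied from the penultimate time step back to the first, with the terminal F(x, t_{max}) given by the asymptotic function above.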

Here’s the full Python code for how this is done:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

n_bodysize = 31 # 31 body sizes, 0 - 30
n_time = 20 # 20 time steps, 1 - 20
n_patch = 3 # number of patches

#%% CONTAINERS
# fitness for every body size (0-30) and time (1-20)
F = np.zeros((n_bodysize, n_time))
# expected fitness for each patch, body size, and time
Vi = np.zeros((n_patch, n_bodysize, n_time))
# the optimal decision for each body size and time
d = np.zeros((n_bodysize, n_time))

# mortality rate in each patch
m = np.zeros(n_patch)
# probability of finding food in each patch
p = np.zeros(n_patch)
# metabolic cost of visiting each patch
a = np.zeros(n_patch)
# food gain in each patch
y = np.zeros(n_patch)
# maximum reproductive output in each patch
c = np.zeros(n_patch)

#%% CONDITIONS
x_crit = 0
x_max = 30
t_max = 20
x_rep = 4

#%% PARAMETERS
m[0] = 0.01; m[1] = 0.05; m[2] = 0.02
p[0] = 0.2; p[1] = 0.5; p[2] = 0
a[0] = 1; a[1] = 1; a[2] = 1
y[0] = 2; y[1] = 4; y[2] = 0
c[0] = 0; c[1] = 0; c[2] = 4

#%% END CONDITION
acap = 60
x_0 = 0.25*x_max

# calculate fitness for every body mass at the final time step
for x in range(x_crit, x_max+1):
    F[x,-1] = acap*(x-x_crit)/(x-x_crit+x_0) # maximum reproductive output for each body size at the final time

#%% SOLVER
for t in reversed(range(t_max-1)): # for every time step, working backward in time
    for x in range(x_crit+1, x_max+1): # for every body size
        for patch in range(n_patch): # for every patch
            if patch in [0,1]: # if in a foraging patch
                xp = min(x-a[patch]+y[patch], x_max) # new body size if food is found, capped at x_max
                xpp = max(x-a[patch], x_crit) # new body size if no food is found
                # expected fitness of foraging in this patch
                Vi[patch,x,t] = (1-m[patch])*(p[patch]*F[int(xp),t+1] + (1-p[patch])*F[int(xpp),t+1])
            else: # in the reproduction patch
                if x < x_rep: # below the reproductive threshold: just pay the metabolic cost
                    xp = max(x-a[patch], x_crit)
                    Vi[patch,x,t] = (1-m[patch])*F[int(xp),t+1]
                else:
                    fitness_increment = min(x-x_rep, c[patch]) # resources devoted to reproduction
                    xp = max(x-a[patch]-fitness_increment, x_crit) # new body size: minus metabolism, minus reproduction
                    Vi[patch,x,t] = fitness_increment + (1-m[patch])*F[int(xp),t+1]
        vmax = Vi[:,x,t].max() # best expected fitness across the three patches
        d[x,t] = Vi[:,x,t].argmax() + 1 # the optimal patch (1, 2, or 3)
        F[x,t] = vmax # the expected fitness at this time step for this body mass

This model doesn’t really track individual behavior. What it does is provide an optimal decision for every time and every body mass. So we know what, say, an individual should do if it finds itself small towards the end of its life, or if it finds itself large at the beginning:

T,X = np.meshgrid(range(1, t_max+1), range(x_crit, x_max+1))
df = pd.DataFrame({'X': X.ravel(), 'T': T.ravel(), 'd': d.ravel()})
df = df.pivot(index='X', columns='T', values='d')
sns.heatmap(df, vmin=1, vmax=3, annot=True, cbar_kws={'ticks': [1,2,3]})
plt.gca().invert_yaxis()
plt.show()

[Figure: heatmap of the optimal patch decision (1, 2, or 3) for every body mass and time step; dynamic_opt.png]

And that’s that!! To be honest, I still haven’t wrapped my head around this fully. I wrote this blog post in part to make myself think harder about what was going on, rather than just regurgitating code from the book.

An intuitive explanation for the ‘double-zeroes’ problem with Euclidean distances

First, some background. Given a multivariate dataset with a large number of descriptor variables (i.e. columns in the matrix), ecologists (and others) often try to distill all of the descriptors into a single metric describing the relatedness of the objects in the matrix (i.e. rows). They usually do this by calculating one of many ‘distance’, ‘similarity’, or ‘dissimilarity’ metrics, all of which have various properties. Commonly in ecology, this is done for site x species matrices, where ecologists attempt to describe how sites are related to one another based on community composition. By far the most common metric is Euclidean distance, which follows from the Pythagorean theorem. Suppose we have two sites, or rows, called ‘1’ and ‘2’, because I’m feeling creative. Then site 1 is a vector \mathbf{x_1} with one entry per species, and likewise for site 2, \mathbf{x_2}. The Euclidean distance is the square root of the summed squared differences between the two sites:

\sqrt{ \sum_{i=1}^n (x_{1i} - x_{2i})^2 }

or in vector notation:

\sqrt{ (\mathbf{x_1} - \mathbf{x_2})'(\mathbf{x_1}-\mathbf{x_2}) }

We square so far?

The common criticism of Euclidean distance is that it ‘counts double zeros’, so that species absent from both sites actually lead to the sites being more similar than they would be otherwise. A number of other metrics, like the chord distance, supposedly don’t have this problem. The chord distance is the Euclidean distance of normalized vectors. Define \mathbf{n_1} as the normalized vector of Site 1, \mathbf{x_1}, and the same for Site 2:

\mathbf{n_1} = \frac{\mathbf{x_1}}{\sqrt{\mathbf{x_1'x_1}}}

and so on for Site 2. The chord distance then has exactly the same form as the Euclidean distance above:

\sqrt{ (\mathbf{n_1} - \mathbf{n_2})'(\mathbf{n_1} - \mathbf{n_2}) }

The question I’ve always had is this: how can the Euclidean distance count double zeroes while the chord distance, which is itself a Euclidean distance, does not? The answer is that neither of them does. You can add as many double zeroes as you like to the two vectors and the distance does not change. For example, imagine two sites with three species, \mathbf{x_1} = [0, 4, 8] and \mathbf{x_2} = [0, 1, 1]. The Euclidean distance between these two sites is 7.6158. The chord distance is 0.3203. Now, let’s tack on 5 zeroes to each site (5 double zeroes). Both the Euclidean and chord distances are unchanged. This is because the zeroes cancel out, (0-0)^2 = 0, so they contribute nothing to the distance. This is the same rationale that Legendre and Legendre give in Numerical Ecology for why double zeroes do not contribute to chi-square metrics, yet the same applies to Euclidean distances.

So what’s the deal with Euclidean distance and double zeroes? Obviously the zeroes cancel, just as in other metrics. The issue comes up when you use Euclidean distances on raw abundances and attempt to make inferences about species composition, which leads to the so-called paradox of Euclidean distances. Let’s take the example matrix:

\begin{bmatrix} 0 & 4 & 8 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}

Sites 1 and 2 share two species in common, while Site 3 is all by its one-sies. If you calculate the Euclidean distances between these sites, you get:

\begin{bmatrix} 0 & 7.62 & 9 \\ 7.62 & 0 & 1.73 \\ 9 & 1.73 & 0 \end{bmatrix}

Sites 2 and 3 are more similar to each other than Sites 1 and 2 are, even though Sites 2 and 3 share no species at all! Let’s try the chord distances. Doing that, we get:

\begin{bmatrix} 0 & 0.32 & 1.41 \\ 0.32 & 0 & 1.41 \\ 1.41 & 1.41 & 0 \end{bmatrix}

That’s better. Now Site 3 is equally distant from Sites 1 and 2, since it shares no species with either of them. So what the hell? This is why it’s termed a paradox. But if I’ve learned anything by watching the iTunes U lectures of Harvard Stats 110 (thanks, Joel!), it’s that anything called a paradox just means you haven’t thought about it long enough. Here’s a hint: the answer isn’t that Euclidean distance counts double zeroes while chord distance does not, as shown above, especially since the chord distance is Euclidean; it uses the exact same equation.

The answer is actually much simpler, and non-mathy. Euclidean distances on raw abundance values place a premium on differences in the number of individuals, not species. So it’s actually getting it right. Sites 2 and 3 have 2 and 1 individuals in total, respectively. When you take the difference, you’re basically counting up the number of individuals the sites do not share. Here, Sites 2 and 3 differ by only three individuals, Sites 1 and 3 differ by 13 individuals, and Sites 1 and 2 differ by 10 individuals. So by this math, Sites 2 and 3 actually should be really similar.
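That head-count logic is just the L1 (‘Manhattan’) distance between the raw abundance vectors, which you can verify quickly:

import numpy as np

x1, x2, x3 = np.array([0, 4, 8]), np.array([0, 1, 1]), np.array([1, 0, 0])
# number of individuals not shared between each pair of sites
print(np.abs(x2 - x3).sum())   # 3
print(np.abs(x1 - x3).sum())   # 13
print(np.abs(x1 - x2).sum())   # 10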

Chord distances (and \chi^2 distances, and others) standardize the data, taking differences in total abundance out of the equation. Instead, they compare how individuals are distributed across species. Since all of Site 3’s individuals are in the first species, while Sites 1 and 2 spread theirs across the second and third species, obviously Sites 1 and 2 will be more similar. This is why McCune and Grace even say that Euclidean distances on relativized species abundances are OK. If you want to compare species composition using Euclidean distances, you first need to take differences in abundance out of the equation. All of the other ‘non-zero-counting’ distances more or less do the same thing.
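As a minimal sketch of that point: divide each row by its site total (relative abundances), then take plain Euclidean distances. Site 3 comes out roughly equidistant from Sites 1 and 2, much like with the chord distance:

import numpy as np
from itertools import combinations

sites = np.array([[0, 4, 8], [0, 1, 1], [1, 0, 0]], dtype=float)
rel = sites / sites.sum(axis=1, keepdims=True)   # rows now sum to 1
for i, j in combinations(range(3), 2):
    diff = rel[i] - rel[j]
    print(i+1, j+1, round(np.sqrt(diff.dot(diff)), 3))
# prints: 1 2 0.236, 1 3 1.247, 2 3 1.225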

If your question is how sites vary in both abundance AND species composition, then Euclidean distance is probably OK. Just don’t use PCA on species abundances. Ever.

By the way, the iTunes U Harvard Stats 110 series is awesome, and Joel Blitzstein is a great lecturer. Totally worth the time to watch all the lectures. And it’s free.

Python code for the above is here:


import numpy as np

x1 = np.array([0, 4, 8])
x2 = np.array([0, 1, 1])
Euc_D = np.sqrt( (x1-x2).dot(x1-x2) ) # Euclidean distance, 7.6158

# normalize each vector to unit length, then take the Euclidean distance
n1 = x1/np.sqrt( x1.dot(x1) )
n2 = x2/np.sqrt( x2.dot(x2) )
Chord_D = np.sqrt( (n1-n2).dot(n1-n2) ) # chord distance, 0.3203

# tack on 5 double zeroes; neither distance changes
x1_2 = np.append(x1, np.zeros(5))
x2_2 = np.append(x2, np.zeros(5))
Euc_D2 = np.sqrt( (x1_2 - x2_2).dot(x1_2 - x2_2) )

n1_2 = x1_2 / np.sqrt(x1_2.dot(x1_2))
n2_2 = x2_2 / np.sqrt(x2_2.dot(x2_2))
Chord_D2 = np.sqrt((n1_2 - n2_2).dot(n1_2 - n2_2))

# pairwise distance matrices for the three-site example
x3 = np.array([1, 0, 0])
Sites = np.array([x1, x2, x3])
Euc_M = np.zeros([3, 3])
for i in range(3):
    for j in range(3):
        Euc_M[i,j] = np.sqrt((Sites[i,:] - Sites[j,:]).dot( Sites[i,:] - Sites[j,:] ) )

Chord_Sites = np.apply_along_axis(lambda x: x/np.sqrt(x.dot(x)), 1, Sites )
Chord_M = np.zeros([3, 3])
for i in range(3):
    for j in range(3):
        Chord_M[i,j] = np.sqrt( (Chord_Sites[i,:] - Chord_Sites[j,:]).dot( Chord_Sites[i,:] - Chord_Sites[j,:] ) )

Ecological Lexicon

Hey everyone,

Some grad student friends of mine and I have been digging through the ecological literature to determine how ecologists use different terms (community, assemblage, guild, and ensemble) and whether their usage differs from the definitions in textbooks and other sources. It can be fairly confusing, seeing as how many terms are often used interchangeably and defined differently depending on what textbook you used as an undergraduate or graduate student. There have been previous attempts to synthesize these definitions, but it’s unclear how successful they were. We’ve conducted a literature and textbook survey of the usage of these terms, but we also want to couple this with actual data on how ecologists and practitioners define these terms themselves, to see if some of the confusion in the literature results from confusion amongst ecologists about the definitions. Please try to limit answers to one or two sentences. Thanks for your participation! Karma is your great reward.

What Does Open-Access Data Really Mean?

I’ve recently been digging through the internet looking for data on plant traits (things like seed mass, SLA, etc.). I thought this would be relatively simple given the recent push towards open-access data repositories and the arrival of ‘big data ecology’. Data repositories like Dryad or The Knowledge Network for Biocomplexity aim to make finding raw data easy. Other websites, like TraitNet, LEDA, or TRY, aim to compile all of the existing data on a particular subject (like plant traits) into a single location, vet the data, standardize collection practices, and provide the data for use. Other institutions, like NCEAS, appear to have requirements that the data be made publicly available and have their own public, searchable repository.

My question, after trying unsuccessfully for several days to actually get any of this data, is: what does open-access data really mean? Personally, I pictured a searchable database that would pull up data related to your search terms, provide metadata to determine if the data is useful, and then you click download and *poof*, data on your hard drive. This turned out to be pretty rare.

Some of those databases contain no data; they simply provide links to other databases. Data from Dryad or KNB is spotty: sometimes you can download it, sometimes it’s listed but not publicly available, sometimes it takes you to another website that asks for a login ID and requires author permission to use. That last bit is common: data is posted, you are redirected to a secondary website requiring a login, possibly requiring author permission (which means emailing the author and waiting for a response), and then, sometimes, requiring you to upload data of your own (i.e. TRY), which is hard for people concentrating on meta-analyses.

I guess my complaint is just that, when I see open-access databases, I imagine searching, browsing, and downloading. It really seems like we’re not quite there yet; there are still dozens of hoops to jump through that make actually getting your hands on the data difficult. One could argue that the author of the data should always have the final say as to whether the data is downloaded. In effect, they do. By uploading the data, the authors are implicitly acknowledging that they have gotten their publications from it, and while they might still make use of it, it’s available for anyone else. If the data is still being worked on, the authors can always choose not to upload it.

I like the premise of open-access data a lot (it’s why I make all my data and code available on my website), but I don’t think we’ve quite gotten the spirit of it yet.

My Recent Ecology Presentation at ESA Using LaTeX Beamer


I’m currently in Minneapolis at the annual Ecological Society of America (ESA) Meeting. I just gave my talk yesterday. I’m always experimenting with new presentation software (I dislike PowerPoint; I love Pages, but no one supports it) and styles. I’ve …