PyStan: A Second Intermediate Tutorial of Bayesian Analysis in Python

I promised a while ago that I’d give a more advanced tutorial of using PySTAN and Python to fit a Bayesian hierarchical model. Well, I’ve been waiting for a while because the paper was in review and then in print. Now, it’s out and I’m super excited! My first pure Python paper, using Python for all data manipulation, analysis, and plotting.

The question was whether temperature affects herbivory by insects in any predictable way. I gathered as many insect species as I could and fed them whatever they ate at multiple temperatures. Check the article for more detail, but the idea was to fit a curve to all 21 herbivore-plant pairs as well as to estimate the overall effect of temperature. We also suspected (incorrectly as it turns out) that plant nutritional quality might be a good predictor of the shape of these curves, so we included that as a group-level predictor.

Anyway, here’s the code, complete with STAN model, posterior manipulations, and some plotting. First, here’s the actual STAN model. NOTE: a lot of data manipulation and whatnot is missing. The point is not to show that but to describe how to fit a STAN model and work with the output. Anyone who wants the full code and data to work with can find it on my website or in Dryad (see the article for a link).

stanMod = """
int<lower = 1> N;
int<lower = 1> J;
vector[N] Temp;
vector[N] y;
int<lower = 1> Curve[N];
vector[J] pctN;
vector[J] pctP;
vector[J] pctH20;
matrix[3, 3] R;

vector[3] beta[J];
real mu_a;
real mu_b;
real mu_c;
real g_a1;
real g_a2;
real g_a3;
real g_b1;
real g_b2;
real g_b3;
real g_c1;
real g_c2;
real g_c3;
real<lower = 0> sigma[J];
cov_matrix[3] Tau;

transformed parameters{
vector[N] y_hat;
vector[N] sd_y;
vector[3] beta_hat[J];
// First, get the predicted value as an exponential curve
// Also make a dummy variable for SD so it can be vectorized
for (n in 1:N){
y_hat[n] <- exp( beta[Curve[n], 1] + beta[Curve[n], 2]*Temp[n] + beta[Curve[n], 3]*pow(Temp[n], 2) );
sd_y[n] <- sigma[Curve[n]];
// Next, for each group-level coefficient, include the group-level predictors
for (j in 1:J){
beta_hat[j, 1] <- mu_a + g_a1*pctN[j] + g_a2*pctP[j] + g_a3*pctH20[j];
beta_hat[j, 2] <- mu_b + g_b1*pctN[j] + g_b2*pctP[j] + g_b3*pctH20[j];
beta_hat[j, 3] <- mu_c + g_c1*pctN[j] + g_c2*pctP[j] + g_c3*pctH20[j];

y ~ normal(y_hat, sd_y);
for (j in 1:J){
beta[j] ~ multi_normal_prec(beta_hat[j], Tau);
mu_a ~ normal(0, 1);
mu_b ~ normal(0, 1);
mu_c ~ normal(0, 1);
g_a1 ~ normal(0, 1);
g_a2 ~ normal(0, 1);
g_a3 ~ normal(0, 1);
g_b1 ~ normal(0, 1);
g_b2 ~ normal(0, 1);
g_b3 ~ normal(0, 1);
g_c1 ~ normal(0, 1);
g_c2 ~ normal(0, 1);
g_c3 ~ normal(0, 1);
sigma ~ uniform(0, 100);
Tau ~ wishart(4, R);

# fit the model!
fit = pystan.stan(model_code=stanMod, data=dat,
 iter=10000, chains=4, thin = 20)

Not so bad, was it? It’s actually pretty straightforward.

After the model has been run, we work with the output. We can check traceplots of various parameters:

fit.plot(['mu_a', 'mu_b', 'mu_c'])
fit.plot(['g_a1', 'g_a2', 'g_a3'])
fit.plot(['g_b1', 'g_b2', 'g_b3'])
fit.plot(['g_c1', 'g_c2', 'g_c3'])

As a brief example, we can extract the overall coefficients and plot them:

mus = fit.extract(['mu_a', 'mu_b', 'mu_c'])
mus = pd.DataFrame({'Intercept' : mus['mu_a'], 'Linear' : mus['mu_b'], 'Quadratic' : mus['mu_c']})

py.plot(mus.median(), range(3), 'ko', ms = 10)
py.hlines(range(3), mus.quantile(0.025), mus.quantile(0.975), 'k')
py.hlines(range(3), mus.quantile(0.1), mus.quantile(0.9), 'k', linewidth = 3)
py.axvline(0, linestyle = 'dashed', color = 'k')
py.xlabel('Median Coefficient Estimate (80 and 95% CI)')
py.yticks(range(3), ['Intercept', 'Exponential', 'Gaussian'])
py.ylim([-0.5, 2.5])
py.title('Overall Coefficients')

The resulting plot:


We can also make a prediction line with confidence intervals:

#first, define a prediction function
def predFunc(x, v = 1):
 yhat = np.exp( x[0] + x[1]*xPred + v*x[2]*xPred**2 )
 return pd.Series({'yhat' : yhat})

# next, define a function to return the quantiles at each predicted value
def quantGet(data , q):
 quant = []
 for i in range(len(xPred)):
 val = []
 for j in range(len(data)):
 val.append( data[j][i] )
 quant.append( np.percentile(val, q) )
 return quant

# make a vector of temperatures to predict (and convert to the real temperature scale)
xPred = np.linspace(feeding_Final['Temp_Scale'].min(), feeding_Final['Temp_Scale'].max(), 100)
realTemp = xPred * feeding_Final['Temperature'].std() + feeding_Final['Temperature'].mean()

# make predictions for every chain (in overall effects)
ovPred = mus.apply(predFunc, axis = 1)

# get lower and upper quantiles
ovLower = quantGet(ovPred['yhat'], 2.5)
ovLower80 = quantGet(ovPred['yhat'], 10)
ovUpper80 = quantGet(ovPred['yhat'], 90)
ovUpper = quantGet(ovPred['yhat'], 97.5)

# get median predictions
ovPred = predFunc(mus.median())

Then, just plot the median (ovPred) and the quantiles against temperature (realTemp). With just a little effort, you can wind up with something that looks pretty good:


I apologize for only posting part of the code, but the full script is really long. This should serve as a pretty good start for anyone looking to use Python as their Bayesian platform of choice. Anyone interested can get the data and full script from my article or website and give it a try! It’s all publicly available.

Ecological Lexicon

Hey everyone,

Some grad student friends of mine and I have been digging through the ecological literature to determine how ecologists use different terms (community, assemblage, guild, and ensemble) and whether their usage differs from that of definitions in textbooks and other sources. It can be fairly confusing, seeing as how many terms are often used interchangeably and defined differently depending on what textbook you used as an undergraduate or graduate student. There have been previous attempts to synthesize these definitions, but it’s unclear how successful they were. We’ve conducted a literature and textbook survey of the usage of these terms, but we also want to couple this with actual data on how ecologists and practitioners define these terms themselves to see if some of the confusion in the literature results from confusion amongst ecologists about the definitions. Please try to limit answers to one or two sentences. Thanks for your participation! Karma is your great reward.

Saturation Science

The other day I introduced our upcoming Aquarius mission, so I thought today I would take a few minutes to introduce the science behind it all. While the opportunity to live underwater and dive nine hours a day seems like reason enough to me, it can get pricey to pull off such an endeavor and “because it’s awesome” typically doesn’t get a project funded. In my previous post I mentioned  I’d be studying herbivory and grazing across a depth gradient so I thought I would briefly discuss why it’s important to understand.

I won’t spend much time here arguing why reefs are important; suffice to say, if you don’t appreciate the intrinsic value of a reef, or that ~ 10% of the world’s population depends on them for food, you may appreciate the services they provide, such as tourism, fisheries, coastal protection and natural products, which could be worth as much as $375 billion per year. Despite their various values, coral reefs are in bad shape these days and their prognosis looks bleak. Warming oceans, overfishing, changes in seawater chemistry, and more severe tropical storms are just a few of the factors that have contributed to the loss of as much as a third of world’s coral reefs in just a decade. When these slow-growing corals die on shallow reefs, faster- growing algae quickly take their place. These algae smother and kill the remaining corals, while preventing new coral from settling. The job of preventing this transition from a coral- to algae-covered reef falls to coral reef herbivores such as urchins and parrotfishes, which eat algae and maintain the open space for swimming coral larvae to settle and grow. Unfortunately, due to human impacts the speed at which corals are now dying is so fast that herbivore populations, already decimated by overfishing and disease, are often unable to keep pace with the algae growth.

A time series taken by geologist Eugene Shinn showing the same location in the Florida Keys from 1959 - 1998.

A time series taken by geologist Eugene Shinn showing the same location in the Florida Keys from 1959 – 1998.

But there’s still hope for the future. In addition to traditional strategies, such as better management of shallow reefs, deep-water or “mesophotic” coral reefs that grow between 30 m and the depths that light can no longer penetrate, are gaining attention from researchers as naturally protected “source” populations for surrounding areas. Because the deep waters above these reefs can provide a natural buffer from threats such as artisanal fishing, storm damage, and the high-temperatures that trigger coral bleaching, these reefs are naturally protected from many of threats humans have created for shallower reefs. As a result, these deep reefs may act as a sort of self-sufficient Marine Protected Area, where species survive and their offspring eventually repopulate the surrounding shallower reefs. Due to the technical difficulties of working at depth, researchers are only recently beginning to understand the role of these deep coral reefs and the processes important to their continued health. While we know herbivory is important for shallower reefs, declining numbers of herbivores at depth, coupled with slower rates of algae growth, have created uncertainty in how important herbivory is to maintaining the health of deep-water coral reefs. And that’s where our science fits in. Starting on Sunday we will begin documenting herbivore communities, feeding patterns and algae growth to understand the role of herbivores in structuring these important systems.

Time for Miami to Revamp its Public Transportation System

Lackluster Transportation in the Magic City

Miami-Dade suffers from some of the worst driving conditions in the United States. Numerous bottlenecks along I-95, the only traffic corridor connecting metropolitan South Florida, bring traffic to a standstill at nearly all hours of the day. Traffic analysts at INRIX consistently rank among the top 15 most congested cities in the U.S. Congestion is only part of the issue, however. The Florida stretch of I-95 incurs more fatalities per year than any other interstate segment in the country. Many of these fatalities occur in Miami-Dade, which 242 traffic-related deaths in 2011, more than any other Florida county. Local troopers blame driver distraction (e.g. texting) coupled with aggressive driving. Meanwhile, Florida legislature will not prohibit texting while driving in any meaningful way; the recent anti-texting law treats texting as a ‘secondary offence’, punishable only if the driver has already been pulled over for another offence. Given these factors, being off the road in Miami is far better, and safer, than being on it.


A typical day on the Miami freeway

Unfortunately, Miami provides one of the worst public transportation systems of any major metropolitan area in the U.S. Even some cities in South America, like Bogata, Colombia, have more comprehensive public transportation (and bike lanes). Numerous Miami-Dade Transit (MDT) websites extoll the virtues of public transportation in the greater Miami area. To the uninformed, these websites can be deceptive.

The MDT bus system is chaotic and woefully inefficient. A trip that takes 20 minutes by car requires over two hours, at best, in the MDT bus system. One prime is example is the route between Coconut Grove and FIU’s main campus on 8th St. In part, this is because there are bus stops every other block. Contrast this with efficient busing in Houston, Washington, or even Bogata, where pick-up locations are further apart and dedicated bus lanes included along many heavily trafficked routes. In those cities, transfers among routes operate efficiently, providing timely access to nearly anywhere in the city. Despite these inefficiencies, last year the federal government donated $10 million to upgrade MDT’s bus fleet with hybrid buses. With brand new buses, MDT should (re)organize routes, making buses a viable transport option for the numerous taxpayers in need of car-free alternatives.

Metro Light

Stunning in both its breadth and complexity

Stunning in both its breadth and complexity

Miami also has an extremely underdeveloped rail system. The MetroRail was completed in 1984, costing over $1 billion (the Federal Transit Authority (FTA) footed roughly 80% of the bill). For that sum, Miami received one rail line between South Miami, through downtown, and then west to Hialeah and Palmetto, with no stop at Miami International Airport. Whether an oversight or deliberately poor planning, it took 25 years to rectify this mistake. Although there are now two lines, Orange and Green, there is still but one track. There are no stops in South Beach, midtown, or north along the US 1 corridor for residents of North Miami, Aventura, and Hollywood. This accounts, in part, for the horrid congestion on I-95. Rail extensions were planned, but shelved when FTA withdrew financial support due to concerns over MDT’s estimates of cost as well as suspicion of government corruption.

Yet, MDT proclaims that 15% of Miami’s population (~70,000 passengers) pass through MetroRail turnstiles every weekday. This misleading percentage only holds when population estimates are restricted to the depopulated City of Miami metro area. This excludes Hialeah, Opa-Locka, Coral Gables, Doral, Miami Beach, North Miami, North Miami Beach, Kendall, South Miami, etc. Residents of all of these regions commute daily through the Miami metro area. Adding in the populations of these cities reduces the number to ~7%. National context is also important. Although 70,000 passengers per day sounds impressive, this number pales in comparison to daily light rail ridership in other major U.S. cities: New York City (5.4 million), Washington D.C. and Chicago (790,000), San Francisco (375,000). Comparatively, the Miami MetroRail is sadly underused.

By comparison....

By comparison…. I give you Washington DC

Can Miami Revamp its Public Transportation System?

Yes. City officials are reexamining a possible ‘Bay Link’ rail line to South Beach. The Bay Link had been considered in the early 2000s, but the issue died in 2004. This line would connect tourists with South Beach directly from the airport and, if coupled with rail extensions to other areas of metropolitan South Florida, remove numerous drunk drivers from the freeways. In addition, public support for rail lines is swelling. Wynwood, a burgeoning neighborhood, created an imaginary ‘Purple Line’ stop to gather support for MetroRail expansion.

Can Miami, and MDT, make it work? Possibly. Voters approved a half-penny tax increase a decade ago to fund MDT expansion. These funds paid for the new Orange line stop at the airport. Unfortunately, progress has been slow and taxpayers are losing faith in city officials to address these problems. The FTA withdrew funding for the MetroRail amid speculation of corruption and misappropriations. A continuing scandal, in which the FTA recently ruled that county officials illegally handled a contract for new rail cars, will not assuage the FTAs concerns and bring back federal funding. One viable option is a Private-Public Partnership (PPP), wherein private investors assume much of the cost, and risk, of infrastructure improvements. PPP programs are taking off around the U.S., including South Florida. The current renovation of the 826-836 exchange in one example, as is the new Port of Miami Tunnel. The question is whether MDT can find investors willing to gamble on a greater Miami MetroRail in a city notorious for publics works projects being far over budget and way behind schedule. Then again, perhaps the risks associated with private investment is the remedy to keep construction costs low and on time. Can Miami make a rail line work?

I certainly hope so, for everyone’s sake.

Side note: I submitted a short version of this as an opinion letter to the Miami Herald, but I’ve not heard back.