A Split-Plot Three-Way ANOVA with STAN and Python

Now that STAN has arrived and has a Python interface (PyStan), I’ve moved all of my data analyses over to Python. At first, this was kind of a pain: Python can fit ANOVAs and linear models through an R-like formula interface, but the statsmodels module is still under development and has no support for multi-level models (like split-plot or nested designs). With STAN available from Python, though, you can code these models yourself and run whatever analyses you want using Bayesian methods and principles (which is even better!).

Here’s an example of a split-plot three-way ANOVA-style analysis programmed in STAN and run from Python. Since these data haven’t been published yet, I won’t link to them or discuss them in any way; this post is mainly to help those who might be looking to do something similar. The experimental design has two within-plot factors, chamber temperature and induced, both of which have two levels that I represent as 0 and 1. Chamber temperature is either 0 (25˚) or 1 (30˚), and induced is either 0 (Not Induced) or 1 (Induced). The whole-plot factor, plot temperature, is either 0 (Ambient) or 1 (Warmed). This keeps things pretty simple: since no factor has more than two levels, just one regression coefficient is needed to identify each factor.
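To make that concrete, here is a sketch of how the data might be assembled for the model below (the file and column names are hypothetical; what matters is the 0/1 coding and the names expected by the Stan data block):

import numpy as np
import pandas as pd

# Hypothetical input file and column names
df = pd.read_csv("herbivore_rgr.csv")

stan_data = {
    "N": len(df),                          # number of observations
    "J": df["plot"].nunique(),             # number of plots
    "y": df["rgr"].values,                 # response (herbivore RGR)
    "plot": df["plot"].values,             # plot identifier, coded 1..J
    "induced": df["induced"].values,       # 0 = Not Induced, 1 = Induced
    "temp": df["temp"].values,             # 0 = Ambient, 1 = Warmed
    "chamber_temp": df["chamber"].values,  # 0 = 25, 1 = 30
    # one variance group per induced x chamber x temp combination, coded 1..8
    "var_group": (1 + df["induced"] + 2 * df["chamber"] + 4 * df["temp"]).values,
    # whole-plot temperature, one entry per plot
    "plot_temp": df.groupby("plot")["temp"].first().values,
    "prior_mu": np.zeros(6),               # prior mean for the coefficients
    "prior_cov": np.eye(6) * 16.0,         # vague prior covariance (a guess)
}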

In a split-plot design, the within-plot factors and all interactions involving the within-plot factors occur at the lower level, while the whole-plot factor occurs at the top level. Within-plot regression coefficients are drawn from a multivariate normal distribution, and each whole-plot : within-plot treatment combination is given its own variance, rather than assuming constant variance among groups. Here it is in STAN:

splitPlot = """
data{
    int<lower = 0> N;     // number of observations
    int<lower = 0> J;     // number of plots
    real y[N];     // response (Herbivore RGR)
    int plot[N];     // plot identifier
    int induced[N];     // dummy coding for netting treatment (0 = Not Induced, 1 = Induced)
    int temp[N];     // dummy coding for plot temperature (0 = Ambient, 1 = Warmed)
    int chamber_temp[N];     // dummy coding for the feeding assay temperature (0 = 25, 1 = 30)
    int var_group[N];     // dummy coding for variance group
    int plot_temp[J];     // dummy coding for plot temperature at the plot level (0 = Ambient, 1 = Warmed)
    cov_matrix[6] prior_cov;     // prior for covariance matrix of regression coef
    vector[6] prior_mu;     // prior for mean of regression coefficients (0)
}
parameters{
    real B0[J];     // plot means
    vector[6] B;     // Coefficients
    real G1;     // plant temperature effects
    real mu;     // overall mean
    real <lower = 0, upper = 10> sd_y[8];     // sd for each temperature-chamber-induced group
    real <lower = 0, upper = 10> sd_b0;     // sd of plot-level means
}
transformed parameters{
    vector[N] yhat;
    vector[J] B0hat;
    vector[N] sd_temp;

    for(n in 1:N){
        yhat[n] <- B0[plot[n]] + B[1]*induced[n] + B[2]*chamber_temp[n]
                   + B[3]*induced[n]*temp[n] + B[4]*induced[n]*chamber_temp[n]
                   + B[5]*temp[n]*chamber_temp[n]
                   + B[6]*induced[n]*temp[n]*chamber_temp[n];
        sd_temp[n] <- sd_y[var_group[n]];
    }

    for(j in 1:J){
        B0hat[j] <- mu + G1*plot_temp[j];
    }
}
model{
    y ~ normal(yhat, sd_temp);
    B0 ~ normal(B0hat, sd_b0);

    // PRIOR
    mu ~ normal(0, 4);
    G1 ~ normal(0, 4);
    B ~ multi_normal(prior_mu, prior_cov);
}
generated quantities{
    real induced_ambient25;
    real induced_warmed25;
    real netted_ambient25;
    real netted_warmed25;
    real induced_ambient30;
    real induced_warmed30;
    real netted_ambient30;
    real netted_warmed30;

    induced_ambient25 <- mu + B[1];
    netted_ambient25 <- mu;
    induced_warmed25 <- mu + G1 + B[1] + B[3];
    netted_warmed25 <- mu + G1;

    induced_ambient30 <- mu + B[1] + B[2] + B[4];
    induced_warmed30 <- mu + G1 + B[1] + B[2] + B[3] + B[4] + B[5] + B[6];
    netted_ambient30 <- mu + B[2];
    netted_warmed30 <- mu + G1 + B[2] + B[5];
}
"""

Above is the preferred, linear-model-based specification. It says that the predicted values (y-hat) are a function of the plot mean plus the within-plot factors, while the plot means themselves are a function of an intercept and the whole-plot factor. The mixing is beautiful (100,000 samples, thinning every 20) and quick (just under 3 mins):

[Figure: trace plots showing well-mixed posteriors for all parameters]
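For reference, fitting was done through PyStan; here is a minimal sketch of the call (assuming the stan_data dictionary sketched earlier):

import pystan

# Compile the model and sample: 100,000 iterations, thinning every 20
fit = pystan.stan(model_code=splitPlot, data=stan_data,
                  iter=100000, thin=20, chains=4)
print(fit)  # summary with Rhat and effective sample sizes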

You also get good coefficient estimates that are easy to interpret. I won’t show them here because the data aren’t published, but each regression coefficient is directly interpretable as a main effect or interaction (this would get more complicated if one of my factors had >2 levels, but still doable).
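Pulling the draws out of the fit is simple; for example (a sketch, continuing from the hypothetical fit above):

import numpy as np

samples = fit.extract()   # dict of posterior draws
B = samples["B"]          # draws x 6 matrix of within-plot coefficients
print(B.mean(axis=0))                         # posterior means
print(np.percentile(B, [2.5, 97.5], axis=0))  # 95% credible intervals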

There is an alternative, cell-means formulation where you estimate the cell means directly. Note that here the factors must be coded 1 and 2 rather than 0 and 1, because they are used to index arrays and STAN arrays are 1-indexed.

splitPlot = """
data{
    int<lower = 0> N; // number of observations
    int<lower = 0> J; // number of plots
    real y[N]; // response (Herbivore RGR)
    int plot[N]; // plot identifier
    int induced[N]; // netting treatment, coded for indexing (1 = Not Induced, 2 = Induced)
    int temp[N]; // plot temperature (1 = Ambient, 2 = Warmed)
    int chamber_temp[N]; // feeding assay temperature (1 = 25, 2 = 30)
    int var_group[N]; // variance group (1-8)
    int plot_temp[J]; // plot temperature at the plot level (1 = Ambient, 2 = Warmed)
}
parameters{
    real B0[J]; // plot means
    real B1[2]; // netting effects
    real B3[2]; // chamber temperature effects
    real B2[2,2,2]; // induced x temp x chamber interaction effects
    real G1[2]; // plant temperature effects
    real mu; // overall mean
    real <lower = 0, upper = 10> sd_y[8]; // sd for each temperature-chamber-induced group
    real <lower = 0, upper = 10> sd_b0; // sd of plot-level means
}
transformed parameters{
    vector[N] yhat;
    vector[J] B0hat;
    vector[N] sd_temp;

    for(n in 1:N){
        yhat[n] <- B0[plot[n]] + B1[induced[n]] + B3[chamber_temp[n]] + B2[induced[n], temp[n], chamber_temp[n]];
        sd_temp[n] <- sd_y[var_group[n]];
    }

    for(j in 1:J){
        B0hat[j] <- mu + G1[plot_temp[j]];
    }
}
model{
    y ~ normal(yhat, sd_temp);
    B0 ~ normal(B0hat, sd_b0);

    // PRIOR
    mu ~ normal(0, 4);
    G1 ~ normal(0, 4);
    B1 ~ normal(0, 4);
    B3 ~ normal(0, 4);
}
generated quantities{
    real netted_ambient25;
    real netted_warm25;
    real induced_ambient25;
    real induced_warm25;
    real netted_ambient30;
    real netted_warm30;
    real induced_ambient30;
    real induced_warm30;

    netted_ambient25 <- mu + B1[1] + G1[1] + B2[1,1,1] + B3[1];
    netted_warm25 <- mu + B1[1] + G1[2] + B2[1,2,1] + B3[1];
    induced_ambient25 <- mu + B1[2] + G1[1] + B2[2,1,1] + B3[1];
    induced_warm25 <- mu + B1[2] + G1[2] + B2[2,2,1] + B3[1];

    netted_ambient30 <- mu + B1[1] + G1[1] + B2[1,1,2] + B3[2];
    netted_warm30 <- mu + B1[1] + G1[2] + B2[1,2,2] + B3[2];
    induced_ambient30 <- mu + B1[2] + G1[1] + B2[2,1,2] + B3[2];
    induced_warm30 <- mu + B1[2] + G1[2] + B2[2,2,2] + B3[2];
}
"""

I don’t like this specification as much because the parameters don’t mix well (although the generated quantities themselves are fine). To get good parameter estimates, you need to somehow impose sum-to-zero constraints, either within the STAN model (which I don’t know how to do) or after the fact by manipulating the raw posterior parameter estimates (which I didn’t care enough to figure out). This formulation is also MUCH slower, taking over twice as long as the first model. I suspect it has to do with the 3D array for the interaction term, but I’m not positive.
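For what it’s worth, the post-hoc route amounts to sweeping the mean of each effect’s draws into the intercept, draw by draw. A minimal sketch for the netting effect alone (assuming fit2 is the fitted cell-means model; the name is hypothetical):

import numpy as np

samples = fit2.extract()   # fit2: the cell-means fit (hypothetical name)
mu = samples["mu"].copy()  # intercept draws
B1 = samples["B1"].copy()  # draws x 2 netting effects

# Sweep each draw's mean effect into the intercept so B1 sums to zero
offset = B1.mean(axis=1)
mu += offset
B1 -= offset[:, None]      # B1 now sums to zero within every draw

The same sweep would have to be applied to G1, B3, and every margin of the B2 array, which is where it gets tedious.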

That said, both models give great predictions of the cell means, of which I will only show one tiny bit (black is observed means, red is modeled means):

[Figure: observed (black) vs. modeled (red) cell means for one treatment combination]
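If you want to pull out this kind of comparison yourself, the generated quantities make it easy; a sketch (continuing with the hypothetical names from above):

# Observed cell means from the raw data
obs_means = df.groupby(["induced", "temp", "chamber"])["rgr"].mean()

# Posterior means of the corresponding generated quantities
post_means = {name: fit.extract()[name].mean()
              for name in ["netted_ambient25", "induced_ambient25",
                           "netted_warmed25", "induced_warmed25"]}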