R for Ecologists: Permutation Analysis – t-tests

You’ve carefully designed your experiment, you’ve meticulously collected your data, and you have a hypothesis to test. Unfortunately, your data is typical of ecology data: small sample sizes, messy, and non-normal. Your ideal test, the t-test, won’t work because of the non-normality and sample size is too small to invoke the central limit theorem. All hope is not lost!

Permutational analysis are a fantastic way to analyze data from designed experiments where experimental units have been randomly placed among treatments (see Anderson 2001 Canadian Journal of Fisheries and Aquatic Science for a thorough discussion of permutational analyses). In fact, permutational analysis is great for these carefully designed experiments for two main reasons: 1) It frees us from the stringent assumption of normality and equal variances (update, not necessarily true – see comments); there are few distributional assumptions and 2) We can analyze any derived metric we like. Note that the assumption of independent observations still applies. There is just no getting around pseudoreplication.

First, let’s simulate some data. We’ll start with normal data so we can see the equivalence of permutational analyses and parametric tests when assumptions are met.

Let’s imagine you’re interested in phenotypic plasticity of Gambusia and you raise Gambusia under different predation regimes: predator-rich and predator poor. You suspect the Gambusia will have fast growth rates in predator-rich environments (they want to mature faster and have more, smaller offspring when predators are present). We’ll simulate the data on 15 mosquitofish from the two populations: Predator-Present (PP) and Predator-Absent (PA).

# SET RANDOM NUMBER GENERATOR SEED
set.seed(2001)

# SIMULATE NORMAL DATA
PA <- rnorm(15, 4.5, 2)
PP <- rnorm(15, 7.5, 2)

# PLOT HISTOGRAMS
mf <- par(mfrow=c(1,2))
hist(PA, breaks=5)
hist(PP, breaks=5)
par(mf)

Even though the data from one group or the other may look non-normal, we know the underlying distribution is normal because we simulated from a normal distribution. We can use a basic t-test to determine if there are differences between the two populations

t.test(PA, PP)

You get a very significant p-value and a t-statistic of -5.95.

Now we’ll analyze it with a permutational test. Inherent in our experimental design was the random assignment of individuals to PP or PA treatments. We can view our observed data as just one of many possible arrangements of the data. We can shuffle the observations around and ask “If the observations are randomly assigned treatments, what is the probability of observing our particular arrangement of the data” (note that this is still the frequentist version of hypothesis, what is the probability of our data if the null hypothesis of randomness is true).

We first need to know just how many permutations we’re capable of so we don’t wind up repeating combinations. We have 30 observations, split evenly into two groups. There are 30-choose-15 = 155117520 ways to permute the data uniquely. I think we’re safe. (In R, this is choose(30, 15) ).

We’ll follow this pattern: 1) pool all data, 2) randomly assign 15 observations to PA, 3) assign the rest to PP, 4) calculate the difference in means between the groups, 5) repeat for 9999 iterations, 6) compare our observed mean difference to the distribution of possible mean differences under the randomized procedure.

# POOL DATA
pooledData <- c(PA, PP)
# SET THE NUMBER OF ITERATIONS
nIter <- 9999
# SET UP A CONTAINER FOR PERMUTED DIFFERENCES. ADD IN A SLOT FOR THE OBSERVED VALUE
meanDiff <- numeric(nIter+1)
# CALCULATE THE OBSERVED MEAN DIFFERENCE
meanDiff[1] <- mean(PP) - mean(PA)
# RUN THE ITERATION IN A FOR() LOOP
for(i in 2:length(meanDiff)){ # start from 2 to avoid overwriting the observed difference
 index <- sample(1:30, size=15, replace=F) # Sample numbers 1-30 15 times and store in an index
 PAperm <- pooledData[index] # Assign the sampled values to PA
 PPperm <- pooledData[-index] # Assign everything else to PP
 meanDiff[i] <- mean(PPperm) - mean(PAperm) # Calcualte and store the difference in means
}

# PLOT HISTORGRAM OF DIFFERENCES IN MEANS
hist(meanDiff, xlab='Difference in PP and PA means', prob=T, main='')
# ADD IN A LINE FOR OUR OBSERVED VALUE
abline(v=meanDiff[1], lty=2, col='red')

# CALCULATE THE P-VALUE. USE THE ABSOLUTE VALUE FOR A TWO-TAILED TEST
mean(abs(meanDiff) >= abs(meanDiff[1]))

The last line is a little trick where you get R to assign all differences greater than our observed difference a value of 1 and everything else gets assigned 0. Then you add up the 1’s and divide by the number of observations to get the proportion of observations greater than our observed statistic. This is the same as taking a mean of a vector of 1’s and 0’s. For example, if I have two 1’s and two 0’s, then (1+1+0+0)/4 = 0.5, which is the same as mean(1,1,0,0).

The p-value is incredibly low, similar to what we saw with the t-test.

You can try the same thing with log-normal data

# SIMULATE LOG-NORMAL DATA
PAlnorm <- rlnorm(n=15, mean=log(4.5), sd=log(2))
PPlnorm <- rlnorm(n=15, mean=log(7.5), sd=log(2))

mf <- par(mfrow=c(1,2))
hist(PAlnorm);hist(PPlnorm)
par(mf)

# t-Test
t.test(PAlnorm, PPlnorm)

# POOL DATA
pooledData <- c(PAlnorm, PPlnorm)
# SET THE NUMBER OF ITERATIONS
nIter <- 9999
# SET UP A CONTAINER FOR PERMUTED DIFFERENCES. ADD IN A SLOT FOR THE OBSERVED VALUE
meanDiff <- numeric(nIter+1)
# CALCULATE THE OBSERVED MEAN DIFFERENCE
meanDiff[1] <- mean(PPlnorm) - mean(PAlnorm)
# RUN THE ITERATION IN A FOR() LOOP
for(i in 2:length(meanDiff)){ # start from 2 to avoid overwriting the observed difference
 index <- sample(1:30, size=15, replace=F) # Sample numbers 1-30 15 times and store in an index
 PAperm <- pooledData[index] # Assign the sampled values to PA
 PPperm <- pooledData[-index] # Assign everything else to PP
 meanDiff[i] <- mean(PPperm) - mean(PAperm) # Calcualte and store the difference in means
}

# PLOT HISTORGRAM OF DIFFERENCES IN MEANS
hist(meanDiff, xlab='Difference in PP and PA means', prob=T, main='')
# ADD IN A LINE FOR OUR OBSERVED VALUE
abline(v=meanDiff[1], lty=2, col='red')

mean(abs(meanDiff) >= abs(meanDiff[1]))

The results of the t-test and permutational analysis are again equivalent, although this will not always be the case. (My personal experience is that it is usually the case, so parametric tests are probably more robust to departures from assumptions than we’ve been led to believe).

This code described a t-test. Doing an ANOVA requires a few extra lines of code but is conceptually similar.

First Steps in Fighting Climate Change

California has taken some critical first steps in fighting climate change: The state has instituted a cap-and-trade system of carbon credits for industries across the economy to lower CO2 emissions to 1990 levels by 2020. This is a big, great step in the right direction to reduce carbon dioxide emissions.

A couple of pros and cons, as I semi-understand them:

Pro: If successful, carbon dioxide emissions in the state of California will be lowered 30% by 2020.

Pro: This has smatterings of how Republicans might view carbon dioxide regulation (based on their arguments about health care): allow the states to tailor legislation to their own needs. One size might not fit all. So this might be a template bi-partisan CO2 legislation.

Pro: It allows ‘offsets’, in that companies can offset carbon dioxide production 1-for-1 by taking activities that would soak of carbon dioxide, such as planting forests (trees are like carbon dioxide storage containers). Thus, companies can still emit CO2 as long as they can demonstrate that they are absorbing an equal amount from the atmosphere. (This too has issues: Point: “are we just setting the problem aside for later?” Counter-point: “Not really, we’re offsetting the rampant deforestation ongoing in the tropics, so we’re really putting things back into balance”. and so on).

Pro: There is oversight from independent parties that double check companies reported CO2 sequestration figures (i.e. does the forest they planted soak up as much CO2 as they say it does).

Pro: If successful, this could provide a template for a bigger, national program.

Con: It’s fairly well-known among economists and people who study this sort of thing that cap-and-trade systems are not an ideal solution. Carbon taxes are much more effective and would not only reduce the incentive to release CO2 but spur innovation into new technologies. For more info, go here. This is only one point, but it’s worth at least three of the above ‘Pros’, because it’s a big point.

Con: It is only a state-based system, so such legislation might just make companies uproot and move elsewhere where there are not CO2 regulations. This is damaging because 1) it neutralizes the legislation and 2) it hurts the California economy. This is argument for federal legislation.

Con: If it fails, it’s another brick in the wall against CO2 legislation.

Con: Companies are savvy, there exists the very real possibility of exploiting loopholes or creating fraudulent carbon credits.

I’m not an expert, I’m no economist, and I know almost nothing about public policy. However, I am (in training to be) an ecologist. So I cannot debate the finer points of whether this is or isn’t the right program, but I do know that it’s necessary. Therefore, whether is be ideal or no, I support California’s steps in the right direction. Even if it doesn’t work, we’ll have learned how to make it better.

The question is: Who’s next? Will anyone follow? Can/will/should the federal government step up?

Loading Packages and Functions Automatically in R

For a long time, I was wondering how to get R to automatically load packages I use every time I open the program. For example, for a variety of reasons, I use the ‘Cairo’ package to save all my figures as PDFs. On macs, the Cairo output does funky things, like making all fonts in the figures bold and italics, so I also have to run a command resetting the fonts in Cairo to look normal. This is fine to do once, but when you do it every time you open R, it gets to be a pain (especially when you open and close R multiple times a day). Other people I know have custom functions saved as R scripts that want to load automatically. So the question was: How do you get R to do this?

The short answer: Set up an R profile. Good luck figuring this out, it took me a while. However, eventually I did manage to do it and it’s much easier than any of the help files make it sound. I only know how to do this for a mac, by the way, so if you run Windows (gross) or Linux (sweet), this article might help you in principle but not specifics.

The first thing you need to know is that you should make a text document. Call it whatever you want, I call mine ‘Rprofile.txt’. Put it in your home directory (the directory that houses the ‘Applications’, ‘Documents’, etc. folders). You can tell R to do things on startup and exit, which can be handy for automatically saving history to designated file. To do this, you need two functions called .First and .Last, which do exactly what they sound like: .First is a function of commands you want run on startup, .Last is a function of commands to run on exit. My profile looks like this:

.First <- function(){
library(Cairo)
CairoFonts("Arial:style=Regular","Arial:style=Bold","Arial:style=Italic","Helvetica","Symbol")
cat("\nWelcome at", date(), "\nLoaded Cairo Package With Correct Fonts\n")

}

.Last <- function(){
 cat("\nGoodbye at ", date(), "\n")
}

Let’s go line by line:

1) .First <- function() { – this line says “make a function called .First”. Once you’re done, you fill the function with commands you want R to run on startup. Any command within the {} will then be run on startup.

2) library(Cairo) – load the library Cairo.

3) CairoFonts(…) – Fix the Cairo fonts

4) cat(…) – Tell me hello and remind me what I’ve loaded.

5) } – end the function

If you want R to load scripts, include a source(‘path_to_script’) line inside the .First function.

The same goes for .Last: make a function called .Last, insert whatever commands you want R to run on exit, and end the function.

So now you have an Rprofile text file of all the things you want R to do. Now, you have to get R to find it. R automatically searches your Home directory for a file called ‘.Rprofile’ (it’s hidden, so you won’t see it unless you enable viewing hidden files). We use the terminal to convert Rprofile.txt to .Rprofile.

Open up the terminal application and type ‘cp Rprofile.txt .Rprofile’. It’s simple: cp is ‘copy’, so your command is ‘copy Rprofile.txt to .Rprofile’.

Done! It’s as simple as that. Note that you can actually keep the Rprofile.txt file anywhere. As long as you are within your home directory in terminal, all you need to change is ‘cp path_to_file/Rprofile.txt .Rprofile’, where path_to_file is the file path to the txt file (e.g. if I stored it in my Documents folder I’d type cp Documents/Rprofile.txt .Rprofile).

I hope this saves you some time. I also won’t be held responsible if you get access to your friend’s mac and make R leave them strange messages whenever it starts up.