R vs Python: Practical Data Analysis (Nonlinear Regression)

I’ve written a few previous posts comparing R to Python in terms of symbolic math, optimization, and bootstrapping. All of these posts were pretty popular, the last one especially. Many of the commenters brought up the fact that R, while maybe not as fast (although that too is debatable), is much better for data analysis because of its huge number of libraries and tests, and because of its syntactical advantages (i.e., using formulas). For the record, I agree with all of these comments. If you don’t, go look at the Journal of Statistical Software and search for R, and then for Python. I will never (probably) give up R for Python because it just works too well. However, I was intrigued.

I set out last week to answer a single question: Can I use Python to replicate the data analysis of my paper “Temperature-induced mismatches between consumption and metabolism reduce consumer fitness”? If so, just how hard is it? I chose this paper because the analyses were, I thought, pretty simple: a bunch of nonlinear regressions using AIC model comparisons and some linear mixed-effects models.

I found out the answer in about 30 seconds: Nope. Python doesn’t have a mixed-effects models module (there’s some code in the statsmodels module, but it’s not finished). OK, so that was that. But I kept going. I changed the question: Can Python do the nonlinear regression part of my paper?

The answer? Yes and no. It can do least-squares curve fitting, but it only provides parameter estimates and, if you use curve_fit(), a covariance matrix for those parameters. I needed things like AIC (which it doesn’t provide), the log-likelihood (also not provided), t-values, p-values, etc. (also not provided). You can program these things yourself, if you know how to calculate them and how to code them.
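
To give a sense of what I mean, here is roughly what curve_fit hands back (a minimal sketch with a made-up power function and fake data; powerMod, xFake, and yFake are just illustrations, not anything from the paper):

import numpy as np
from scipy.optimize import curve_fit

# A toy power function with illustrative parameters a and c
def powerMod(x, a, c):
    return a * x**c

# Fake data, only to show the call signature
xFake = np.linspace(1, 10, 50)
yFake = 2 * xFake**0.75 + np.random.normal(0, 0.5, 50)

popt, pcov = curve_fit(powerMod, xFake, yFake, p0=[1.0, 1.0])
# popt holds the parameter estimates and pcov their covariance matrix --
# and that is essentially everything curve_fit returns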

After three evenings and many, many hours of searching, I realized that I couldn’t get Python to give me covariance matrices for maximum-likelihood models (which I desperately wanted); I could only get them from least squares using the leastsq function. So I decided to work with that. Still, it took three days to figure that out and to make it work. (Side note: I think it’s entirely possible to get covariance matrices for parameters from maximum-likelihood models, but it’s above what I know how to do. You need the Hessian, which isn’t always returned by Python’s minimize function, depending on the method used.)
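
(Another side note, and a hedged one: I believe one workable route is to minimize the negative normal log-likelihood with scipy.optimize.minimize using method='BFGS', which returns an approximate inverse Hessian that can stand in for the covariance matrix of the estimates. A rough sketch, with negLogLik, xFake, and yFake made up for illustration; the BFGS inverse Hessian is only an approximation, so treat it with caution.)

import numpy as np
from scipy.optimize import minimize

# Illustrative data from a power model
xFake = np.linspace(1, 10, 50)
yFake = 2 * xFake**0.75 + np.random.normal(0, 0.5, 50)

def negLogLik(params, x, yObs):
    # params = [a, c, logSigma]; optimizing log-sigma keeps the variance positive
    a, c, logSigma = params
    sigma = np.exp(logSigma)
    yHat = a * x**c
    n = len(yObs)
    return n/2.0 * np.log(2*np.pi) + n*logSigma + np.sum((yObs - yHat)**2) / (2*sigma**2)

res = minimize(negLogLik, [1.0, 1.0, 0.0], args=(xFake, yFake), method='BFGS')
# res.x holds the maximum likelihood estimates; res.hess_inv approximates the
# covariance matrix of those estimates (including the nuisance log-sigma term)
covMLE = res.hess_inv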

So the trick, then, was to figure out how to get the information I needed from the least-squares procedure to calculate the p-values, t-values, AIC, etc. For the paper I ran a dozen or so models, so it seemed easier to write a class to run these models than to do the calculations by hand for every one. Two more days of searching and trial-and-error, and I came up with this:

import numpy as np
import scipy.stats as spst
from scipy.optimize import leastsq

class NLS:
    ''' This provides a wrapper for scipy.optimize.leastsq to get the relevant output for nonlinear least squares. Although scipy provides curve_fit for this purpose, curve_fit only returns parameter estimates and covariances. This wrapper returns numerous statistics and diagnostics. '''

    def __init__(self, func, p0, xdata, ydata):
        # Check the data
        if len(xdata) != len(ydata):
            msg = 'The number of observations does not match the number of rows for the predictors'
            raise ValueError(msg)

        # Check parameter estimates
        if not isinstance(p0, dict):
            msg = "Initial parameter estimates (p0) must be a dictionary of form p0={'a':1, 'b':2, etc.}"
            raise ValueError(msg)

        self.func = func
        self.inits = p0.values()
        self.xdata = xdata
        self.ydata = ydata
        self.nobs = len( ydata )
        self.nparm= len( self.inits )

        self.parmNames = p0.keys()

        # Truncate long parameter names so the summary table stays aligned
        for i in range( len(self.parmNames) ):
            if len(self.parmNames[i]) > 5:
                self.parmNames[i] = self.parmNames[i][0:4]

        # Run the model
        self.mod1 = leastsq(self.func, self.inits, args = (self.xdata, self.ydata), full_output=1)

        # Get the parameters
        self.parmEsts = np.round( self.mod1[0], 4 )

        # Get the Error variance and standard deviation
        self.RSS = np.sum( self.mod1[2]['fvec']**2 )
        self.df = self.nobs - self.nparm
        self.MSE = self.RSS / self.df
        self.RMSE = np.sqrt( self.MSE )

        # Get the covariance matrix
        self.cov = self.MSE * self.mod1[1]

        # Get parameter standard errors (square roots of the covariance diagonal)
        self.parmSE = np.sqrt( np.diag( self.cov ) )

        # Calculate the t-values
        self.tvals = self.parmEsts/self.parmSE

        # Get p-values
        self.pvals = (1 - spst.t.cdf( np.abs(self.tvals), self.df))*2

        # Get the biased variance estimate (MLE) and calculate the log-likelihood
        # under the normal model: logLik = -n/2*log(2*pi) - n/2*log(s2b) - RSS/(2*s2b)
        self.s2b = self.RSS / self.nobs
        self.logLik = -self.nobs/2.0 * np.log(2*np.pi) - self.nobs/2.0 * np.log(self.s2b) - 1/(2.0*self.s2b) * self.RSS

        del(self.mod1)
        del(self.s2b)
        del(self.inits)

    # Get AIC. Add 1 to the parameter count to account for estimation of the error variance
    def AIC(self, k=2):
        return -2*self.logLik + k*(self.nparm + 1)

    # Print the summary
    def summary(self):
        print
        print 'Non-linear least squares'
        print 'Model: ' + self.func.func_name
        print 'Parameters:'
        print " Estimate Std. Error t-value P(>|t|)"
        for i in range( len(self.parmNames) ):
                print "% -5s % 5.4f % 5.4f % 5.4f % 5.4f" % tuple( [self.parmNames[i], self.parmEsts[i], self.parmSE[i], self.tvals[i], self.pvals[i]] )
        print
        print 'Residual Standard Error: % 5.4f' % self.RMSE
        print 'Df: %i' % self.df

Now, after that (a week), I can run my NLS model and get the output I need. Here’s the sample code:

import pandas as pd

respData = pd.read_csv(my datafile here)
respData['respDaily'] = respData['C_Resp_Mass'] * 24

# Create the Arrhenius temperature

respData['Ar'] = -1 / (8.617 * 10**-5 * (respData['Temp']+273))

# First, define the residual function for the null model (leastsq minimizes these residuals)
def nullMod(params, mass, yObs):
    a = params[0]
    c = params[1]

    yHat = a*mass**c
    err = yObs - yHat
    return(err)

p0 = {'a':1, 'c':1}

tMod = NLS(nullMod, p0, respData['UrchinMass'], respData['respDaily'] )

tMod.summary()

tMod.AIC()

tMod.logLik

The above NLS code is the first that I know of in Python to do statistical NLS (not just curve fitting) and give the output I needed. But it wasn’t easy; it took me about a week of my off (and on) hours. The following R code does the same thing and takes me maybe 15 minutes:

# Import the data
respiration <- read.csv( my file)
# Standardize to 24 h
respiration$C_Resp_Day <- respiration$C_Resp_Mass*24
# Create the Arrhenius temperature
respiration$Ar <- -1/(8.617*10^-5*(respiration$Temp+273))

Null.mod <- nls(C_Resp_Day~a*UrchinMass^c, data=respiration, start=list(a=exp(6),c=0.25))

summary(Null.mod)
AIC(Null.mod)
logLik(Null.mod)

R is just so much easier for real data analysis and has more functions. Python can do these things, but the modules are scattered (there are at least three separate curve-fitting modules that people have written to do different things) and they don’t always give the needed output. My NLS class above is, more or less, a replica of R’s nls() output, and the missing pieces could easily be added.
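
For example, one of those missing pieces, confidence intervals for the parameters, could be bolted on with a few lines. Here is a rough sketch that assumes the NLS class above; the confint name and the Wald-type intervals (estimate ± t * SE) are my own choices, and they won’t exactly match R’s confint() on an nls fit, which profiles the likelihood instead:

import scipy.stats as spst

def confint(nlsFit, level=0.95):
    ''' Approximate (Wald) confidence intervals for an NLS fit: estimate +/- t*SE '''
    tCrit = spst.t.ppf(1 - (1 - level)/2.0, nlsFit.df)
    lower = nlsFit.parmEsts - tCrit * nlsFit.parmSE
    upper = nlsFit.parmEsts + tCrit * nlsFit.parmSE
    return zip(nlsFit.parmNames, lower, upper)

# e.g. confint(tMod) loosely mirrors R's confint(Null.mod)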

R is currently head-and-shoulders above Python for data analysis, but I remain convinced that Python CAN catch up, easily and quickly. It is entirely possible to do your analysis in Python if you want to spend the time coding the analyses yourself. I may do this sometime, if only because it forces me to learn the statistics really well. But by and large, R is the way to go (for now). I would not be surprised if, in 10 years or so, Python’s data analysis libraries coalesce and evolve into something that can rival R.

For those of you interested in trying this out, the data can be downloaded here.

R vs Python Speed Comparison for Bootstrapping

I’m interested in Python a lot, mostly because it appears to be wickedly fast. The downside is that I don’t know it nearly as well as R, so any speed gain in computation time is more than offset by the Google time spent figuring out how to do what I want. However, I’m curious to know how much faster Python is for things like simulating data (because I have a particular interest in neutral models and data simulation). So I faked a linear regression and bootstrapped the confidence interval on the slope. Since I’m only interested in comparing speed for bootstrapping, that’s all I’m going to show.

First, with R:

set.seed(101)

# generate data
x <- 0:100
y <- 2*x + rnorm(101, 0, 10)

# plot data
plot(x, y)

# run the regression
mod1 <- lm(y ~ x)
yHat <- fitted(mod1)

# get the residuals
errors <- resid(mod1)

# make a bootstrapping function
boot <- function(n = 10000){
  b1 <- numeric(n)
  b1[1] <- coef(mod1)[2]

  for(i in 2:n){
    residBoot <- sample(errors, replace=F)
    yBoot <- yHat + residBoot
    modBoot <- lm(yBoot ~ x)
    b1[i] <- coef(modBoot)[2]
  }

  return(b1)
}

# Run the bootstrapping function
system.time( bootB1 <- boot() )
mean(bootB1)

On my system, it takes about 10.5 seconds to run this bootstrap.

In Python:

import numpy as np
import pylab as py
import statsmodels.api as sm
import time

# Create the data
x = np.arange(0, 101)
y = 2*x + np.random.normal(0, 10, 101)

# Add the column of ones for the intercept
X = sm.add_constant(x, prepend=False)

# Plot the data
py.clf()
py.plot(x, y, 'bo', markersize=10)

# Define the OLS models
mod1 = sm.OLS(y, X)

# Fit the OLS model
results = mod1.fit()

# Get the fitted values
yHat = results.fittedvalues

# Get the residuals
resid = results.resid

# Set bootstrap size
n = 10000

t0 = time.time()
# Save the slopes; the first entry is the slope from the original fit
# (with prepend=False the constant is the last column, so params[0] is the slope)
b1 = np.zeros( (n) )
b1[0] = results.params[0]

for i in np.arange(1, n):
    residBoot = np.random.permutation(resid)
    yBoot = yHat + residBoot
    modBoot = sm.OLS(yBoot, X)
    resultsBoot = modBoot.fit()
    b1[i] = resultsBoot.params[0]

t1 = time.time()

elapsedTime = t1 - t0
print elapsedTime

np.mean(b1)

On my system, the elapsed time is about 4.7 seconds. That’s a substantial increase in speed. I can only assume that for larger simulations, the time difference would be greater. Thus, it seems like Python may be the way to go for permutational analyses, especially when data are simulated.
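
As an aside, for a simple regression like this one could probably squeeze out even more speed by skipping the repeated OLS fits entirely: with a single predictor, the slope is just cov(x, y) / var(x). A hedged sketch of that shortcut, reusing x, yHat, resid, n, and results from the code above:

# Same permutation scheme, but compute the slope directly instead of refitting OLS
xCentered = x - x.mean()
denom = np.sum(xCentered**2)

b1Fast = np.zeros(n)
b1Fast[0] = results.params[0]
for i in np.arange(1, n):
    yBoot = yHat + np.random.permutation(resid)
    # slope = sum((x - xbar) * y) / sum((x - xbar)^2)
    b1Fast[i] = np.sum(xCentered * yBoot) / denom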

So what have I learned about Python so far?

PROS:

  • Super fast
  • Language is ‘relatively easy’ to learn
  • Has some built-in stats capabilities
  • Has great symbolic math capabilities

CONS:

  • Language is different enough that learning is going to be slow
  • Statistical capabilities are sparse, and R is an easy statistical language (so far)

Overall, if Python had good stats capabilities, I’d probably switch altogether. As it is, I’m considering dropping R for things like modeling and simulations just because Python is so much faster. I’m also reading Numerical Ecology right now, and I’m considering sitting down and putting together the beginnings of a multivariate ecology stats module, but I bet that’s more work than I have time for. It’d be pretty cool though.

Does Google have double-standards when it comes to privacy?

I’m not suggesting that they do; it’s merely a question. It is, however, one that I’ve wondered about ever since the story broke about the NSA searching emails. For those of you who have been living under a rock, the story is that the NSA collects emails between Americans and people living overseas, as well as emails from Americans with links to those living overseas. The NSA has openly acknowledged the first bit, but the second part was new. Of course, this controversy isn’t new.

Google was implicated in this program as an entity that allows the government unrestricted access to user data. Google fought this, saying that this was inaccurate and impossible and that their reputation has been harmed by the suggestions. I have no doubt that this is true. However, recent documents suggest that the NSA has the ability to collect the emails, browser history, and chat history of anyone in the country. Not surprisingly, people were upset by this.

Here’s where I get confused. People are upset that the NSA reads their emails and whatnot. That’s fine. However, we’ve known for some time that Google does the same thing. They admit it on their website. In fact, Google appears to be storing nearly everything you do on the internet (ironically, the linked article somewhat predicted the NSA scandal and its fallout). This is similar to what the NSA does, supposedly, yet no one seems to care. Even weirder, Google’s Street View cars were gathering information on all the wifi hotspots they encountered and keeping the data transmitted over them! It’s like having your phone wiretapped. Recently, Google made a statement that its users can’t expect privacy in their emails or searches (remember, they’re suing because their reputation was damaged by implications that the government was doing the same thing).

So I’m a bit confused here. NSA = bad: internet snooping, email scanning, and data storage. Google = … OK? Indifferent? Internet snooping, email scanning, wifi commandeering, and data storage. Now, personally, I really don’t care if the government has my internet history. It’s really boring. I visit six websites: xkcd, questionablecontent (which is a comic strip), ESPN, NPR, Cracked, and WordPress. Personally, I’m much more comfortable with the government storing my data than with Google, which actually scans and uses it (to target ads). Google appears to have much less oversight than the NSA (which had little enough) regarding how it chooses to collect and use data. I guess my question is: why are more people not upset that Google does this, given the response to the NSA scandal? The two are very similar. Is it just that people don’t know? Do they not care when it’s a company versus the government? Personally, I love Gmail, but I’d stop using it if I could. I can’t, because switching email addresses would be enormously difficult at this point. But I wish I could.

I hate rabbits

I’ve started my warming experiment, my plants are growing, life is starting to look up. Except for the rabbits. I put up a fence around my gardens to keep the rabbits out. They chewed through the fence. There’s no way of knowing just how many plants those little buggers (I had a worse name for them) have killed. I’m in the process of putting something heavy up, but until then, my garden is at the mercy of what appears to be a plague of fluffy, long-eared rodents.


Time to call in the rabbit hunter.