R vs Python: Practical Data Analysis (Nonlinear Regression)

I’ve written a few previous posts comparing R to Python in terms of symbolic math, optimization, and bootstrapping. All of these posts were pretty popular. The last one especially. Many of the commenters brought up the fact that R, while maybe not as fast (although that too is debatable) is much better for data analysis because of the huge number of libraries, tests, and its syntactical advantages (i.e. using formulas). For the record, I agree with all of these comments. If you don’t, go look at the Journal of Statistical Software and search for R and then Python. I will never (probably) give up R for Python because it just works too well. However, I was intrigued.

I set out last week to answer a single question: Can I use Python to replicate the data analysis of my paper “Temperature-induced mismatches between consumption and metabolism reduce consumer fitness“? If so, just how hard is it? I chose this paper because the analyses were, I thought, pretty simple: A bunch of nonlinear regressions using AIC model comparisons and some linear mixed effects models.

I found out the answer in about 30 seconds: Nope. Python doesn’t have a mixed-effects models module (there’s some code in the statsmodels module but its not finished). OK, so that was that. But, I kept going. I changed the question to: Can Python do nonlinear regression part of my paper?

The answer? Yes, and no. It can do least-squares curve fitting, but it only provides you with parameter estimates and, if you use curve_fit( ), a covariance matrix for the parameters. I needed things like AIC (which it didn’t provide), the log-likelihood (also not provided), t-values, p-values, etc. (also not provided). You can program these things yourself, if you know how to calculate them and code it.

After three evenings and many, many hours of searching, I realized that I couldn’t get Python to give me covariance matrices for maximum likelihood models (which I desperately wanted), I could only get them from least squares using the leastsq function. So I decided to work with that. Still, three days to figure that out and trying to make it work. (side note: I think it’s entirely possible to get covariance matrices for parameters from maximum likelihood models, but its above what I know how to do. You need the Hessian, which isn’t always returned from Python’s minimize function, depending on the method used).

So, the trick then is to figure out how to get the information I need from the least squares procedure to calculate the p-values, t-values, AIC, etc. Now, for the paper I ran twelve models or so, so it seems like making a class to run these models would be easier than doing the calculations by hand for every model. Two more days of searching and trial-and-error, and I came up with this:

class NLS:
    ''' This provides a wrapper for scipy.optimize.leastsq to get the relevant output for nonlinear least squares. Although scipy provides curve_fit for that reason, curve_fit only returns parameter estimates and covariances. This wrapper returns numerous statistics and diagnostics'''

    import numpy as np
    import scipy.stats as spst
    from scipy.optimize import leastsq

    def __init__(self, func, p0, xdata, ydata):
        # Check the data
        if len(xdata) != len(ydata):
            msg = 'The number of observations does not match the number of rows for the predictors'
            raise ValueError(msg)

        # Check parameter estimates
        if type(p0) != dict:
            msg = "Initial parameter estimates (p0) must be a dictionry of form p0={'a':1, 'b':2, etc}"
            raise ValueError(msg)

        self.func = func
        self.inits = p0.values()
        self.xdata = xdata
        self.ydata = ydata
        self.nobs = len( ydata )
        self.nparm= len( self.inits )

        self.parmNames = p0.keys()

        for i in range( len(self.parmNames) ):
            if len(self.parmNames[i]) > 5:
                self.parmNames[i] = self.parmNames[i][0:4]

        # Run the model
        self.mod1 = leastsq(self.func, self.inits, args = (self.xdata, self.ydata), full_output=1)

        # Get the parameters
        self.parmEsts = np.round( self.mod1[0], 4 )

        # Get the Error variance and standard deviation
        self.RSS = np.sum( self.mod1[2]['fvec']**2 )
        self.df = self.nobs - self.nparm
        self.MSE = self.RSS / self.df
        self.RMSE = np.sqrt( self.MSE )

        # Get the covariance matrix
        self.cov = self.MSE * self.mod1[1]

        # Get parameter standard errors
        self.parmSE = np.diag( np.sqrt( self.cov ) )

        # Calculate the t-values
        self.tvals = self.parmEsts/self.parmSE

        # Get p-values
        self.pvals = (1 - spst.t.cdf( np.abs(self.tvals), self.df))*2

        # Get biased variance (MLE) and calculate log-likehood
        self.s2b = self.RSS / self.nobs
        self.logLik = -self.nobs/2 * np.log(2*np.pi) - self.nobs/2 * np.log(self.s2b) - 1/(2*self.s2b) * self.RSS

        del(self.mod1)
        del(self.s2b)
        del(self.inits)

    # Get AIC. Add 1 to the df to account for estimation of standard error
    def AIC(self, k=2):
    return -2*self.logLik + k*(self.nparm + 1)

    del(np)
    del(leastsq)

    # Print the summary
    def summary(self):
        print
        print 'Non-linear least squares'
        print 'Model: ' + self.func.func_name
        print 'Parameters:'
        print " Estimate Std. Error t-value P(>|t|)"
        for i in range( len(self.parmNames) ):
                print "% -5s % 5.4f % 5.4f % 5.4f % 5.4f" % tuple( [self.parmNames[i], self.parmEsts[i], self.parmSE[i], self.tvals[i], self.pvals[i]] )
        print
        print 'Residual Standard Error: % 5.4f' % self.RMSE
        print 'Df: %i' % self.df

Now, after that (a week), I can run my NLS model and get the output I need. Here’s the sample code:

import pandas as pd

respData = pd.read_csv(my datafile here)
respData['respDaily'] = respData['C_Resp_Mass'] * 24

# Create the Arrhenius temperature

respData['Ar'] = -1 / (8.617 * 10**-5 * (respData['Temp']+273))

# First, define the likelihood null model
def nullMod(params, mass, yObs):
    a = params[0]
    c = params[1]

    yHat = a*mass**c
    err = yObs - yHat
    return(err)

p0 = {'a':1, 'b':1}

tMod = NLS(nullMod, p0, respData['UrchinMass'], respData['respDaily'] )

tMod.summary()

tMod.AIC()

tMod.logLik

The above NLS code is the first that I know if in Python to do statistical NLS (not just curve fitting) and get the output I needed. But, it wasn’t easy, it took me about a week of my off (and on) hours. The following R code to do the same takes me maybe 15 minutes:

# Import the data
respiration <- read.csv( my file)
# Standardize to 24 h
respiration$C_Resp_Day <- respiration$C_Resp_Mass*24
# Create the Arrhenius temperature
respiration$Ar<--1/(8.617*10^-5*(respiration$Temp+273))

Null.mod <- nls(C_Resp_Day~a*UrchinMass^c, data=respiration, start=list(a=exp(6),c=0.25))

summary(Null.mod)
AIC(Null.mod)
logLik(Null.mod)

R is just so much easier for real data analysis and has more functions. Python can do these things, but the modules are scattered (there’s at least three separate modules to fit curves that people have written to do different things) and don’t always give the needed output. My NLS class above is, more or less, a replica of R’s NLS output and the missing pieces can be easily added.

R is currently head-and-shoulders above Python for data analysis, but I remain convinced that Python CAN catch up, easily and quickly. It is entirely possible to do your analysis in Python if you want to spend the time coding the analyses yourself. I may do this sometime, if only because it really makes me learn statistics really well. But by and large, R is the way to go (for now). I would not be surprised that, if 10 years or so, Python’s data analyses libraries coalesce and evolve into something that can rival R.

For those of you interested in trying this out, the data can be downloaded here.

About these ads

5 thoughts on “R vs Python: Practical Data Analysis (Nonlinear Regression)

  1. What is the best tool? Depends on what you want to do. Python is a very good overall programming language, working nicely in most fields of computing. R is not. Have you ever tried programming Pacman or Tetris in R? No. If you want to do lots of things with your computer and data analysis is amongst them, turn to Python. If you know Python and Numpy really well, there is no point in learning a different language just for some data crunching. Stick to it.
    However, in it’s special field R can not be beaten by Python. Have a look at CRAN and you’ll soon discover, that Python 2.x and 3.x is never going to have that many working statistic packages, ever! Coding everything for yourself, as has been proposed, is not only time consuming. More: It is prone to errors.

    Once you know R, use R for data analysis. Use Python or whatever for anything else and combine them with RPy whenever there is a need. Think RPy http://rpy.sourceforge.net/ .
    If speed really should become of the essence – I mean really, really important like waiting for the computer over the weekend does not do the job, don’t turn to an interpreted language: Think about a compiled language like C++, Java, Haskell, Julia, Groovy, …

  2. This post is a combination of ‘that’s cool’ and ‘so futile’. Sure, one could even write VBA code in Excel to solve this problem, but wouldn’t it be better to use that time to look at alternative models, for example? I think Bernhard’s first comment provides a good way of combining both languages if needed.

  3. Pingback: [Note] ggplot2 in Python (Rpy2) | m's Bioinformatics

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s