R vs Python Speed Comparison for Bootstrapping

I’m very interested in Python, mostly because it appears to be wickedly fast. The downside is that I don’t know it nearly as well as R, so any speed gain in computation time is more than offset by Google time spent figuring out how to do what I want. Still, I’m curious how much faster Python is for things like simulating data (because I have a particular interest in neutral models and data simulation). So I faked a linear regression and bootstrapped the confidence interval on the slope. Since I’m only interested in comparing bootstrapping speed, that’s all I’m going to show.

First, with R:

set.seed(101)

# generate data
x <- 0:100
y <- 2*x + rnorm(101, 0, 10)

# plot data
plot(x, y)

# run the regression
mod1 <- lm(y ~ x)
yHat <- fitted(mod1)

# get the residuals
errors <- resid(mod1)

# make a bootstrapping function
boot <- function(n = 10000){
  b1 <- numeric(n)
  b1[1] <- coef(mod1)[2]

  for(i in 2:n){
    residBoot <- sample(errors, replace=F)
    yBoot <- yHat + residBoot
    modBoot <- lm(yBoot ~ x)
    b1[i] <- coef(modBoot)[2]
  }

  return(b1)
}

# Run the bootstrapping function
system.time( bootB1 <- boot() )
mean(bootB1)

On my system, it takes about 10.5 seconds to run this bootstrap.

In Python:

import numpy as np
import pylab as py
import statsmodels.api as sm
import time

# Create the data
x = np.arange(0, 101)
y = 2*x + np.random.normal(0, 10, 101)

# Add the column of ones for the intercept
X = sm.add_constant(x, prepend=False)

# Plot the data
py.clf()
py.plot(x, y, 'bo', markersize=10)

# Define the OLS models
mod1 = sm.OLS(y, X)

# Fit the OLS model
results = mod1.fit()

# Get the fitted values
yHat = results.fittedvalues

# Get the residuals
resid = results.resid

# Set bootstrap size
n = 10000

t0 = time.time()
# Save the slope
b1 = np.zeros(n)
b1[0] = results.params[0]

for i in np.arange(1, n):
    residBoot = np.random.permutation(resid)
    yBoot = yHat + residBoot
    modBoot = sm.OLS(yBoot, X)
    resultsBoot = modBoot.fit()
    b1[i] = resultsBoot.params[0]

t1 = time.time()

elapsedTime = t1 - t0
print(elapsedTime)

np.mean(b1)
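The post stops at the mean of the bootstrap slopes, but the confidence interval itself is one np.percentile call away. A minimal sketch (the b1 below is a simulated stand-in for the bootstrap slopes, with a made-up spread, so the snippet is self-contained; with the post’s code you would use the b1 array above directly):

```python
import numpy as np

rng = np.random.default_rng(101)
# stand-in for the bootstrap slopes; with the post's code, use the b1 array above
b1 = rng.normal(2.0, 0.034, 10000)

# 95% bootstrap percentile confidence interval on the slope
lo, hi = np.percentile(b1, [2.5, 97.5])
```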

On my system, the elapsed time is about 4.7 seconds. That’s a substantial increase in speed. I can only assume that for larger simulations, the time difference would be greater. Thus, it seems like Python may be the way to go for permutational analyses, especially when data are simulated.

So what have I learned about Python so far?

PROS:

  • Super fast
  • Language is ‘relatively easy’ to learn
  • Has some built in stats capabilities
  • Has great symbolic math capabilities

CONS:

  • Language is different enough that learning is going to be slow
  • Statistical capabilities are sparse, and R is an easy statistical language (so far)

Overall, if Python had good stats capabilities, I’d probably switch altogether. As it is, I’m considering dropping R for things like modeling and simulations just because Python is so much faster. I’m also reading Numerical Ecology right now, and I’m considering sitting down and putting together the beginnings of a multivariate ecology stats module, but I bet that’s more work than I have time for. It’d be pretty cool though.


44 thoughts on “R vs Python Speed Comparison for Bootstrapping”

    • Spend it on Google trying to figure out Python, probably. I also have much more complex models that take a day and a half to run in R; I may port those to Python to test the speed difference on something that really does eat up a considerable amount of time.

      • But why not spend a similar amount of time trying to make them fast in R? Or trying to figure out a better algorithm?

        Especially for this specific example, you have two choices:

        (1) invest a little time in learning more R and make your code much faster
        (2) invest a lot of time learning another programming language and make your code much faster.

        It’s hard to give general advice about performance optimisation because so much depends on what you already know, but in general if you know some R already and don’t know (much) python, you’d be better off optimising in R.

      • Agreed. I think I said somewhere in the post that the time saved by running the code in Python is way overshadowed by the time it took me to figure out how to do it in Python in the first place (either R’s documentation is better than Python’s or I just have more experience reading it). If not, I’ll say it here for the record: it took me more than six seconds to figure out how to run a linear model in Python. I mention in the cons for Python that its language for stats is pretty poor, and I was implicitly referring to model specification (to reference your previous comment). There are ‘formula-type’ interfaces for Python, but… I haven’t been overly impressed so far. Python is still lightyears behind R in terms of statistical capabilities (although I’m curious to see if they develop).

    • This brings up an interesting point, and echoes a conversation I once had with someone way smarter than me about a previous post. In that post, I compared Python and R’s optimization times and found that Python was an order of magnitude faster (40 seconds in R vs. 4 seconds in Python). I was told that there are a variety of tricks one can use, provided you know them, to speed up computations in R. However, he also mentioned that the vast majority of applied users (like myself) don’t know those tricks (nor the difference between high- and low-level functions), so the comparison is actually quite interesting because it’s a comparison of what the average end-user experiences. So, while it is true that I might be comparing apples to… pythons?.. in terms of model-fitting functions for bootstrapping, I believe the comparison is valid because these are the standard steps the average useR might go through to bootstrap in both languages. For example, I more or less used the code from John Fox’s section on bootstrapping regressions in Applied Regression.

      Your comment also brings up a second interesting point. I’ve not tried it yet, but your results suggest that the model fitting is the speed bottleneck in the for() loop. I may come up with a second example of resampling or some other iterative process that doesn’t rely on model fitting and test the speed difference there.
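      For what it’s worth, a resampling routine with no model fitting at all, e.g. a two-sample permutation test on a difference in means, is only a few lines of Python (the data here are made up just to have something to permute):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)   # group A
b = rng.normal(0.5, 1.0, 50)   # group B, shifted mean

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

n_perm = 10000
diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)                    # relabel groups at random
    diffs[i] = shuffled[50:].mean() - shuffled[:50].mean()

# two-sided p-value: how often a random relabeling beats the observed gap
p = (np.abs(diffs) >= abs(observed)).mean()
```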

      • I agree with your remarks in general, but it’s particularly problematic in this case because you compare a high-level R function designed to be robust against bad inputs to a low-level python function designed to be fast. To be fair to R, you also need to average in the cost of the one time where you provide numerically unstable input and R complains nicely but python fails silently and you spend two hours trying to track down the problem ;)

        For more complicated models, you’ll also need to spend more time writing the model specification in python compared to R (although there are python libraries that provide an interface much like R’s formula interface)

      • It should be noted that statsmodels does have a high-level interface (statsmodels.formula.api) that uses pandas and formulas. This would have provided for a fairer comparison of the higher-level regression tools. As someone who does analyses in both languages I don’t buy the argument that Python users wouldn’t use an analogous high-level interface.

      • I looked into the statsmodels.formula.api method initially; it was my first choice. But it seemed as though the data (X and Y) had to be in a pandas dataframe for the formula to work; it wouldn’t work with X and Y as their own objects. And since I did this last night at 11:30, I didn’t bother to mess around with figuring out how to get X and Y into the right format (I know almost nothing about pandas). After all, I only had six seconds to spare…

  1. Following up on Hadley’s comment, if I use the fastLmPure function from RcppArmadillo instead of lm, my runtime for the bootstrap drops from ~21s to slightly less than 1s on my machine (lm.fit is ~2s, giving the same ratio as Hadley got.)

  2. In R, when I modify your boot function (terrible name BTW, given the boot package and boot command that come with R) I get it down to under 2s run time. All I do is replace lm() with lm.fit() and the appropriate required changes. You need to make x a matrix and note the different return structure.

  3. Hi,

    Nice post, but I think the R code could be improved by using lm.fit.
    The formula interface of the lm function in R is known to be slow; it’s better to use lm.fit for simulation purposes, and that’s what people usually do for long simulations.

    Let’s compare these two interfaces, for example:

    ## create a matrix to feed lm.fit (just as in python)
    X <- cbind(1, x)

    ### simulation using lm.fit
    boot1 <- function(n = 10000){
      b1 <- numeric(n)
      b1[1] <- coef(mod1)[2]

      for(i in 2:n) {
        residBoot <- sample(errors, replace = TRUE)
        yBoot <- yHat + residBoot
        modBoot <- lm.fit(X, yBoot)
        b1[i] <- modBoot$coefficients[2]
      }

      return(b1)
    }

    ### simulation using lm.formula
    boot2 <- function(n = 10000){
      b1 <- numeric(n)
      b1[1] <- coef(mod1)[2]

      for(i in 2:n) {
        residBoot <- sample(errors, replace = TRUE)
        yBoot <- yHat + residBoot
        modBoot <- lm(yBoot ~ x)
        b1[i] <- coef(modBoot)[2]
      }

      return(b1)
    }

    set.seed(1)
    bootB1 <- boot1()
    set.seed(1)
    bootB2 <- boot2()
    all.equal(bootB1, bootB2)
    [1] TRUE

    require(rbenchmark)
    benchmark(boot1(),
              boot2(),
              columns = c("test", "replications", "elapsed", "relative"),
              order = "elapsed",
              replications = 100)

    test replications elapsed relative
    1 boot1() 100 259.3 1.000
    2 boot2() 100 3459.8 13.343

    That’s a 13-fold speedup, so it should be much faster than the Python solution.

  4. Totally agree that the typical user experience is what matters in benchmarks, but two quick things which might be useful to you.

    1 – R and python are both inherently slow languages because they are what is called “interpreted”. In general, you shouldn’t expect a meaningful speed difference between the two, they are two of the slowest languages out there and one is not preferred over the other for speed.

    2 – Because R and python are both so slow, a lot of their functionality is actually coded in C. That is, at some point during a call to an R/python function, that function calls a C function, and once it does, the code is super fast. The design principle for R and python is that 90% of typical program execution time is spent on a small portion of the program (in your case lm.fit), so that portion is coded in a low-level language, C, which is hard to write, while the rest is in a high-level language, R or python, which is easy to write, naturally slow, but able to “talk” to C code that is super fast for the difficult bits of the program. If you want fast code in either language, make sure the function/library you use is calling C code (as numpy and lm.fit do). That your high-level language calls C is what matters, not which high-level language you use.
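    To make that concrete, here is a toy Python benchmark (timings will vary by machine): summing a large array element by element in the interpreter versus with a single numpy call that drops straight into compiled C.

```python
import time
import numpy as np

x = np.random.normal(size=1_000_000)

# pure-Python loop: interpreter overhead on every element
t0 = time.time()
total_loop = 0.0
for v in x:
    total_loop += v
t_loop = time.time() - t0

# numpy: one call into compiled C code
t0 = time.time()
total_vec = x.sum()
t_vec = time.time() - t0
```

    On a typical machine the numpy call is orders of magnitude faster, even though both lines “are Python”.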

      • Yes, C is the devil and confuses everyone (hence the popularity of more readable languages like R/Python). In general, if you look at the R code, you can see if it calls a C/fortran function because it will include a call to the method .C or .Fortran or .External (and one other that starts with a “.” that I am forgetting).

        Many people who do computationally intensive computing in C or python eventually learn to write C extensions for these languages, or small bits of code to handle the difficult parts, which may be worth learning if you find it’s slow a lot. However, you can get a lot faster code without doing this. Some quick performance suggestions.

        1 – In R/Matlab, never use a for loop for anything you can avoid using it for. The fast way is to “vectorize” your code and use the apply/sapply functions instead of a for loop (create a function that does the computation within the loop body, and call it with apply instead). For loops in R are generally slow, and you should see significant improvements when you do this.

        2 – Profile! It’s often something simple that is slow or you can streamline an R function.

        3 – Google for packages that do what you need to do in C code, usually a library built for speed will do this (one example is when doing stepwise regression to avoid recalculating the entire decomposition when a single variable is removed/added).

      • Thanks! I knew about avoiding for() loops. I’ve tried vectorizing, but it honestly just confuses me (the documentation isn’t very helpful). Maybe they’re worth looking into.

        In Python, the Cython module lets you compile the slowest part of your code down to C (I think). It seems to work well.

  5. Here’s the vectorized version. Just replace your boot function with this:

    Boot2 <- function(){
      residBoot <- sample(errors, replace=F)
      yBoot <- yHat + residBoot
      modBoot <- lm(yBoot ~ x)
      coef(modBoot)[2]
    }

    system.time(bootB2 system.time( bootB1 <- boot() )
    user system elapsed
    18.593 0.318 19.013

    # Vectorized time
    system.time(bootB2 library(plyr)
    > system.time(bootB3 <- raply(1e4, Boot2()))
    user system elapsed
    19.007 0.296 19.304

    Can I ask why you pull the 2nd coefficient only?

      • That’s what I thought. Also, it looks like I made a mistake entering my comment above since the middle part got cut out, leaving incorrect code. The standard vectorized way is:

        system.time(bootB2 <- replicate(1e4, Boot2()))

        A better way (IMHO) than R's built in apply functions is to use the plyr package from @hadleywickham above. Here it is:

        library(plyr)
        system.time(bootB3 <- raply(1e4, Boot2()))

        The point I was making is that vectorizing does not increase the speed in this case. But it uses only 6 lines of code to write the Boot2() function versus 13 lines for the boot function you defined. I believe most find the vectorized version easier to read and debug.

  6. I used to think that you needed to choose a language and be proficient in it, but in my experience you really need both (Python and R). I have found it counterproductive to spend time trying to make things work in one language when you have good capabilities in the other.

  7. A nice way to get a higher-level abstraction than C (vectors instead of pointers, etc.) is to use RcppArmadillo (with cxxfunction). Armadillo is a high-performance linear algebra package for C++ with lots of fancy features, and RcppArmadillo is a really clean way to interface with it from R, with nice type conversions etc.

  8. I agree with MikeM that you should use both languages, though I’d emphasize that they address different tasks. In my mind, Python makes a lot of sense for the very front end (acquiring your data, cleaning it up, etc.) and for the very tail end (creating a production pipeline for straightforward statistical processing), but R is for the heavy lifting: exploring the data, testing various methods and models, visualizing the data, testing the results of your Python, etc.

    You may have to do a LOT of Python programming to match an R package whose techniques are a great fit for your problem, though.

  9. Last comments, and an answer that I think is the fastest so far:

    1 – Vectorization is not always fastest; profile first! Wrap your code like this:

    Rprof("test.txt")
    # Run the bootstrapping function
    system.time( bootB1 <- boot() )
    mean(bootB1)
    Rprof(NULL)
    summaryRprof("test.txt")

    This tells you exactly what takes so long, and it should be the first step whenever anything is slow. Glancing over the results, you can see that some people’s routines are obviously going faster by sampling with replacement.

    2 – lm.fit is fine, but again, if you REALLY care about speed, just profile and cut the cruft. The code below runs in 0.25 seconds on my machine versus 10.0 for the original; note that the .Call here immediately invokes the C function that does the matrix solving and avoids all the R code in lm.fit. This could still be made a lot faster by just solving the linear equations rather than calling the lm.fit C function, which creates an R object with slots, etc.

    # generate data
    x <- 0:100
    X<-cbind(1,x)
    y <- 2*x + rnorm(101, 0, 10)

    # run the regression
    mod1 <- lm(y ~ x)
    yHat <- fitted(mod1)
    errors <- resid(mod1)

    v <- function()
    {
      residBoot <- sample(errors, replace = TRUE)
      yBoot <- yHat + residBoot
      tol <- 1e-07
      z <- .Call(stats:::C_Cdqrls, X, yBoot, tol)
      return(z$coefficients[2])
    }
    system.time( res <- replicate(10000, v()) )
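    The same idea carries over to the Python version of the bootstrap: factor X once outside the loop, and each iteration collapses to a matrix-vector product with no model object built at all. A sketch (variable names mirror the post’s; this swaps statsmodels for a precomputed pseudoinverse):

```python
import numpy as np

rng = np.random.default_rng(101)
x = np.arange(101)
X = np.column_stack([np.ones_like(x), x])
y = 2 * x + rng.normal(0, 10, 101)

pinv = np.linalg.pinv(X)              # factor X once, outside the loop
beta = pinv @ y
yHat = X @ beta
errors = y - yHat

n = 10000
b1 = np.empty(n)
for i in range(n):
    yBoot = yHat + rng.choice(errors, size=errors.size, replace=True)
    b1[i] = (pinv @ yBoot)[1]         # slope only; no OLS object created
```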

  10. I think your post illuminates the conceptual design and usefulness of R. It’s a feature, not a bug, as they say.

    The call to

    modBoot <- lm(yBoot ~ x)

    was done within an n=10000 for-loop using a high-level built-in function. It was probably the slowest way to do it.

    But you know what? It was also a very natural and intuitive way to do it. Your residuals and coefficients were all easy to snag and a breeze to find. So you paid a little price for this (6 secs, I think it was), not too bad a price. As n or p grows, the need for slicker code will grow, and you will naturally start to implement the approaches some of the above posters noted (lm.fit, matrix algebra, eventually direct calls to C). But with n=10000 and p=1, your example is the proper way to do it.

    R makes things easy so you can focus on the work, but it allows computationally expensive work to be pushed down to low-level platforms if you care to do it. Most of the time it just won't be worth it. R is a spectacular middle ground whose ease will often outweigh any speed issues it has. The proof is in the pudding: its popularity is not growing by leaps and bounds for nothing.

    Off topic: I think I will build a neural net model in C; it will run 17 times faster than the nnet package in R. I think I will have it done by December.

    Your post deserves a follow-up.

    • If I could +1 this comment somehow, I would.

      Also, you and another commenter hit on one of my cons that I listed about Python, namely that its statistical language just isn’t as clean as R’s is. R is incredibly simple once you get the hang of it.

      What would you suggest for a follow up? I’m not doing anything in C. That language is a mystery to me.

      • The post would be titled, “Why my first post comparing speeds of Python and R misses the point..”

        Just kidding, but the reason you use R is the packages and the usability. That’s it. 10,000 people, all experts in their field, have written extensive code (sometimes over years of their lives), maintain it, document it well, and keep it backwards compatible, all of which is then vetted and ready for download on a whim. What would you pay for this service?
        Oh yes: it’s free.

        Wanna parse genome data? Aisle 6
        Wanna do some obscure time-series function? Lucky you, the time series department is running a sale.
        Ready to plot in under 5 minutes? R hands down the best here.

        The knowledge base in the totality of R packages is incalculable. Plus the built-ins are great for statistical computing. This is evidenced by the recent growth in Python packages, though they tend to be less meticulously maintained (generally).

        Anyhow, here’s a little PDF I found helpful that gives some insight into vectorizing your code: http://quanttrader.info/public/FasterRCode.pdf
        Note example #2 at the end.

      • I completely agree with your point (it’s why I use R in the first place).

        I can’t help but wonder, though… This post is syndicated on R-bloggers, so it’s naturally visited by R advocates. However, there are people who use Python exclusively for data analysis (especially in bioinformatics and genomics) and everything else. I wonder what they might say? I honestly have no idea. It would be an interesting debate to watch. As it is, because this site only goes out to R-bloggers, the debate is rather one-sided…

    • This is a key point. An old programming proverb says, “First get it right, then make it fast.” Good programming languages allow you to do this.

      R makes it trivial to fit the regression interactively, then try it in a short loop to see if it’s working as expected, then do the full thing in a straightforward manner, and then, once it’s arguably correct, start doing things to make it faster. It’s not that Python can’t do this at all, but the first several steps, some of which we’ve not even mentioned, are considerably easier in R, and if you’d decided to go another direction R would have offered you many more options to do so.

  11. At the risk of devolving into contorted code vectorization, one could move all the bootstraps into a single lm.fit call, as below. This runs in 1/70th the time of the original on my machine. See https://gist.github.com/shabbychef/6296955

    set.seed(101)
    # generate data
    x <- 0:100
    y <- 2*x + rnorm(101, 0, 10)

    X<-cbind(1,x)

    # run the regression
    mod1 <- lm(y ~ x)
    yHat <- fitted(mod1)
    errors <- resid(mod1)

    boot4 <- function(n = 10000)
    {
      residBoot <- matrix(sample(errors, size = n*length(errors), replace = TRUE),
                          ncol = n)
      yBoot <- yHat + residBoot
      modBoot <- lm.fit(X, yBoot)
      return(modBoot$coefficients[2,])
    }
    system.time( bootB4 <- boot4() )
    mean(bootB4)
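    numpy can play the same trick: np.linalg.lstsq accepts a matrix right-hand side, so all 10,000 bootstrap responses can be solved in one call. A sketch mirroring the setup above (data regenerated so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(101)
x = np.arange(101)
y = 2 * x + rng.normal(0, 10, 101)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yHat = X @ beta
errors = y - yHat

n = 10000
# draw every bootstrap residual vector at once: shape (101, n)
residBoot = rng.choice(errors, size=(101, n), replace=True)
yBoot = yHat[:, None] + residBoot

# one least-squares solve with a matrix right-hand side
coefs, *_ = np.linalg.lstsq(X, yBoot, rcond=None)
slopes = coefs[1]                     # second row holds the bootstrap slopes
```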

  12. I am a stats PhD student doing research in bioinformatics. I attempted migrating to Python a year ago. Prototyping algorithms in Python is much more efficient and productive than in R. However, R is still the dominant language in statistics: it’s written by statisticians and used by tons more. When I want to compare methods, I have no choice but to do it in R.
    The biggest problem with R is that it’s a black box. Even though it is open source, between the poor documentation, weird performance tweaks, and old Fortran code, it is very hard for a user to know exactly what happened.

    My suggestion is not to be tied to one language; use whichever is easiest for your task. When you truly care about performance and reliability in scientific computation, as my mentor has said more than once: write your own code and use as few packages as possible.

  13. Pingback: R vs Python: Practical Data Analysis (Nonlinear Regression) | Climate Change Ecology

  14. A note for those who were confused, as I was, about why the sample was not a subset: this form of bootstrap, where you fit the model, resample the residuals of the entire sample, and add them back to the fitted values, is sometimes called “resampling residuals”.

  15. Next year R will integrate the abilities of pqR: http://radfordneal.wordpress.com/2013/06/30/how-pqr-makes-programs-faster-by-not-doing-things/
    Anyway, I don’t think it’s sane to say one language is better because of its speed. I personally prefer R because of its very intuitive programming language, because of certain great abilities (graphics, Shiny, knitr… the list is long), and because of the gigantic number of packages. But it is not forbidden to use both languages; both R and Python are excellent. If I really needed speed, I would switch to Julia (or simply call Julia from R for the task that is too slow).

  16. I’ve been thinking about switching to Python for Markov Chain Monte Carlo applications. More evidence that this makes sense.

    • It would probably best R there. You’re probably doing different things than I am, but Python has PyMC, which is an MCMC sampling kit designed for Bayesian inference (like JAGS or Stan). My issue with PyMC is that it can only run one chain at a time, whereas OpenBUGS, JAGS, or Stan can run several (one on each core, probably, but I haven’t done this yet). However, OpenBUGS, JAGS, and Stan are primarily used for (and limited to?) Bayesian model inference.

      If you’re doing something else, Python would probably be faster, though there are tricks in R to speed it up (and tricks in Python to speed that up too; this comments section is R-centric).

      • It’s possible to call Stan from R, which is a positive for R. I think Andrew Gelman is trying to get it callable from Python as well. Nothing I’m doing is urgent enough that I can’t wait until it’s available in Python.

  17. Pingback: Python and R: Is Python really faster than R? | Patient 2 Earn

  18. Halving the run time of such an algorithm, after discounting start-up and clean-up, does not seem like much to me. But an interesting experiment anyway.
