I’m interested in Python a lot, mostly because it appears to be wickedly fast. The downside is that I don’t know it nearly as well as R, so any speed gain in computation time is more than offset by Google time trying to figure out how to do what I want. However, I’m curious as to know how much faster Python is for things like simulating data (because I have a particular interest in neutral models and data simulation). So I faked a linear regression and bootstrapped the confidence interval on the slope. Since I’m only interested in the comparison in speed for bootstrapping, that’s all I’m going to show.

First, with R:

set.seed(101) # generate data x <- 0:100 y <- 2*x + rnorm(101, 0, 10) # plot data plot(x, y) # run the regression mod1 <- lm(y ~ x) yHat <- fitted(mod1) # get the residuals errors <- resid(mod1) # make a bootstrapping function boot <- function(n = 10000){ b1 <- numeric(n) b1[1] <- coef(mod1)[2] for(i in 2:n){ residBoot <- sample(errors, replace=F) yBoot <- yHat + residBoot modBoot <- lm(yBoot ~ x) b1[i] <- coef(modBoot)[2] } return(b1) } # Run the bootstrapping function system.time( bootB1 <- boot() ) mean(bootB1)

On my system, it takes about 10.5 seconds to run this bootstrap.

In Python:

import numpy as np import pylab as py import statsmodels.api as sm import time # Create the data x = np.arange(0, 101) y = 2*x + np.random.normal(0, 10, 101) # Add the column of ones for the intercept X = sm.add_constant(x, prepend=False) # Plot the data py.clf() py.plot(x, y, 'bo', markersize=10) # Define the OLS models mod1 = sm.OLS(y, X) # Fit the OLS model results = mod1.fit() # Get the fitted values yHat = results.fittedvalues # Get the residuals resid = results.resid # Set bootstrap size n = 10000 t0 = time.time() # Save the slope b1 = np.zeros( (n) ) b1[0] = results.params[0] for i in np.arange(1, 10000): residBoot = np.random.permutation(resid) yBoot = yHat + residBoot modBoot = sm.OLS(yBoot, X) resultsBoot = modBoot.fit() b1[i] = resultsBoot.params[0] t1 = time.time() elapsedTime = t1 - t0 print elapsedTime np.mean(b1)

On my system, the elapsed time is about 4.7 seconds. That’s a *substantial* increase in speed. I can only assume that for larger simulations, the time difference would be greater. Thus, it seems like Python may be the way to go for permutational analyses, especially when data are simulated.

So what I have learned about Python so far?

**PROS:**

- Super fast
- Language is ‘relatively easy’ to learn
- Has some built in stats capabilities
- Has great symbolic math capabilities

**CONS:**

- Language is different enough that learning is going to be slow
- Statistical capabilities are sparse, and R is an easy statistical language (so far)

Overall, if Python had good stats capabilities, I’d probably switch all together. As it is, I’m considering dropping R for things like modeling and simulations just because Python is so much faster. I’m also reading Numerical Ecology right now, and I’m considering sitting down and putting together the beginnings of a multivariate ecology stats module, but I bet that’s going to be more work than I have time for. It’d be pretty cool though.

So what will you do with your extra 6 seconds?

Spend it on google trying figure out Python, probably. I also have much more complex models that take a day and a half to run in R that I may work into Python to test the speed difference in something that really does eat up a considerable amount of time.

But why not spend a similar amount of time trying to make them fast in R? Or trying to figure out a better algorithm?

Especially for this specific example, you have two choices:

(1) invest a little time in learning more R and make your code much faster

(2) invest a lot of time learning another programming language and make your code much faster.

It’s hard to give general advice about performance optimisation because so much depends on what you already know, but in general if you know some R already and don’t know (much) python, you’d be better off optimising in R.

Agreed. I think I said somewhere in the post that the time saved by running the code in Python is way overshadowed by the time it took me to figure out how to do it in Python in the first place (either R documentation is better than Python’s or I just have more experience reading it). If not, I’ll say it here for the record: took me more than six seconds to figure out how to run a linear model in Python. I mention in the cons for Python that its language for stats is pretty poor, and I was implicitly referring to model specification (to reference your previous comment). There are ‘formula-type’ interfaces for Python, but… I haven’t been overly impressed so far. Python is still lightyears behind R in terms of statistical capabilities (although I’m curious to see if they develop).

It’s a bit of an unfair comparison because you’re comparing R’s high-level lm function with Python’s low-level sm.OLS. If you use R’s low-level equivalent, lm.fit, you’ll find R is much much faster (~10s -> ~1s) on my machine.

This brings up an interesting point, and echoes a conversation I once had with someone way smarter than me about a previous post. In that post, I compared Python and R’s optimization times and found that Python was an order of magnitude faster (40 seconds in R vs. 4 seconds in Python). I was told that there are a variety of tricks one can use, provided you know them, to speed up computations in R. However, he also mentioned that the vast majority of applied users (like myself) don’t know those tricks (nor the difference between high- and low-level functions), so then the comparison is actually quite interesting because its a comparison of what the average end-user experiences. So, while it is true that I might be comparing apples to… pythons?.. in terms of model fitting functions for bootstrapping, I believe that the comparison is actually valid because it’s the standard steps the average useR might go through to bootstrap in both languages. For example, I more or less used the code from John Fox’s section on bootstrapping regressions in Applied Regression.

Your comment also brings up a second interesting point. I’ve not tried it yet, but your results would suggest that the model-fitting is the bottleneck for speed in the for() loop. I may come up with a second example of resampling or some other iterative process than doesn’t rely on model fitting and test the speed difference there.

I agree with your remarks in general, but it’s particularly problematic in this case because you compare a high-level R function designed to be robust against bad inputs to a low-level python function designed to be fast. To be fair to R, you also need to average in the cost of the one time where you provide numerically unstable input and R complains nicely but python fails silently and you spend two hours trying to track down the problem😉

For more complicated models, you’ll also need to spend more time writing the model specification in python compared to R (although there are python libraries that provide an interface much like R’s formula interface)

It should be noted that statsmodels does have a high-level interface (statsmodels.formula.api) that uses pandas and formulas. This would have provided for a fairer comparison of the higher-level regression tools. As someone who does analyses in both languages I don’t buy the argument that Python users wouldn’t use an analogous high-level interface.

I looked into the statsmodels.formula.api method intially, it was my first choice. But it seemed as though the data (X and Y) had to be in a pandas dataframe for the formula to work, it wouldn’t work with X and Y as their own objects. And since I did this last night at 11:30, I didn’t bother to mess around with figuring out how to get X and Y into the right format (I know almost nothing about pandas). After all, I only had six seconds to spare…

Following up on Hadley’s comment, if I use the fastLmPure function from RcppArmadillo instead of lm, my runtime for the bootstrap drops from ~21s to slightly less than 1s on my machine (lm.fit is ~2s, giving the same ratio as Hadley got.)

In R, when I modify your boot function (terrible name BTW, given the boot package and boot command that come with R) I get it down to under 2s run time. All I do is replace lm() with lm.fit() and the appropriate required changes. You need to make x a matrix and note the different return structure.

Hi,

Nice post but I think that the R code could be improved by using lm.fit.

The formula interface of the lm function in R is known to be slow, is better to use lm.fit for simulation purpose and that’s what people usually do for long simulation.

Let’s compare for example these two interfaces :

## create a matrix to feed lm.fit (just as in python)

X <- cbind(1, x)

### simulation using lm.fit

boot1 <- function(n = 10000){

b1 <- numeric(n)

b1[1] <- coef(mod1)[2]

for(i in 2:n) {

residBoot <- sample(errors, replace = TRUE)

yBoot <- yHat + residBoot

modBoot <- lm.fit(X, yBoot)

b1[i] <- modBoot$coefficients[2]

}

return(b1)

}

### simulation using lm.formula

boot2 <- function(n = 10000){

b1 <- numeric(n)

b1[1] <- coef(mod1)[2]

for(i in 2:n) {

residBoot <- sample(errors, replace = TRUE)

yBoot <- yHat + residBoot

modBoot <- lm(yBoot ~ x)

b1[i] <- coef(modBoot)[2]

}

return(b1)

}

set.seed(1)

bootB1 <- boot1()

set.seed(1)

bootB2 <- boot2()

all.equal(bootB1, bootB2)

[1] TRUE

require(rbenchmark)

benchmark(boot1(),

boot2(),

columns = c("test", "replications", "elapsed", "relative"),

order = "elapsed",

replications = 100)

test replications elapsed relative

1 boot1() 100 259.3 1.000

2 boot2() 100 3459.8 13.343

We have a 13 fold increase, so it should be way more faster than the python solution

Yea, that seems to be a popular solution..

Totally agree that the typical user experience is what matters in benchmarks, but two quick things which might be useful to you.

1 – R and python are both inherently slow languages because they are what is called “interpreted”. In general, you shouldn’t expect a meaningful speed difference between the two, they are two of the slowest languages out there and one is not preferred over the other for speed.

2 – Because R and python are both so slow, a lot of their functionality is actually coded in C. That is at some point during the call to an R/python function that function calls a C function, and once they do this the code is super fast. The design principle for R and python is that 90% of typical program execution time is spent on a small portion of the program (in your case lm.fit), so this is coded in a low level language, C, that is hard to write, while the rest is in a high level language, R or python, that is easy to write, naturally slow, but able to “talk” to C code that is super fast for the difficult bits of the program. If you want fast code in either language, make sure the function/library you use is calling C code (as numpy and lm.fit do). That your high level language calls C is what matters, not which high level language you use.

Thanks! I’ve heard that C is the fastest, but to honest, the language confuses me completely. Is there a way to figure out whether the function is calling C or R/Python?

Yes, C is the devil and confuses everyone (hence the popularity of more readable languages like R/Python). In general, if you look at the R code, you can see if it calls a C/fortran function because it will include a call to the method .C or .Fortran or .External (and one other that starts with a “.” that I am forgetting).

Many people who do computationally intensive computing in C or python eventually learn to write C extensions for these languages, or small bits of code to handle the difficult parts, which may be worth learning if you find it’s slow a lot. However, you can get a lot faster code without doing this. Some quick performance suggestions.

1 – In R/Matlab, never use a for loop for anything that you can avoid using it for. The fast way is to “vectorize” your code, and use the apply/sapply functions instead of a for loop (create a function that does the computation within the for loop and can call this with the apply function instead). For loops in R are generally slow and you should see significant improvements when you do this.

2 – Profile! It’s often something simple that is slow or you can streamline an R function.

3 – Google for packages that do what you need to do in C code, usually a library built for speed will do this (one example is when doing stepwise regression to avoid recalculating the entire decomposition when a single variable is removed/added).

Thanks! I knew about avoiding for() loops. I’ve tried vectorizing, but it honestly just confuses me (the documentation isn’t very helpful). Maybe they’re worth looking into.

In Python, the Cython module allows you to make a C function of the slowest part of the code that runs in C (I think). It seems to work well.

Here’s the vectorized version. Just replace your boot function with this:

Boot2 <- function(){

residBoot <- sample(errors, replace=F)

yBoot <- yHat + residBoot

modBoot <- lm(yBoot ~ x)

coef(modBoot)[2]

}

system.time(bootB2 system.time( bootB1 <- boot() )

user system elapsed

18.593 0.318 19.013

# Vectorized time

system.time(bootB2 library(plyr)

> system.time(bootB3 <- raply(1e4, Boot2()))

user system elapsed

19.007 0.296 19.304

Can I ask why you pull the 2nd coefficient only?

Mostly because I was just playing around.. I had no real interest in either coefficient since I knew them when I made up the data.

That’s what I thought. Also, it looks like I made a mistake entering my comment above since the middle part got cut out, leaving incorrect code. The standard vectorized way is:

system.time(bootB2 <- replicate(1e4, Boot2()))

A better way (IMHO) than R's built in apply functions is to use the plyr package from @hadleywickham above. Here it is:

library(plyr)

system.time(bootB3 <- raply(1e4, Boot2()))

The point I was making is that vectorizing does not increase the speed in this case. But it uses only 6 lines of code to write the Boot2() function versus 13 lines for the boot function you defined. I believe most find the vectorized version easier to read and debug.

I used to think that you need to choose language and be proficient it, but in my experience, you really need both languages (Python and R). I have found it contra-productive to spend time trying to make things work in one language when you have good capabilities in the other.

A nice way to get a higher level abstraction than C (vectors instead of pointers, etc) is to use rcpparmadillo (with cxxfunction). Armadillo is a high-performance linear algebra package for C++ with lots of fancy features, and rcpparmadillo is a really clean way to interface it from R, with nice type conversions etc.

I agree with MikeM that you should use both languages, though I’d emphasize that they address different tasks. In my mind, Python makes a lot of sense for the very front end (acquiring your data, cleaning it up, etc), and for the very tail end (creating a production pipeline for straightforward statistical processing), but the R is for the heavy lifting: exploring the data, testing various methods and models, visualizing the data, testing the results of your Python, etc.

You may have to do a LOT of Python programming to match an R package, if the techniques of that package are a great fit for your problem, though.

Last comments and an answer which I think is fastest so far:

1 – Vectorization not always fastest, profile first! Note this bit in your code:

Rprof(‘test.txt’)

# Run the bootstrapping function

system.time( bootB1 <- boot() )

mean(bootB1)

Rprof(NULL)

summaryRprof('test.txt')

Tells you exactly what take so long, should be the first step whenever anything is slow. Note glancing it over you can see that some peoples routines are obviously going faster by sampling with replacement.

2 – lm.fit is fine, but again if you REALLY care about speed, just profile and cut the graft. Below runs is 0.25 seconds on my machine versus 10.0 for the original, note that the .Call method here immediately calls the C function to do the matrix solving and avoids all the r code in lm.fit. This code could still be made a lot faster by just solving the linear equations and not calling the lm.fit C function which creates an R object with slots, etc.

# generate data

x <- 0:100

X<-cbind(1,x)

y <- 2*x + rnorm(101, 0, 10)

# run the regression

mod1 <- lm(y ~ x)

yHat <- fitted(mod1)

errors <- resid(mod1)

v <-function()

{

residBoot <- sample(errors, replace = TRUE)

yBoot <- yHat + residBoot

tol = 1e-07

z <- .Call(stats:::C_Cdqrls, X, yBoot, tol)

return(z$coefficients[2])

}

system.time( res<-replicate(10000,v()))

I think your post illuminates the conceptual design an usefulness R. Its a feature, not a bug, as they say.

The call to

modBoot <- lm(yBoot ~ x)

was done within an n=1000 for-loop using a high level built in function. It was probably the slowest way to do it.

But you know what? It was also a very natural and intuitive way to do it. Your residuals and coefficients all were easy to snag, and a breeze to find. So you paid a little price for this. 6 secs I think it was not too bad a price. As n or p grows, the need for slicker code will grow, and you will naturally start to implement the approaches some of the above posters noted (lm.fit, matrix algebra, eventually direct calls to C) But with n=1000 and p=1, your example is the proper way to do it.

R makes things easy so you can focus on the work. But it does allow the flexibility for computationally expensive work to be done on low level platforms if you care to do it. But most of the time it just won't be worth it. R is a spectacular middle ground whose ease will often outweigh any speed issues it has. The proof is in the pudding. Its popularity is not growing by leaps and bounds for nothing.

Off topic, I think I will build a neural net model in C, it will run 17 times faster than the NNET package in R. I think I will have it done by December.

your post deserves a follow up

If I could +1 this comment somehow, I would.

Also, you and another commenter hit on one of my cons that I listed about Python, namely that its statistical language just isn’t as clean as R’s is. R is incredibly simple once you get the hang of it.

What would you suggest for a follow up? I’m not doing anything in C. That language is a mystery to me.

The post would be titled, “Why my first post comparing speeds of Python and R misses the point..”

Just kidding, but the you use R is for the packages and the usability. That’s it. 10,000 people all experts in their field have written extensive code (sometimes for years of their life) and maintain it, document it well, always be backwards compatible, which is then vetted and ready for download on a whim. What would you pay for this service?

Oh yes- its free

Wanna parse genome data? Aisle 6

Wanna do some time-series obscure function? Lucky you, the time series department is running a sale.

Ready to plot in under 5 minutes? R hands down the best here.

The knowledge base in the totality of R packages is incalculable. Plus the built-ins are great for statistical computing. This is evidenced by the recent growth in Python packages, though they tend to be less meticulously maintained (generally).

Anyhow, here’s a little pdf that gives a little insight on vectorizing your code I found helpful. http://quanttrader.info/public/FasterRCode.pdf

note example #2 at the end.

I completely agree with your point (it’s why I use R in the first place).

I can’t help but wonder though… This post links to R-bloggers, so its naturally visited by R advocates. However, there are people who use python exclusively for data analysis (especially in bioinformatics and genomics) and everything else. I wonder what they might say? I honestly have no idea. Would be an interesting debate to watch. As it is, because this site only goes to R-bloggers, the debate is rather one-sided….

This is a key. An old programming proverb says, “First get it right, then make it fast.” Good programming languages allow you to do this.

R makes it trivial to fit the regression interactively, then try it in a short loop to see if it’s working as expected, then to do the full thing in a straightforward manner, and then — once it’s arguably correct — you can start doing things to make it faster. It’s not that Python can’t do this at all, but the first several steps — some of which we’ve not even mentioned — are considerably easier in R, and if you’d decided to go another direction R would have offered you many more options do to so.

at risk of devolving into contorted code vectorization, one could move all the bootstraps into a single lm.fit as below. This is 1/70th the time of the original on my machine. See https://gist.github.com/shabbychef/6296955

set.seed(101)

# generate data

x <- 0:100

y <- 2*x + rnorm(101, 0, 10)

X<-cbind(1,x)

# run the regression

mod1 <- lm(y ~ x)

yHat <- fitted(mod1)

errors <- resid(mod1)

boot4 <-function(n=10000)

{

residBoot <- matrix(sample(errors, size=n*length(errors), replace = TRUE),

ncol=n)

yBoot <- yHat + residBoot

modBoot <- lm.fit(X, yBoot)

return(modBoot$coefficients[2,])

}

system.time( bootB4 <- boot4() )

mean(bootB4)

I am a stats PhD student doing research in bioinformatics. I attempted migrating to python a year ago. Prototyping algorithms in python is much more efficient and productive than in R. However R is still the dominate language in statistics. It’s written by statistician, used by tons of more. When I want to compare methods I have no choice but do it in R.

The biggest problem in R is black box. Even though it is open sourced, due to poor documentation , wired performance tweaks, old FORTRAN codes. It is very hard for user to know what exactly happened.

My suggestion is not to tied to one language, use the easiest for your task. When you truly care about performance and reliability in scientific computation, as my me not once said, write your own code, use less packages as possible.

If you’re looking for readable performant scientific code you might also want to take a look at the Julia language!

I see a lot of replies saying that you could speed up R… that’s true but you can speed up Python as well! In general the trick is never use loops but work directly on the numpy array. Have a look to pandas, than you can speed it up even more using Cython (http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html). For statistics you might me interested in statsmodel http://statsmodels.sourceforge.net/.

Yep, I know about stats models. Its good, but it is limited. I have a new post today coming out about that

Pingback: R vs Python: Practical Data Analysis (Nonlinear Regression) | Climate Change Ecology

A note for those of who were confused as I was as to why the sample was not a subset: this form of bootstrap where you fit, resample the entire population and then add the residuals back to the fitted model is sometimes called “resampling residuals”.

Yes, and I caught another mistake after I posted that no one has brought up. Technically, in this form of bootstrapping, I’d sample with replacement, not without.

Next year R will integrate the abilities of pqR: http://radfordneal.wordpress.com/2013/06/30/how-pqr-makes-programs-faster-by-not-doing-things/

Anyway I think it’s not sane to say one language is better because of its speed. I personally prefer R because of its very intuitive programming language, because of certain great abilities (graphics, Shiny, knitr,… the list is long), because of the gigantic number of packages…. But it is not forbidden to use both languages. Both R and Python are excellent. If I’d really need speed, then I would switch to Julia (or simply call Julia from R for the task which is too slow).

I’ve been thinking about switching to Python for Markov Chain Monte Carlo applications. More evidence that this makes sense.

It would probably best R there. You’re probably doing different things than I am, but Python has PyMC which is a MCMC sampling kit designed for Bayesian inference (like JAGS or Stan). My issue with PyMC is that it can only run one chain at a time, whereas OpenBUGS, JAGS, or Stan can run several (one on each core, probably, but I haven’t done this yet). However, OpenBUGS, JAGS, and Stan are primarily used for (and limited to?) Bayesian model inference.

If you’re doing something else, Python would probably be faster, but there are tricks in R to speed it up (but there are tricks in Python to speed that up and this comments section is R-centric).

It’s possible to call Stan from R, which is a positive for R. I think Andrew Gelman is trying to get people to get it able to be called from Python as well. Nothing I’m doing is urgent enough that I can’t wait until it’s available in Python.

Great post and lots of great comments!

I don’t know a whole lot about Python, but R is very easy to program in parallel. I ran your R code in 4 seconds on 4 cores: http://practicalrvideos.blogspot.com/2013/08/multicore-parallel-processing-in-r.html. It required almost no modification of your code and could be scaled upwards to any number of available cores!

Pingback: Python and R: Is Python really faster than R? | Patient 2 Earn

Halving the time for running such an algorithm, after discounting the time for starting and clean-up, does not seem to be much to me. But interesting experiment anyway.

Have you tried using scikit-learn for your model-fitting needs? I did a brief scan through your posts and comments in them and was surprised to not see a mention of it.

I have not, but only because I’ve not yet run into a problem that requires it. I’ll probably need it for nMDS in the future, though.

old, but nice post. I love R! Working with data where I receive 200 GB / week, I spend 2 years trying to make R faster by using Rcpp. I resisted Python (Pandas) a lot, particularly because its PyData breaks down trying to write data frames that have columns with a mixture of numbers and strings, or write out massively large size files – total deal breaker compared to R’s RData binary format. But it came to a point, where it wasn’t scalable enough to work with R anyone. I was writing so much of Rcpp codes that I felt I should just move to C++. After I ended up writing my own custom binary format for Pandas data-frame (which I can also read back in R), I started rewriting my codes in R and found a speed up of almost 500x (my work is much less of statistics, but more about data manipulations). The only downside of Pandas is that it does not have integer NaN and converts integer columns with NaN to float. Still think as a language R is beautiful language, but need to get practical sometimes.

I’ve been meaning to make a post about R and Python again. All the R users are in love with Hadley’s dplyr package because of its grouping and intuitive stringing of commands. But I don’t think any of them realize it’s just an R port of the way Python/Pandas has handled data frames for years

I can speak somewhat about the hedge fund / investment world which relies more on data manipulations rather than statistics. I think the R vs Python debate currently is similar to the historic R vs Matlab and R vs Stata debates. In Matlab it used to take an incredibly long to read super large CSV files versus R (almost 1000x times) and Matlab’s cell structure didn’t have column names plus took too much size in memory causing out-of-memory issues (a year or two ago Matlab did release its own data frame structure called tables, but its too late). People resisted shifting from Matlab to R, but eventually a lot did. Matlab does have amazingly easy parallel processing functionally for Windows that R lacks and was one reason for the initial resistance. Similar thing happened in parallel with R vs Stata in the investment world. Stata’s dta data format is probably still faster than R’s RData binary format, but ability to write custom functions in Stata’s was super painful, and Stata used to have many limitations, such as no toolbox for optimization (linear programming etc.) and as a result a lot of people moved from Stata to R. With Pandas maturing quite significantly, I think a lot of R people will gradually move to Python for the speed up (a 6 hour real production task in R takes around 20 mins in Python). Folks who are resisting the moves are in love with R’s syntax, which I admittedly find better, but the speed of Python for real world stuff beats (I just hope someday someone with time on their hand creates a branch of numpy/pandas with Integer NaN).

I believe that Python code in line 38 should read:

residBoot = np.random.choice(resid, size=len(resid), replace=True)

insted of:

residBoot = np.random.permutation(resid)

due to the fact that bootstraping samples are generated from the original sample with replacement and not by permuting the original sample. Check the documentation for choice vs permutations.