# Cleaning Data and Graphing in R and Python

Python has some pretty awesome data-manipulation and graphing capabilities. If you’re a heavy R user who dabbles in Python like me, you might wonder what the equivalent commands are in Python for dataframe manipulation. I was also curious to see how many lines of code it takes me to do the same task (load, clean, and graph data) in both R and Python. I’d like to head off any arguments about efficiency or which language is better: neither my R nor my Python code is the super-efficient, optimal way to program. It is, however, how I do things, and to me that’s what matters. I’m also not trying to advocate one language over the other (programmers can be a sensitive bunch). I just wanted to post an example showing how to do equivalent tasks in each language.

First, R

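```r
# read data (loading step not shown; JapBeet_NoChoice is assumed to already be a data frame in the workspace)
library(ggplot2)  # needed for the plot below

# drop incomplete data
feeding <- subset(JapBeet_NoChoice, !is.na(Consumption))
# refactor and clean (recode Temperature 33 as 35)
feeding$Food_Type <- factor(feeding$Food_Type)
feeding$Temperature[which(feeding$Temperature == 33)] <- 35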

# subset
plants <- c('Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina')
subDat <- feeding[feeding$Food_Type %in% plants, ]

# make a standard error function for plotting
seFunc <- function(x) {
  x <- x[!is.na(x)]  # ignore missing values
  se <- sd(x) / sqrt(length(x))
  # ymin is the lower limit (mean - SE), ymax the upper (mean + SE)
  data.frame(ymin = mean(x) - se, ymax = mean(x) + se)
}

# ggplot!
ggplot(subDat, aes(Temperature, Herb_RGR, fill = Food_Type)) +
  # error bars: mean +/- 1 SE (show.legend replaces the deprecated show_guide)
  stat_summary(geom = 'errorbar', fun.data = seFunc, width = 0, aes(color = Food_Type), show.legend = FALSE) +
  # points at the group means (fun replaces the deprecated fun.y; needs ggplot2 >= 3.3.0)
  stat_summary(geom = 'point', fun = mean, size = 3, shape = 21) +
  ylab('Mass Change (g)') +
  xlab(expression('Temperature '*degree*C)) +
  scale_fill_discrete(name = 'Plant Species') +
  theme(
    axis.text = element_text(color = 'black', size = 12),
    axis.title = element_text(size = 14),
    axis.ticks = element_line(color = 'black'),
    legend.key = element_blank(),
    legend.title = element_text(size = 12),
    panel.background = element_rect(color = 'black', fill = NA)
  )
```

Snazzy!

Next, Python!

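```python
# imports (not shown in the original; the `py` alias for matplotlib.pyplot is an assumption)
import numpy as np
import pandas as pd  # used by the (omitted) read step
import matplotlib.pyplot as py

# read data (loading step not shown; JapBeet_NoChoice is assumed to already be a pandas DataFrame)

# clean up
feeding = JapBeet_NoChoice.dropna(subset=['Consumption'])
# assign the result back instead of using inplace=True on a column selection
feeding['Temperature'] = feeding['Temperature'].replace(33, 35)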

# subset out the correct plants
keep = ['Platanus occidentalis', 'Rubus allegheniensis', 'Acer rubrum', 'Viburnum prunifolium', 'Vitis vulpina']
feeding2 = feeding[feeding['Food_Type'].isin(keep)]

# calculate means and SEs
# (keyword "named" aggregation; the dict-of-functions form on a single column fails in pandas >= 1.0)
group = feeding2.groupby(['Food_Type', 'Temperature'])
sum_stats = group['Herb_RGR'].agg(mean='mean', SE=lambda x: x.std() / np.sqrt(x.count())).reset_index()

# PLOT
for plant in keep:
    # one errorbar series per plant species
    sub = sum_stats[sum_stats['Food_Type'] == plant]
    py.errorbar(sub['Temperature'], sub['mean'], yerr=sub['SE'],
                fmt='o', ms=10, capsize=0, mew=1, alpha=0.75,
                label=plant)

py.xlabel(u'Temperature (\u00B0C)')
py.ylabel('Mass Change')
py.xlim([18, 37])
py.xticks([20, 25, 30, 35])
py.legend(loc = 'upper left', prop = {'size':10}, fancybox = True, markerscale = 0.7)
py.show()
```

Snazzy 2!

So, roughly the same number of lines (excluding the imports of modules and libraries), although Python needed slightly fewer (barely). For what it’s worth, I showed these two graphs to a friend and asked him which he liked more, and he chose Python immediately. Personally, I like them both. It’s hard for me to pick one over the other. I think they’re both great. The curious can see my much older, waaayyy less efficient, much more hideous version of this graph in my paper, but I warn you... it isn’t pretty. And the code was a nightmare (it was pre-ggplot2 for me, so it was made with R’s base plotting commands, which are a beast for this kind of graph).
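Both plots draw their error bars at the mean plus or minus one standard error. For anyone who wants to see that calculation on its own, here is a rough, library-free sketch. The function name `se_limits` and the numbers are made up for illustration; they are not taken from the beetle data.

```python
import math
import statistics


def se_limits(values):
    """Return (ymin, ymax) = mean -/+ one standard error, skipping missing values."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    mean = statistics.mean(clean)
    # Sample standard deviation (n - 1 denominator), as in R's sd() and pandas' std()
    se = statistics.stdev(clean) / math.sqrt(len(clean))
    return mean - se, mean + se


# Hypothetical measurements, with one missing value
ymin, ymax = se_limits([2.0, 4.0, 4.0, 6.0, float('nan')])
print(ymin, ymax)
```

The lower limit comes first in the returned pair, matching the `ymin`/`ymax` order that ggplot2's error bars expect.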

## 11 thoughts on “Cleaning Data and Graphing in R and Python”

• It’s available on my website: if you go to publications, there is a link to the files and R code I used in the manuscript.

2. Thanks for the informative post. It’s good to see how people use these tools in real life. And I like your use of pandas groupby.agg with a dictionary of functions. Going to try that.

• It’s great for applying multiple functions at once, which I do all the time.

3. Great post, Nathan! That is what I was looking for: a comparison between R and Python for data analysis and graphics.
To make the R plot look closer to the Python one, I would have used size=5.
The Python legend has no caption and it is located inside the plot, which may be annoying depending on where the results are shown. In that sense, the R legend is better.

• It’s about data analysis in R and Python. And it’s nearly impossible to do data analysis in Python without Pandas. It would be like arguing that this post is also as much about ggplot2 as it is about R. So… yes?

4. Why didn’t you include the import commands? It would be much smoother to derive usable code if you had.
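A note on the `groupby.agg` pattern that comes up in the comments above: in pandas 1.0 and later, passing a renaming dictionary of functions to a single column raises an error, and keyword ("named") aggregation is the replacement. The sketch below shows several summaries computed at once. The column names mirror the post, but the table is a made-up toy example and not the real feeding data.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the post, the values are invented
toy = pd.DataFrame({
    'Food_Type': ['Acer rubrum'] * 4 + ['Vitis vulpina'] * 4,
    'Temperature': [20, 20, 35, 35, 20, 20, 35, 35],
    'Herb_RGR': [1.0, 3.0, 2.0, 6.0, 0.0, 4.0, 5.0, 5.0],
})

# Several summaries at once: each keyword becomes an output column
sum_stats = (
    toy.groupby(['Food_Type', 'Temperature'])['Herb_RGR']
       .agg(mean='mean', n='count', SE=lambda x: x.std() / np.sqrt(x.count()))
       .reset_index()
)
print(sum_stats)
```

Calling `reset_index()` turns the grouping keys back into ordinary columns, so the result can be filtered by `Food_Type` the same way the plotting loop in the post does.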