Citation Patterns in Ecology

I’m always curious to see who is citing my one paper. Turns out I actually have two papers, and the most cited paper (with 19 citations, which sounds paltry but for me is quite exciting) is certainly not the one I’d have expected. By any stretch of the imagination. But, my second paper has only been in print for about six months. So I began to wonder: How do I, a third year grad student, stack up against other grad students, professors, and ecology rock stars? What is the key to having your work well cited?

Since I’ve been learning Python, I realized I could figure this out pretty easily. I wrote a Python script to run names through Google Scholar, look for a user profiles (if it exists), go to the the profile page, and extract the data from the “Citations by Year” chart. This chart presents how many times your papers have been cited in a year (it’s not cumulative). I’d attach the Python script if I felt like spending the time figuring out how to upload the text file to WordPress (but I don’t).

I ran 20 names through the program and downloaded the number of citations per year for each. I plotted them out in R and ran some super basic analyses and found that it’s surprisingly consistent and highly variable. Number of citations increased allometrically across all authors. However, the slopes varied significantly among authors. In the graphs, values are plotted as lag since date of first citation (which is set to 1).

rawCites

FYI: Getting the y-axis to display ticks in exponential form was a bit tricky.

p +
geom_point( aes(lag + 1, cites + 0.1, color=name ), size=3, show_guide=F ) +
scale_y_continuous(
trans=log2_trans(),
breaks = c(0.1, 1, 10, 100, 1000),
labels = trans_format('log10', math_format(10^.x))
) +
scale_x_continuous(
trans = log2_trans()
) +
ylab('Number of Citations per Year') +
xlab('Years Since First Citation') +
theme(
axis.title = element_text(size=14),
axis.text = element_text(size=12, color='black')
)

I was a curious to know if there were differences between male and female ecologists, so I separated it out based on sex. The allometric relationship still holds for females but the number of citations per year for females increases more rapidly than it does for males.

mfCitesGranted, there a huge number of problems here. I’ve not standardized by number of publications or any other variable. However, that would only serve to reduce the noise. It seems that the key to being heavily cited is longevity. Of course, extrapolating beyond these lines would be dangerous (at 64 years, one could expect 10,000 citations per year!)

Anyway, I thought this was interesting (I was more interested in getting my Python script to work, which it did). Thoughts?

I’m also willing to share the author list with the curious, but I kept it hidden to avoid insulting people who aren’t on the list (it was really a random sample of people I know and people whose papers I’ve read, which explains the slight bias towards marine ecology and insects).

UPDATE I had a request for the Python script, so here it is.

from urllib import FancyURLopener
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re

# Make a new opener w/ a browser header so Google allows it
class MyOpener(FancyURLopener):
version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
myopener = MyOpener()
scholar_url = 'http://scholar.google.com/scholar?q=(query)&btnG=&hl=en&as_sdt=0%2C10'

# Define the search function for the google chart string
def findBetween(s, first, last):
    start = s.index(first) + len(first)
    end = s.index(last)
    return( s[start:end])

# Define the search for text within a string till the end othe string
def getX(s):
    start = s.index('chxl=0:') + len('chxl=0:')
    return( s[start:] )

# Define the function to actually get the chart data
def scholarCiteGet(link):
    # Navigate to and parse the user profile
    citLink2 = link.get('href')
    s2 = 'http://scholar.google.com' + citLink2
    socket = myopener.open(s2)
    wsource2 = socket.read()
    socket.close()
    soup2 = BeautifulSoup(wsource2)

    # Find the chart image and encode the string URL
    chartImg = soup2.find_all('img')[2]
    chartSrc = chartImg['src'].encode('utf=8')

    # Get the chart y-data from the URL
    chartD = findBetween(chartSrc, 'chd=t:', '&chxl')
    chartD = chartD.split(',')
    chartD = [float(i) for i in chartD]
    chartD = np.array(chartD)

    # Get the chart y-conversion
    ymax = findBetween(chartSrc, '&chxr=', '&chd')
    ymax = ymax.split(',')
    ymax = [float(i) for i in ymax]
    ymax = np.array( ymax[-1] )
    chartY = ymax/100 * chartD

    # Get the chart x-data
    chartX = getX(chartSrc)
    chartX = chartX.split('|')
    chartX = int(chartX[1])
    chartX = np.arange(chartX, 2014)
    chartTime = chartX - chartX[0]

    # put the data together and return a dataframe
    name = soup2.title.string.encode('utf-8')
    name = name[:name.index(' - Google')]
    d = {'name':name, 'year':chartX, 'lag':chartTime, 'cites':chartY}

    citeData = pd.DataFrame(d)

    return(citeData)

def scholarNameGet(name):

    # Navigate and parse the google scholar page with the search for the name specified
    name2 = name.replace(' ', '%20')
    s1 = ( scholar_url.replace('(query)', name2) )
    socket = myopener.open(s1)
    wsource1 = socket.read()
    socket.close()
    soup1 = BeautifulSoup(wsource1)

    # Get the link to the user profile
    citText = soup1.find_all(href=re.compile('/citations?') )

    if 'mauthors' in str(citText):
        citLink = citText[2]
        temp = scholarCiteGet(citLink)
        return(temp)
    else:
        citLink = citText[1]

    # If the link is to a user profile... get the data
    if 'User profiles' in str(citLink):
        temp = scholarCiteGet(citLink)
        return(temp)

    # If not, return 'no data'
    else:
        d = {'name':name, 'year':'No Data', 'lag':'No Data', 'cites':'No Data'}
        temp = pd.DataFrame(d, index=[0])
        return(temp)

# Run getCites once to populate the dataframe
finalDat = pd.DataFrame()

# Insert list of names here
sciNames = []

for name in sciNames:
    a = scholarNameGet(name)
    finalDat = pd.concat([finalDat, a])

plotDat = finalDat.pivot(index = 'lag', columns = 'name', values = 'cites')
plotDat = plotDat.replace('No Data', np.nan)
plotDat.plot()

Python Complements R’s Shortcomings

I’m a big fan of open-source software for research. For example, R-statistics, Qgis, and Grass GIS are awesome programs. R can do any statistical tests and numerical modeling you can imagine; if there’s not a built-in function you can write one (the beauty of using a programming language over point-and-click statistical programs). Granted, SPSS and Minitab do allow you to write scripts, but the language is less straightforward than R, and matrix-manipulation is easier in R. Same with Grass GIS. Qgis is an amazing substitute for ArcMap, with a beautiful interface. The other benefit of using a programming language for your research is reproducibility. You can save your code, re-run it to generate the same exact tests and figures so that you never forget what you did. Plus, you can share your code as appendices on your paper so that others can read your code to reproduce your results exactly. There is absolute transparency. This can be incredibly important, as noted in a post by Jeremy Fox over at Dynamic Ecology.

However, as wonderful as R is, there are things R does not do well. Symbolic math is one of them (R CAN do it, but I find the commands clunky). However, the Python programming language (which is a high-level language that can be used to make stand-alone software) has excellent modules (like R packages) for matrix manipulations and symbolic math. Although I’m still new to it, I dare say that Python’s mathematical package is near as complete as MATLAB (at least for my uses). In fact, Python’s numerical packages are widely used in physics, astronomy, and bioinformatics.

Python makes use of three main modules for math: scipy, numpy, and sympy. Scipy and Numpy together make Python a powerful MATLAB-esque program, while Sympy adds capabilities for symbolic math. Matplotlib (and by association Pylab) provide excellent graphing capabilities. Another benefit is that the language is similar to (but not identical to) R, so that transitioning should be easy. R does have a python-language interface package, but it is slow and you still need to learn Python. So why not just run Python directly. Additionally, Spyder provides an excellent RStudio-esque IDE for Python (although I think it’s Mac only). UPDATE: Spyder apparently works on both Windows and Linux. My bad.

I can’t walk you through the installation of Python and its modules (there’s a huge amount of material already available, and if you use a Mac, I highly recommend the MacPorts installation route). But I can show you what it can do. I developed a Python script to model Lotka-Volterra competition equations. Here’s the annotated script with the figure output, just as a demonstration of the capabilities of Python’s symbolic math package Sympy as an excellent complement to R’s highly advanced numerical techniques. Interested folks can look up Intro to Python material to get a handle on the language, although if you know R, it should be easy enough to figure out. Same goes for those of you familiar with MATLAB. The biggest difference is that when you import a module, you need to type the module name (or nickname if you use ‘as’) before the command so that Python knows where to look for the command. (NOTE: I’ve verified all of Python Sympy’s results with MATLAB. Although they give different equations in the output, they’re identical albeit with some algebraic manipulation).

# Import the necessary modules (like R packages)
# LOTKA-VOLTERRA COMPETITION
import sympy as sm
from scipy import integrate
import pylab as py
import numpy as np

# First, define the variables as symbols (like MATLAB)
r1, N1, K1, a12, r2, N2, K2, a21 = sm.symbols('r1, N1, K1, a12, r2, N2, K2, a21')

# Define the Lotka-Volterra differential equation symbolically
sp1Diff = r1*N1*(1 - (N1 + a12*N2)/K1)
sp2Diff = r2*N2*(1 - (N2 + a21*N1)/K2)

# Make the equality, setting the equations equal to zero
sp1Mod = sm.Eq( sp1Diff, 0 )
sp2Mod = sm.Eq( sp2Diff, 0 )

# Solve each equation for the relevant species (equilibrium values)
sp1Eq = sm.solve( sp1Mod, N1 )
sp2Eq = sm.solve( sp2Mod, N2 )

print 'The equilibrium population size for species 1 is\n', sp1Eq
print 'The equilibrium population size for species 2 is\n', sp2Eq

# Above, we've used sympy to symbolically find the equilibrium
# population sizes for both N1 and N2. The code is very straight-
# forward and simple (compare to R's symbolic math abilities).

# Unfortunately, the equilibria depend on the population size of
# the other species. Assume that both species are present.
# Make a system of linear equations using the two equilibria,
# solving for N1 and N2.

sys1 = sm.Eq( K1 - N1 - N2*a12, 0 )
sys2 = sm.Eq( K2 - N2 - N1*a21, 0 )

# Solve this system for N1 and N2, giving the equilibrium
# population size when both species are present

sysSolve = sm.solve( (sys1, sys2), N1, N2 )
print 'The equilibrium population size for each species in the presence of the other is\n', sysSolve

# Next, we use scipy to solve the Lotka-Volterra model numerically
# given a set of chosen parameter values (this can be done in R
# easily, but we're on a roll and it's also very easy in Python.
# Arguably easier, actually.

# Note that arrays in Python start with 0, so that K[0] in Python
# is equivalent to K[1] in R (K[1] = K[2] in R, etc.).
# I prefer R's technique here, but
# I'm sure there's a computer-sciencey reason for using zero-based
# arrays.
def lvComp(N, t, r, K, alpha):
    return(
    r[0] * N[0] * (1 - (N[0]+alpha[0]*N[1])/K[0]) ,
    r[1] * N[1] * (1 - (N[1] + alpha[1]*N[2])/K[1])
    )

# Now, set the parameter values. You can vary these later
r = [0.5, 0.2]
N = [2, 2]  # Initial pop size
K = [100, 50]
alpha = [1.2, 0.2]

# Set the duration of the model
tmax = 100
t = range(0, tmax + 1)
# Range goes from the starting point to n-1, n is the end point

# Now here it gets complicated! Solve the ode
Nt = integrate.odeint( lvComp, N, t, args = (r, K, alpha) )
# Done!

# Get the equilibrium pop sizes by plugging in our chosen parameter
# values into the equilibria calculated above. This is why I
# didn't use R to solve the ODE. I've already calculated the
# symbolic equilibrium, and substituting my parameters in is easy.
sp1EqLine = sysSolve[N1].subs( [ (K1,K[0]), (K2, K[1]), (a12,alpha[0]), (a21, alpha[1]) ], simultaneous=True)
sp2EqLine = sysSolve[N2].subs( [ (K1,K[0]), (K2, K[1]), (a12,alpha[0]), (a21, alpha[1]) ], simultaneous=True)

# Make a vector of equilibria
plotequils = (sp1EqLine, sp2EqLine)

# Plot the ode results
py.plot(t, Nt, '--o')
py.xlabel('time (t)')
py.ylabel('N(t)')
py.legend(['Sp1', 'Sp2'], loc='lower right')
py.title('Lotka-Volterra Competition Between Two Species')

for line in plotequils:
    py.axhline(y=line, linewidth=2, color='red', ls='--')

ODEresults

# To print zero-growth isoclines, first isolate the equilibrium
# equation for each species
sp1IsoEq = sp1Eq[1]
sp2IsoEq = sp2Eq[1]

# We're plotting a phase-plane graph with species 2 on the y-axis
# so we need to re-arrange the species 1 equation to predict N2
# based on N1 (it's currently the other way around)
sp1IsoEq = sm.Eq(sp1IsoEq, N1)
sp1IsoEq = sm.solve(sp1IsoEq, N2)

# Get the range of N1 on the x-axis. Depending on the parameter
# values, the max can either be K1, or K2/a21
n1Range = range(0, int( max(K[0], K[1]/alpha[1]) + 5))

# Calculate the zero-growth isocline for species 1
n1Iso = [] #makes an empty list

for n in n1Range:
    a = sp1IsoEq[0].subs([(K1,K[0]),(N1,n1Range[n]),(a12,alpha[0])], simultaneous=True)
    n1Iso.append(a)

# Calculte the zero-growth isocline for species 1
n2Iso = []
for n in n1Range:
    a = sp2IsoEq.subs( [ (K2,K[1]), (N1, n1Range[n]), (a21, alpha[1]) ], simultaneous=True )
    n2Iso.append(a)

# Plot the results, showing the model output overtop
py.clf()
py.plot(n1Range, n1Iso, '--')
py.plot(n1Range, n2Iso, '--')
py.plot(Nt[:,0], Nt[:,1], 'b^', label='Model Output')
py.axis( [0, max(K[0], K[1]/alpha[1])+0.1*max(K[0], K[1]/alpha[1]), 0, max(K[1], K[0]/alpha[0])+0.1*max(K[1], K[0]/alpha[0])] )
py.xlabel('N1')
py.ylabel('N2')
py.legend( ['N1', 'N2', 'Model Output'], loc='upper right' )
# Get text value
py.text(0.5, K[1], 'K2', va='center')
py.text(0.5, K[0]/alpha[0], 'K1/a12', va='center')
py.text(K[0], 0.5, 'K1', ha='center')
py.text(K[1]/alpha[1], 0.5, 'K2/a21', ha='center')

IsoPlot

Normally, we could stop here. But we’ll go one further and plot a vector field (something else python makes easy)

# First, set up a grid of N1 and N2 values (similar to
# expand.grid() in R)
X,Y = np.meshgrid(
    np.arange(0, K[0] + 0.1*K[0], 5), # 0 - K1, every 5th number
    np.arange(0, K[1] + 0.1*K[1], 5)
    )

# Python wants a starting X,Y position (above) and vectors of
# of change along the X and Y axes (U,V). These are what the
# differential equations describe, so we can just use the very
# first equations over every element in X and Y
U = r[0] * X * (1 - (X + alpha[0]*Y)/K[0])
V = r[1] * Y * (1 - (Y + alpha[1]*X)/K[1])

# plot the results
py.clf()
py.plot(Nt[:,0], Nt[:,1], 'go')
Q = py.quiver(X,Y,U,V)
py.plot(sp1EqLine, sp2EqLine, 'b*', ms=15)
py.plot(0, K[1], 'b*', ms=15)
py.plot(K[0], 0, 'b*', ms=15)
py.axis( [-1, K[0] + 0.1*K[0], -1, K[1] + 0.1*K[1] ] )
py.xlabel('N1')
py.ylabel('N2')
py.title('Vector Field of Population Trajectories')

VectorField

I can’t advocate Python over R for statistics, R is too well defined, has too many functions, and has much better model formulation (to name a very few of the long list of reasons R is better for stats). But hopefully I’ve convinced you that Python is a great platform for symbolic math, calculus, and modeling that can complement R. Together, R and Python make the world’s best open-source mathematical software combination. Also, as with R, if you want it, there’s a Python module for it (GIS, GPS tracking, astronomical predictions, physics, etc). Also, Python is a high-level programming languaging, meaning that it’s possible to integrate your math with other Python modules and programs meant for things like information gathering from websites, etc.

(NOTE: Sage exists as an open-source alternative to MATLAB, Maple, and Mathematica combined and is excellent, but it’s largely built on Python and uses many of the same modules. I’m not sure where Sage extends the capabilities of Numpy and Scipy, but definitely check it out. It’s possible that Sage expands on Numpy, Scipy, and Sympy)