Piecewise regression comes about when you have ‘breakpoints’, where there are clearly two different linear relationships in the data with a sudden, sharp change in directionality. This crops up occasionally in ecology when dealing with, for example, species richness of understory plants and forest age. There is initially a rapid drop as succession progresses (rapid colonizing plants die as light becomes more limiting), then a sudden breakpoint and a more moderately paced rise in species richness as more shade tolerant plants and seedlings of older trees begin to colonize the mature forest.

If you Google ‘R piecewise regression’, you may get a variety of methods and advice on how to run a piecewise regression. Essentially, you can do it manually or use a canned package to run the regression. If you’re like me, you might be wondering: Are these methods are the same? If not, which is best? How do I actually use them? Well, hopefully I’ll answer some of these questions.

There are essentially two methods I’ll walk through: brute force iterative approaches (as in ‘The R Book’ by Crawley) and the ‘segmented’ package. Using these approaches allows you to statistically estimate the breakpoint, which is better than just eyeballing it and fitting two models around what you *think* is the breakpoint. Let statistics do the work for you objectively.

First, let’s generate some segmented data:

x <- c(1:10, 13:22) y <- numeric(20) ## Create first segment y[1:10] <- 20:11 + rnorm(10, 0, 1.5) ## Create second segment y[11:20] <- seq(11, 15, len=10) + rnorm(10, 0, 1.5) ## Plot it par(mar=c(4,4,1,1)+0.2) plot(x,y, ylim=c(5, 20), pch=16)

You should get a V-shaped plot, where the right side is a little shallower sloped.

**METHOD 1: ITERATIVE SEARCHING**

First, we’ll model the data using the iterative search procedure described by Crawley. We’re going to use a model that looks like this:

In this case, *c* is the breakpoint. *I*(*x < c*) and *I*(*x > c*) are essentially dummy variables. *I*(*x < c*) is 1 if x is less than the breakpoint and 0 if it is above. *I*(*x > c*) is a dummy variable that is 1 if x is above the breakpoint and 0 if it is below. It should be obvious that there are two sets of parameters being modeled, depending on the value of x.

The complicated bit is choosing the breakpoint. We can eyeball the data and say that the breakpoint is somewhere between 9 and 17. Choose a wider range than you might think, just to be safe. Create a variable called breaks to hold these breakpoints:

breaks <- x[which(x >= 9 & x <= 17)]

Now we’re going to iteratively search these breakpoints for the model that has the lowest residual MSE, using that as our criteria for the best model. Create an empty container for MSE values from each model, and use a for( ) loop to run a linear regression for each possible breakpoint. Formulate the linear model exactly like the above formula.

mse <- numeric(length(breaks)) for(i in 1:length(breaks)){ piecewise1 <- lm(y ~ x*(x < breaks[i]) + x*(x>=breaks[i])) mse[i] <- summary(piecewise1)[6] } mse <- as.numeric(mse)

If we plot MSE by breakpoints, we can visually estimate the breakpoint as the lowest point on the curve:

It could easily be either 13 or 15, so I just pick the breakpoint with the lowest error:

breaks[which(mse==min(mse))]

This tells me that the breakpoint in my simulation was 15. I run my final model with my newly estimated breakpoint:

piecewise2 <- lm(y ~ x*(x < 15) + x*(x > 15)) summary(piecewise2)

The summary looks like this:

This is a pretty nasty output. Read it like this: (Intercept) is more or less useless on its own. The intercept for the line when x < 15 is (Intercept) + x < 15TRUE, or 3.3133 + 16.6352 = 19.9485. The slope of the line when x < 15 is x + x:x<15TRUE, 0.5843 – 1.3025 = -0.7182. So, when x is less than 15, the formula is 19.9485 – 0.7182x.

When x is greater than 15, the intercept is 3.3133 – 0.9116 = 2.4017 and the slope is just the variable x, 0.5843. So when x is greater than 15, the line has the equation 2.4017 + 0.5843x. Notice this table excludes the coefficients with NA estimates. Those arise from singularities and can be ignored.

plot(x,y, ylim=c(5, 20), pch=16) curve((3.3133 + 16.6352) + (0.5843-1.3025)*x, add=T, from=1, to=15) curve((3.3133 - 0.9116) + 0.5843*x, add=T, from=15, to=max(x)) abline(v=15, lty=3)

Notice that the segments **were not** constrained to be touching or continuous. This is inherent in the algorithm that we used.

**METHOD 2: USE ‘SEGMENTED’ PACKAGE**

To use the ‘segmented’ package you will, of course, need to install and load it. The procedure of ‘segmented’ uses maximum likelihood to fit a somewhat different parameterization of the model:

*I*(*x > c*) is a dummy variable as above, so when x < c, the model is essentially:

The *γ* term is simply a measure of the distance between the end of the first segment and the beginning of the next. The model converges when *γ *is minimized, thus this method **constrains the segments to be (nearly) continuous**. This is a *major* difference from the iterative approach in Method 1 above.

To use this method, you first fit a generic linear model. You then use the segmented( ) function to fit the piecewise regression. The segmented( ) function takes for its arguments the generic linear model, seg.Z which is a one sided formula describing the predictor with a segment (we only have one predictor, x, which has the segment), and psi, which is a starting value of the breakpoint (as in nls( ), you need to supply a best-guess estimate). More complicated models are a bit more complicated in terms of arguments, but this is a good starting example.

In our case, x is the predictor with a segment (it’s the only predictor) and based on my very first scatterplot (the first graph on the page), I’m going to guess the breakpoint is 14.

lin.mod <- lm(y~x) segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

Using the summary(segmented.mod) command, you will get the estimated breakpoint +/- some error (12.07 for me), the intercept and slope for the first segment, and U1.x, which is the slope of the second segment.

**UPDATE:**

**U1.x is not the slope of the second segment. It is the difference in slopes between the first and second segment. So if your coefficients are x = -0.9699 and U1.x = 1.4163, then the slope of the second segment is -0.9699+1.4163 = 0.4464. Using the slope(segmented.mod) command will give you the slopes of each segment, which should match the above calculation.**

Plotting it is simple:

plot(x,y, pch=16, ylim=c(5,20)) plot(segmented.mod, add=T)

There you go: two methods for piecewise regression. Each is relatively simple to implement, but they do very different things. If you are dead set on piecewise regression, make sure you choose the one that makes the most sense for your question: do the segments need to be continuous or can they be discontinuous? Make sure you have a rationale for either.

Also note that these models are **not**** nested, **so you can’t used likelihood ratio tests for model selection. You can try to use AIC for model selection, but I advise you to think logically about the continuity of the segments instead.

Hi

This piecewise regression is also called sometimes “threshold regression”. In the time series context, threshold auto-regressive models (where y_{t-1} is the regressor) are available in the package tsDyn also.

Thanks for pointing this out! I’m very unfamiliar with the methods and packages available for time series and often forget about them, simply because I don’t work with that type of data regularly.

Hi Nathan, I am also currently fiddeling around with PW-regressions. Did you ever a lattice plot with multiple segmented regressions in a lattice layout? i.e. 2 rows 3 columns and each one with its one seg-regression? nice post!

I haven’t before, but I just played around with it based on my example. segmented.mod$psi contains the breakpoints, so you can make use of that. Here’s a fairly simple code to do what you’re requesting based on the above example, which is simplified with one predictor and one break (you can view this code as a continuation of the post). This also assumes that you’re using the model from the ‘segmented’ package as the preferred model.

This is about as complicated as I’ve gotten, but you can probably extend this up to multiple breaks fairly easily (although nothing is ever truly easy).

Hi, that´s very nice! I was thinking of another output though. I have 20 stations for which I want to have a PW-regression in each panel. Hence it would be something like xyplot(hour ~ ppm | station, data=data.frame,panel=function(x,y,…) { segmented regression here?}

But many thanks for your example! this brought me again a bit closer. I think I will give it a try on stackoverflow.com

cheers

Ahh sorry I misunderstood. Are you thinking of a separate piecewise regression, one for each station then? I’m not sure, but the code provided by Z below might do what you’re thinking more easily.

Hi Nathan,

Some time ago I wrote the following code for a grid search, am not 100% sure but I think this is what you refer to as “iteratively”. Code can be found here: http://eranraviv.com/blog/piecewise-regression/

Thanks for the post!

Yes, your code is very similar to my procedure. I played with it a bit and I’m not sure the two methods give equivalent results, but I haven’t really dug into it.

Note that for optimizing piecewise regressions without constraints for adjacent segments, the “strucchange” package can be used. It assumes that the observations are already ordered with respect to the variable along which you want to split (which may be different from the variable/s you regress on). In this case, the observations are already ordered with respect to x, hence I can directly do the following:

This considers all possible solutions with a minimum segement size of 4 observations, i.e., up to 3 breakpoints. The plot then compares the best residuals sum of squares and associated BIC that can be obtained for 0, 1, 2, 3 breaks. There’s a clear drop from 0 to 1 where BIC is minimized. The associated breakpoint is

Note that returns the last value of x in the first segment (not the first value of x in the second segment as your solution I did). The associated models are of course equivalent. Finally, you can extract intercept and slope in both segments or visualize the models. A lattice xyplot that one of the previous commenters suggested can also be easily created:

For more details on finding breakpoints with strucchange, see the associated manual page and also the corresponding 2003 paper published in Computational Statistics and Data Analysis – the second reference in citation(“strucchange”).

Great comment. Thanks for the example!

How about fitting a model like y = \beta_0 + \beta_1 * abs(x – \beta_x) + \beta_2 * x

say using nls? I’ve used this with some success in circumstances where I needed a

two-segment fit. The slopes of the two segments are \beta_2 + \beta_1 and \beta_2 – \beta_1 and the intersection is \beta_x. What’s nice is that you can parameterize it to constrain one or both slopes to have fixed values, as well.

See Equations (1) in

http://www.sbri.fr/files/publications/knoblauchvitalbarbur01.pdf

and also Equation 2 of

http://www.journalofvision.org/content/11/10/15.full.pdf

for an example of a fit for which the slopes are constrained.

I like this a lot. It’s very simple and intuitive. The methods above probably give reliable starting estimates for the non-linear regression as well. Thanks for the tip!

Hi, I have been using the segmented package to fit regressions to describe a stepwise change in heart rate in single individuals and it has worked brilliantly for this purpose. I want to compare levels of variation around each of the segments (two in our instance), however in the output I get both a coefficient and S.E for each line through the summary(model) and if I use the slope(model) I get a slope and S.E too. The output from slope(model) makes more sense in light of my data i.e. a negative value for a slope corresponds to a negative slope, whereas the same is not true for the summary(model).

However, I just wanted to clarify whether this approach is correct and that slope(model) produces the values I should be using to describe the gradient of slopes and the error about each slope?

Many thanks for any advice and also this very useful article.

Oli

I updated the post to address your issue because there was a mistake in it that you caught. Using the segmented package, the summary(model) gives you x and U1.x. x is the slope of the first segment, and U1.x is the difference between the original slope and the new slope. slope(model) gives the correct slopes, but I’m betting that if you add x and U1.x together, you’ll get the same values reported by slope(model). At least, that’s how it worked for me.

I’d use the output reported by slope(model).

Hi Nathan, thank you for the excellent article. I have two questions: (1) How would you change your R code to estimate a piecewise regression on time series data in an xts or ts object? and (2) Is it possible to estimate a piecewise multiple regression where only some (not all) of the independent variables have breaks?

Hi Geoffrey, I’ll try to answer your questions as best I can.

1) I’m not too familiar with this. My default starting point would be to compare linear models with and without autoregressive error functions using AIC. If the autoregressive error function doesn’t add much to the model, then I’d treat the time series as a regular linear model and use the standard technique. I’m not totally sure it’s appropriate, but that’s where I’d start. If you must use a time series analysis because of autocorrelation, I’d look into the ‘tsDyn’ package suggested by Matthieu (he mentions threshold autoregression, which may be what you’re looking for).

2) I believe the ‘segmented’ package can do this. You specify your regular linear model (lin.mod <- lm(y ~ x1*x2)) and then in the segmented command, you only specify whichever effects you are interested in (segmented.Z=~x1). I'd double check on that, but that's the impression I get from the help file. Check the Rnews article for more info: http://dssm.unipa.it/vmuggeo/segmentedRnews.pdf

Hi! I am a new R user and I am trying to use Method 1 above to find a breakpoint in my data. Is there a way to get more refinement in the breakpoint? That is, I would like more sig figs in the output. Is it always going to yield a whole number?

It will in my example because I was only searching whole numbers. If you set up a sequence of numbers with more significant digits, you can get more refinement. For example, I was searching only integers between 1 and 20 (or so). In R, this is 1:20. If I want to search between 1 and 20, but include three significant digits, I set up a range of numbers seq(1, 20, by=0.001). Be careful, because this is an incredibly huge amount of numbers, so you need to balance the range you’re searching (i.e. 1 and 20) with the precision (i.e. 0.001) to optimize the amount of time it takes. So in my example, I was searching integers 5:20, which I set up using breaks <- 5:20.

If I wanted to search between 10 and 16, by every 0.01, I reset breaks as breaks <- seq(10, 16, 0.01).

Dear Nathan,

There has been some animated discussion about whether there has been a “pause” in global temperature increases in the last 16 years

see : http://joannenova.com.au/2013/02/the-emerson-v-bolt-argument-on-air-does-emerson-not-know-statistics/#more-26952

Much of this discussion illustrates a misuse of statistical significance

How do you think this statistical question could be better put. Could you fit a linear model to the entire data set and compare with a piecewise regression with a breakpoint 16 years ago ?. A two regression line model obviously has more parameters than a single line – would AIC/BIC adequately test this ?

I hadn’t heard of this argument. Thanks for pointing it out. Clearly, warming trends are difficult to analyze and depend entirely on the dataset used, the location in question (some places are warming faster than others), the length of the time series used, etc. I’m not familiar enough with this argument or the data in question to weigh in on the argument. So my comment is purely statistical (which I think is what you’re getting at).

First, fitting a linear regression to time series data is inappropriate (which it looks like is what the blog author did). This is a time series, and you can clearly see autocorrelation present in the graph. So you need an autoregressive model, and you can use piecewise regressions with autoregressive functions.

Second, I think AIC would be a good way to test this. In fact, let’s say you use the absolute value function to fit the piecewise regression, where y = b0 + b1*x + b2*|x-c|. You can fit a second, OLS model: y = b0 + b1*x. The linear model is a subset, or nested within, the absolute value function (set b2 = 0). I believe (not 100% sure, but pretty sure) in this case, not only can you use AIC, but you can use a log-likelihood ratio test since one model is nested within the other. If you use the other methods described, you may have to use AIC, which can distinguish among non-nested models.

Hi

Note that in the case you have, the “transition” variable is not a simple variable, but it is time. Although this seems to be a small difference, it has some implications on the inference for the parameters, as well as for the testing procedure, so I do not think you can use directly these methods.

However, R has a very nice strucchange package, that will allow you to do everything you want to estimate and test the “structural break” you mention🙂

Thanks! I never deal with time series data, but I do know that when time is the predictor, things get tricky. I keep referring people to your comment when dealing with autoregressive piecewise regressions because strucchange and tsDyn seem to handle this issue well.

I guess whether you want to fit a piece-wise linear model will depend on whether you are willing to accept that the climate can change abruptly or do so gradually. I blogged about this topic in the context of fitting additive models to the NH temperature series, which you might find interesting or useful for comparing with any piece-wise linear fits you do.

Nathan,

Great post, I am interested in using segmented regression to model the thermoneutral zone and lower critical temperature of metabolism in endotherms. The thermoneutral zone theoretically has zero slope (metabolism doesn’t change with temperature in this zone). Is there a way to force the slope of one segment of the regression to be zero???

Thanks

andy

There is. This was covered in one of the previous comments actually

Pingback: Piecewise regression | Nguyen Hoai Tuong

Thanks a lot for this very clear paper.

I am also playing around with piecewise regressions and was wondering if with your ‘manual’ method there is a way to impose a zero-slope on one of the piecewise regressions.

Thank you,

Yes, there is. You can use dummy variables in a clever way to get this to happen. Imagine you’ve run the LS algorithm to find the breakpoint (the ‘manual’ method). The following code will force the line greater (or less than) the break point to be flat.

Note that this still allows discontinuous jumps. I tried playing with the segmented package to get a continuous line, but I haven’t come up with a way just yet. I’m sure there’s a way to incorporate this setup into the absolute value function as well, but I haven’t played with it much. This should get you started though.

Sure with the absolute value function, you can constrain the slopes of either segment. To refresh from above, the function to fit is

y = \beta_0 + \beta_1 * abs(x – \beta_x) + \beta_2 * x

You can calculate the values y at 0, \beta_x and 2 * \beta_x from which you can derive expressions for the slopes of the two segments. Then, fix one of the slopes to an arbitrary value, say 0, and solve for the values of the beta’s that you need to fix. Again, see

http://www.journalofvision.org/content/11/10/15.full.pdf

where we fixed the initial slope to -0.5 and the second to 0, for theoretical reasons.

Yep, I just sat down and did this. For example, if you want to fix the slope x > c to zero, here is the proof:

When x > c, the absolute value function b0 + b1|x-c| + b2x turns into b0 – b1c + (b1 + b2)x.

However, we want this slope to be zero, (b1 + b2) = 0.

Solve for b1

b1 = -b2.

Plug this into the original equation

b0 + b1|x-c| + -b1x

b0 + b1(|x – c| – x)

Run that last equation through nls() and there you have it. This gives the same result as Ken’s formula in the article:

b0 + b1(|x-c| -x + c)

Dear Nathan,

thank you for the codes, it is very helpful. Would there be another reference for the iterative method than ‘The R Book’ by Crawley ? I looked in that book and it doesn’t give any references either…

Not that I know of

Hi

Could you precise the page in Crawley you found this? It is somehow srange to coin this iterative search, as here evaluations are independent of each other.

This type of search comes by multiple names, grid search regarding the algorithm, and regarding the estimator, profile likelihood/SSR, or concentrated least square (as you search only the hyper-parameter, concentrating out/profiling the other slope parameters).

Standard reference for this (including refs to original papers) would be: Hansen (2000) Sample Splitting and Threshold Estimation, Econometrica, Vol. 68, No. 3. (May, 2000), pp. 575-603.

Mat

Sure. If I remember to later this week, I’ll post it.

Crawley (The R Book): pg 425 – 430

Dera Nathan,

I used tow segmented linear function. where the the slope of one segment is fixed to Zero while the intercept is varying. the breakpoint and the other Segment has no constraints. This was done for different genotypes and what i want to do is to compare the model parameteres between genotypes somthing like (Tukey test ) and to have the significant lettter. is that possible?

thank you in advance

Hi Nathan,

I am trying to fit some segmented models to a large and messy data set and your post has been very helpful. I have played around a lot with the segmented package but i keep getting wildly different results for my segmented models on the same variables. This has made me reconsider doing it manually. I have followed your procedure and I get results but they are 1) fairly different from the segmented() model and 2) different from what I would guess. Maybe there is just a difference that I need to make in the coding of the for() loop. This is what I have.

breaks=-2 & RE<=2)]

mse <- numeric(length(breaks))

for(i in 1:length(breaks)){

piecewise1 <- lm(WTM ~ RE*(RE =breaks[i]))

mse[i] <- summary(piecewise1)[6]

}

One question I have is what is that [6] doing? Maybe thats the issue? I have changed that around but then I get an error.

plot(breaks,mse)

Error in xy.coords(x, y, xlabel, ylabel, log) :

(list) object cannot be coerced to type 'double'

Any suggestions?/Help? I apologize if I am missing a basic R concept or something that you already posted. Thanks!

The [6] pulls out the estimate of MSE from the regression.

First, I can’t tell what the first line does, and it may just be a wordpress error. Remember that you’re manually setting an area to search, so it should be something like:

breaks = -2 & RE <= 2)]

Second, I define the model differently. I'd use:

piecewise1 <- lm(WTM ~ RE*I(RE = breaks[i]))

That’s the proper model to give two sets of coefficients for each segment. Your model is mis-specified (and not really doing anything, in the RE=breaks[i] part. I’m surprised it doesn’t throw out an error here, because that’s not a logic statement).

Third, you’ve forgotten to covert mse to a numeric before plotting:

mse <- as.numeric(mse)

plot(breaks, mse)

Last, don't expect this search method and the segmented package to give the same results. This method does not constrain the segments to be continuous, whereas segmented will.

Thanks this is great. One other question. When reading directly from the output on what the p values mean. From top to bottom.

A) Test of whether the intercept is different from 0 (there is a linear relationship)

B) Test for whether the slope of the line to > BP is different from 0

C) Test for whether the intercept of the line BP is different from A

E) Test for whether the slope of the line < BP is different from B

And the f stat and pvalue for the whole output is testing whether the segmented relationship is "real" overall.

Right?

Thanks again

oh and

D) Test for whether the intercept of the line < BP is different from A

Hi Nathan,

I’ve found your R code extremely helpful in implementing my own two-piece regression model. However, it seems like my data may be better suited to a three-piece model. I’m wondering how you would adapt your code (method 1) to find two break points. I’ve been trying to figure it out, but I’m not amazing with R.

Thanks!

Erica

Thats a great question that I don’t know the answer to off hand. I know the segmented package can do multiple break points, but beyond that would probably take me a while to figure out. Try starting here: https://rpubs.com/MarkusLoew/12164

Hi Nathan,

Thank you for providing an easy to follow explanation of piecewise regression. I have been using the first method to find the break point in my data but I’m also interested in the equations for the two lines fitted. I had not had any problems until one set of data resulted in the output below:

Call:

lm(formula = y ~ x * (x 1.167317))

Residuals:

Min 1Q Median 3Q Max

-0.04256 -0.01920 0.00000 0.01273 0.04759

Coefficients: (2 not defined because of singularities)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.68164 0.09841 17.089 2.57e-06 ***

x 0.22311 0.06725 3.317 0.016056 *

x 1.167317TRUE NA NA NA NA

x:x 1.167317TRUE NA NA NA NA

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03221 on 6 degrees of freedom

Multiple R-squared: 0.9814, Adjusted R-squared: 0.9721

F-statistic: 105.4 on 3 and 6 DF, p-value: 1.404e-05

Would you know why there are two undefined coefficients and is there a way to still determine the equations of the regression lines?

Kind Regards,

Amy

it’s hard to know without seeing what y and x are. If you can share that, I can play with it to try and figure it out

That would be much appreciated. Thank you Nathan.

x <- c(1.694605199, 1.629409599, 1.561101384,1.501059262,1.444044796,1.369215857,1.260071388,1.167317335,0.954242509,0.397940009)

y <- c(2.107315802,2.044064085,2.004710406,1.973977879,2.012966149,1.961420389,1.986724077,1.956012392,1.944228391,1.40907652)

Hi Nathan,

Thank you very much for this great article about piecewise regression. I have been using the second method to find the break point in my data, and now I would like to find the confidence intervals for the breakpoint itself. Do you know of any method to do this?

Kind regards,

Matteo

If you’re referring to the lm() approach, there isn’t CI for the breakpoint. This is because its a part of the model you’re specifying, not an estimated parameter. If you want to get CIs on a break point, either use absolute value regression or the ‘segmented’ package.

Dear Nathan,

Thank you very much for your very helpful blog about the piecewise regression. I just started working with it but I’m a bit wondering how robust the package “segmented” is. Maybe I did something wrong but if I let the code run several times without changing anything I get different values for the breakpoint. In this case it does not matter if I use psi=17 or psi=median(x) I anyhow get differing values for the breakpoint. Do you know how to overcome this problem? Any advise? Thank you very much already in advance!

Kind regards,

Nadine

P.S: This would be my code:

x <- soil.moisture

y <- tree.water.deficit

piece <- segmented(lm(y~x), seg.Z = ~exp(-x), psi=17) # or psi=median(x)

summary(piece.test)

Dear Nathan,

Thanks for your helpful blog. I’m working with piecewise regression and i was trying the first method (iterative searching). I was trying to understand what is doing exactly this R-script and to do that I tried to decompose the piecewise function into its two parts. For instance, taking your generated segmented data, I took just one break x-point (x=9) to see what’s going on in each specific point, so I run the model:

piecewise1 <- lm(y ~ x*(x =9))

and I did the same whit only the “first term” and the second one separately:

piecewise1.1 <- lm(y ~ x*(x < 9))

piecewise1.2 =9))

These three models had exactly the same RSE, F-value, p-value, adjustedR2 for the whole model, but they had some differences in the coefficients (intercept, x) values output between the “two-pieces model” (piecewise1) and the partial model 1.2 (piecewise1.2 =9)). But, partial models 1.1 and 1.2 had exactly the same coefficients values for x=9, and x:x=9, but with the opposite sign (see the R-outputs below).

My queries are:

1. I don’t understand exactly what is mathematically happening behind this R-code: if the code splits up the total data in two (or more) segments, how is possible to obtain unique RSE, F-value, p-value, adjustedR2 values for the whole model? How are these values calculated?

2. Do you think it is correct to work only with one part of the piecewise model (piecewise1.1 <- lm(y ~ x*(x < 9)) ) instead the complete model (piecewise1 <- lm(y ~ x*(x =9))) when you are looking for the breakpoint? (Because the mse -or RSE value- It’s exactly the same in the three cases and, if I understood it right, this is the only value interesting to look for the breakpoint).

Thank you very much in advance!

Sincerely

Sergio Osorio

Hi Sergio,

I don’t fully understand your models. I re-ran my code:

x <- c(1:10, 13:22)

y <- numeric(20)

## Create first segment

y[1:10] <- 20:11 + rnorm(10, 0, 1.5)

## Create second segment

y[11:20] <- seq(11, 15, len=10) + rnorm(10, 0, 1.5)

Then I tried your two partial models and I got very different results:

piecewise1 <- lm(y ~ x*(x==9))

(Intercept) 16.50550 1.07033 15.421 1.99e-11 ***

x -0.17967 0.07944 -2.262 0.0371 *

x == 9TRUE -1.74902 2.42469 -0.721 0.4805

x:x == 9TRUE NA NA NA NA

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.354 on 17 degrees of freedom

Multiple R-squared: 0.2409, Adjusted R-squared: 0.1516

F-statistic: 2.697 on 2 and 17 DF, p-value: 0.09608

piecewise1 <- lm(y ~ x*(x<9))

summary(piecewise1)

Call:

lm(formula = y ~ x * (x |t|)

(Intercept) 8.64769 1.34275 6.440 8.17e-06 ***

x 0.28100 0.08065 3.484 0.00307 **

x < 9TRUE 12.30521 1.59752 7.703 9.03e-07 ***

x:x < 9TRUE -1.36081 0.18942 -7.184 2.17e-06 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

`Residual standard error: 1.111 on 16 degrees of freedom`

Multiple R-squared: 0.841, Adjusted R-squared: 0.8112

F-statistic: 28.21 on 3 and 16 DF, p-value: 1.264e-06

Did I get this right as far as what you’re attempting?

Dear Nathan,

First of all, thanks a lot for your so quick reply!! I’m sorry but something weird is happening with my first post! I revised it and the models are not those that I posted (?). I write them all again (with the outputs that I got):

(May I have your e-mail adress to send to you the codes in another format to avoid these kind of problems?)

> piecewise1 <- lm(y ~ x*(x =9))

> summary(piecewise1)

Call:

lm(formula = y ~ x * (x = 9))

Residuals:

Min 1Q Median 3Q Max

-2.86566 -0.99281 0.03548 0.97120 2.83641

Coefficients: (2 not defined because of singularities)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 10.5839 1.8637 5.679 3.42e-05 ***

x 0.1361 0.1119 1.215 0.241839

x = 9TRUE NA NA NA NA

x:x = 9TRUE NA NA NA NA

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.542 on 16 degrees of freedom

Multiple R-squared: 0.787, Adjusted R-squared: 0.7471

F-statistic: 19.71 on 3 and 16 DF, p-value: 1.272e-05

Just first part of model piecewise1:

> piecewise1.1 <- lm(y ~ x*(x summary(piecewise1.1)

Call:

lm(formula = y ~ x * (x |t|)

(Intercept) 10.5839 1.8637 5.679 3.42e-05 ***

x 0.1361 0.1119 1.215 0.241839

x < 9TRUE 11.6089 2.2173 5.236 8.16e-05 ***

x:x piecewise1.2 =9))

> summary(piecewise1.2)

Call:

lm(formula = y ~ x * (x >= 9))

Residuals:

Min 1Q Median 3Q Max

-2.86566 -0.99281 0.03548 0.97120 2.83641

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 22.1927 1.2013 18.475 3.24e-12 ***

x -1.2087 0.2379 -5.081 0.000111 ***

x >= 9TRUE -11.6089 2.2173 -5.236 8.16e-05 ***

x:x >= 9TRUE 1.3447 0.2629 5.115 0.000104 ***

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.542 on 16 degrees of freedom

Multiple R-squared: 0.787, Adjusted R-squared: 0.7471

F-statistic: 19.71 on 3 and 16 DF, p-value: 1.272e-05

I hope you receive it correctly now!

Thanks a lot in advance!

S.

Nathan,

It happened again. I revised my second post and my models have been altered! I don’t understand what’s going on, but I think it could be a problem of format. May I have your e-mail adress to send to you a word document with my codes?

Thanks a lot!

S.

Nathan,

It happened again. I revised my second post and my models have been altered! I don’t understand what’s going on, but I think it could be a problem of format. May I have your e-mail adress to send to you a word document with my codes?

Thanks a lot!

S.

Dear Nathan,

thank you for your post and your help.

I used the second method and i have one question. If i need the square root of each segment what do I have to do??

Thank you in advance!

A.

What do you mean by square root of each segment?

Sorry, I mean R2 (R squared) of each segment.

Ah I see. No I don’t, I’m not sure you can get an R2 for each particular segment, although the output should show you R2 for the whole model.

You can try a ‘pseudo-R2′, which would just be the square of the Pearson correlation coefficient between fitted and observed values in each segment. Or just use the Pearson correlation coefficient without squaring.

The functions above is for linear regression with only one change point, But

I really need help in fitting linear regression with more than change point in R.

The segmented package can do that.

Thanks Nathan,

I use the package segmented but I have never used it for more than one change point. The code I used before in case one change point is:

library(segmented)

O = segmented.lm(lm(y ~ x), seg.Z=~x, psi=list(x=2.61),

control=seg.control(stop.if.error=FALSE, n.boot=0, it.max=20))

But it doesn’t work in my recent case, so could you please help me with that also if I can analyze this data with different codes or packages.

This is a great post, thanks so much for the information.

I am having trouble with it, however, and was hoping you might be able to help? I have data that looks like this:

age lnfreq

18 0

20 1.386294361

21 0

22 1.098612289

24 1.098612289

26 0

28 0

29 0.693147181

30 0

When I run this in R using

lin.mod <- lm(y~x)

segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=18)

I get the following error, despite changing the psi value: 'Error in segmented.lm(lin.mod, seg.Z = ~x, psi = 18) : only 1 datum in an interval: breakpoint(s) at the boundary or too close each other'

Could you possibly tell me what I can do to correct for this? I have 20 sites and have had no trouble except at this one particular site.

Thanks so much.

Without getting too much into it, I suspect your problem is because you’re trying to set the breakpoint at the lowest x value (18). Also, looking at your data, it’s hard to see where a breakpoint might be necessary.

Hey Nathan

I was wondering if it is possible to fit y ~ bo + b1*x + b2*(x-c), where b1=-b2, using the segmented package. My calculations suggests that my data should show equal magnitude slopes in both the directions about the break point.

Hi Nathan,

Thank you for this post- its been very helpful. I am seeking your opinion about your statement “make sure you choose the one that makes the most sense for your question: do the segments need to be continuous or can they be discontinuous?”

I am working with annual population abundance estimates in a time series. We are testing how population size patterns change at a specific year of known competitor increase. To me, it makes sense to have have two separate (discontinuous) slopes- representing the population under the two scenarios (pre and post competition). In the literature, I have seen continuous piecewise regression used for similar time series and research questions, though no justification is provided and discontinuous segments are not discussed. This makes me wary that piecewise regression was applied without considering the two options.

Because the continuous piecewise regression adds the break when y is minimized, it does not seem to be the appropriate tool for testing a break at a specific year.

Does my logic make sense? Under what circumstances would it be justified to have the segments continuous for annual population estimates?

Thanks for your input, I appreciate it.

This is an interesting question and is more logical/conceptual than mathematical. I can argue both ways.

Let’s start with the continuous segments. This assumes that there is a drastic, sudden change in slope but that the initial condition at the new segment is identical to the ending condition in the first segment. In terms of your question, your population might be increasing in the pre-competition phase but then suddenly decrease or become stable in the post-competition phase. The continuous model assumes that the transition is smooth, that is to say that population size in the last year of pre-competition is the initial starting point for population size in the post-competition year. To me, this makes logical sense.

On the other hand, you’re using annual data. I don’t know what your study species is, but for many species that have multiple generations per year, you might not expect a smooth transition. Obviously with finer temporal resolution, you should see a relatively smooth transition between pre-competition and post-competition population sizes. On an annual scale, if you miss that transition period, you might see a big difference between the last pre-competition population and the first post-competition population. Essentially, you’re sampling doesn’t capture the transition. Does that make sense?

So I’m afraid I’m not helpful, because I could convince myself that either method is appropriate. It depends on your study organism and sampling design. The intuitive answer is to use the continuous model because the last pre-competition population size HAS to be the initial post-competition population size, but your sampling design may not capture that smoothness.

That was very helpful, thank you for taking the time to give your thoughts. I can also see it rationalized both ways. The annual survey data does make it tricky.

Is there a way in the segmented package to force a break year as the point where the two slopes to meet? When I run the package with the instructions above, it shifts a few years forward from the date we are testing.

that I don’t know. I don’t have too much experience with ‘segmented’.