Various commands in R accept a notation called a model formula, or simply a formula. The simplest form of a formula is

y ~ x

where x and y are two variables. You can read this as ‘y is explained by x’. The dependent (response) variable goes to the left of the tilde ‘~’, and the explanatory (independent) variables go to the right. This formula roughly corresponds to the linear equation,

y = ax + b
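
As a minimal sketch of this correspondence, the example below fits the formula y ~ x with `lm()` on simulated data. The data, the seed, and the chosen slope and intercept are assumptions made purely for illustration.

```r
# Simulated data: the true slope (2) and intercept (1) are arbitrary choices.
set.seed(1)
x <- 1:20
y <- 2 * x + 1 + rnorm(20, sd = 0.5)

# lm() reads the formula y ~ x and estimates b (intercept) and a (slope).
fit <- lm(y ~ x)
coefs <- coef(fit)   # named vector: "(Intercept)" is b, "x" is a
print(coefs)
```

The estimate named `(Intercept)` plays the role of b, and the estimate named `x` plays the role of a.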

The interpretation is slightly different if the variables are categorical. Note that the intercept, b, is implicit in the model formula. If you like, you can make it explicit by adding + 1. If you want to exclude it, e.g., to force the regression line through the origin, add - 1 instead. If you have multiple explanatory variables, you can include them using the same notation. For example, if you had two explanatory variables x1 and x2, you can specify the model like this:

y ~ x1 + x2

The linear equation that corresponds to this notation is y = a1x1 + a2x2 + b.
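
The intercept handling can be checked directly. The sketch below, on simulated data (the coefficients and seed are illustrative assumptions), fits y on x1 and x2 with an implicit intercept, an explicit one, and none.

```r
# Simulated data: coefficients 3, -2 and intercept 5 are arbitrary choices.
set.seed(2)
x1 <- runif(30)
x2 <- runif(30)
y  <- 3 * x1 - 2 * x2 + 5 + rnorm(30, sd = 0.1)

fit_implicit <- lm(y ~ x1 + x2)       # intercept included implicitly
fit_explicit <- lm(y ~ x1 + x2 + 1)   # same model, intercept spelled out
fit_origin   <- lm(y ~ x1 + x2 - 1)   # intercept removed

names(coef(fit_implicit))  # "(Intercept)" "x1" "x2"
names(coef(fit_origin))    # "x1" "x2"
```

The first two fits give identical coefficients. The third has no `(Intercept)` entry, so the fitted plane is forced through the origin.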

As you may have figured out already, arithmetic operators such as + and - have different meanings in a formula. So, if the variable you are interested in is an arithmetic combination of R variables, you need a special notation. For example, you might be interested in fitting a linear model where y is explained by the sum of x1 and x2. That is, the equation you want to describe is y = a × (x1 + x2) + b. In such cases you need to use a special function, I(), to protect the arithmetic operation from being interpreted as part of the formula. In the case of our example, the correct formula notation is

y ~ I(x1 + x2)
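
To see what I() changes, the sketch below compares the formula y ~ I(x1 + x2) with y ~ x1 + x2 on simulated data. The data and the true slope of 1.5 are assumptions for illustration only.

```r
# Simulated data in which y really depends on the sum x1 + x2.
set.seed(3)
x1 <- rnorm(25)
x2 <- rnorm(25)
y  <- 1.5 * (x1 + x2) + 4 + rnorm(25, sd = 0.1)

fit_sum <- lm(y ~ I(x1 + x2))   # one slope for the sum of x1 and x2
fit_sep <- lm(y ~ x1 + x2)      # separate slopes for x1 and x2

names(coef(fit_sum))  # "(Intercept)" "I(x1 + x2)"
names(coef(fit_sep))  # "(Intercept)" "x1" "x2"
```

With I(), the model has a single slope a for the sum. Without it, + is read as "add another term", and each variable gets its own slope.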

If your explanatory variables are categorical, as in ANOVA, you may want to fit a model in which the interaction of the variables is important. An interaction is expressed in a formula as a term in which the variable names are joined by colon(s). For example, the formula

y ~ x1 + x2 + x1:x2

expresses a model in which the interaction of x1 and x2 is also included in the fit. For two variables, there is only one possible interaction. If you have many variables and want to include all interaction terms, it may be a hassle to type them all separately. For example, the interactions of three variables x1, x2 and x3 consist of the two-way interactions x1:x2, x1:x3, x2:x3 and the three-way interaction x1:x2:x3. To include all interactions, you can use ‘*’ instead of ‘+’. For example, to include three variables and all their interactions in a model formula, we simply type y ~ x1 * x2 * x3.
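
One way to confirm how ‘*’ expands is to ask R for the term labels of a formula. The sketch below uses `terms()`, which works on a bare formula without any data. The variable names `f_star` and `f_long` are illustrative.

```r
# Short form: three variables and all their interactions.
f_star <- y ~ x1 * x2 * x3
labels_star <- attr(terms(f_star), "term.labels")
print(labels_star)   # three main effects, three two-way terms, one three-way term

# Long form: every term typed out by hand.
f_long <- y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3 + x1:x2:x3
labels_long <- attr(terms(f_long), "term.labels")

# Both formulas describe the same set of seven terms.
setequal(labels_star, labels_long)
```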

The usual mathematical operators do not do what you may think. Here are a few different possibilities that will suffice for these notes.

Suppose the variables are generically named `Y`, `X1`, and `X2`.

| formula | meaning |
|---|---|
| `Y ~ X1` | Y is modeled by X1 |
| `Y ~ X1 + X2` | Y is modeled by X1 and X2, as in multiple regression |
| `Y ~ X1 * X2` | Y is modeled by X1, X2 and the interaction X1:X2 |
| `Y ~ (X1 + X2)^2` | Y is modeled by X1, X2 and their two-way interactions. Note that `^` is not the usual power |
| `Y ~ X1 + I(X2^2)` | Y is modeled by X1 and X2^2 (the usual square of X2) |
| `Y ~ X1 \| X2` | Y is modeled by X1 conditioned on X2 |
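
Two of these rows can be checked with a short sketch. The helper `labels_of()` and the toy data frame `d` are hypothetical names introduced here for illustration. The `|` row is left out because its handling depends on the function that receives the formula.

```r
# Hypothetical helper: list the terms a formula expands to.
labels_of <- function(f) attr(terms(f), "term.labels")

labels_of(Y ~ X1 * X2)        # "X1" "X2" "X1:X2"
labels_of(Y ~ (X1 + X2)^2)    # "X1" "X2" "X1:X2", so ^2 means "up to two-way interactions"

# Inside I(), ^ is ordinary arithmetic: the design matrix holds X2 squared.
d  <- data.frame(X1 = c(1, 2, 3, 4), X2 = c(2, 3, 5, 7))
mm <- model.matrix(~ X1 + I(X2^2), data = d)
mm[, "I(X2^2)"]               # 4 9 25 49
```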

Q:

How should I define a model formula in R when one (or more) exact linear restrictions binding the coefficients are available?

Equation: y = b1*x1 + b2*x1

where y = b1*x1 for t < t1 and y = b2*x1 for t > t1

A:

Just create two new vectors:

```
x2 <- ifelse(t<t1, x1, 0)
x3 <- ifelse(t<t1, 0, x1)
```

Now you can simply fit `y ~ x2 + x3 - 1`.
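
For a quick check of this approach, here is a minimal sketch on simulated data. The breakpoint `t1 = 20`, the true slopes 2 and 5, and the noise level are assumptions chosen only to make the example runnable.

```r
# Simulated data: slope 2 before the breakpoint t1, slope 5 from t1 onward.
set.seed(4)
t  <- 1:40
t1 <- 20
x1 <- runif(40, 1, 10)
y  <- ifelse(t < t1, 2 * x1, 5 * x1) + rnorm(40, sd = 0.1)

# Split x1 into two regressors, each zero outside its own regime.
# Note: observations with t == t1 fall into the second regime here.
x2 <- ifelse(t < t1, x1, 0)
x3 <- ifelse(t < t1, 0, x1)

# No intercept, matching the equation in the question.
fit <- lm(y ~ x2 + x3 - 1)
coef(fit)   # the x2 entry estimates b1, the x3 entry estimates b2
```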