I’ve been working on a package called `contrastable` on and off for the past year or so. The package’s goal is to provide a tidy approach to setting factor contrasts for regression analysis. While this can be done with repeated `contrasts<-` calls, this workflow is tedious when working with multiple factors and especially error-prone when manually specifying contrast matrices. In the latter case, the user needs to be careful to specify the correct fractions in the correct order with the correct signs, which can be a lot to keep track of. These issues quickly become apparent when the number of factor levels is greater than 2. In this post I will:

- Run through an example of a typical contrast coding workflow using `contrasts<-`. I will give an example of an error that can arise from a typo, and show how to diagnose what this error actually reflects by checking the hypothesis matrix.
- Show how the `contrastable` package can be used to sidestep mistakes caused by error-prone and tedious calls to `contrasts<-`.
- Briefly link to other packages on contrasts and level comparisons.

You can find the `contrastable` package on my GitHub at this repo: tsostarics/contrastable. You can install it like so:

```r
# With remotes (or equivalently, devtools)
remotes::install_github('tsostarics/contrastable')
devtools::install_github('tsostarics/contrastable')
```
Contrasts overview
Contrast coding refers to assigning numeric values to levels of a categorical variable for use in regression analyses. Depending on the numbers used, different comparisons can be made between the group means of a variable. These comparisons can correspond to particular null hypotheses that a researcher might have, and particular combinations of numbers can encode high-level questions like “Are there differences between levels when compared to a common reference level?” or “Does each level differ from the levels that came before it?” Critically, the contrasts used don’t impact the model fit but do impact the coefficient estimates that are used to make inferences about the data. For example, you might conclude that there’s an overall effect of some factor when in reality the effect (shown by the coefficient estimate) is an effect that only holds for one particular group!
Consider an example where you have two groups of listeners for whom English is their native (L1) or non-native (L2) language. You might be interested in whether reading times are slower or faster in two different syntactic conditions, such as active vs passive constructions. Two possible research questions might be whether there’s a main effect of syntax on reading times on the one hand, or whether there’s a simple effect such that reading times in the passive construction are slower only for L2 speakers.1 These are similar, but different, research questions, and more importantly, the interpretation of one coefficient depends on how the other variables are coded. Many researchers realize (or are at least starting to) that the default2 “0/1 contrasts” (aka treatment or dummy coding) will only give them the simple effect of structure, but if what you’re interested in is the main effect, then your statistical analysis has not yet answered your question! To rectify this, researchers will opt for “the +.5/-.5 contrasts” to obtain main effects.
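To make this concrete, here’s a minimal base-R sketch of my own (not from the package) showing what the intercept and slope estimate under each 2-level coding, using the same generalized-inverse computation that appears later in this post:

```r
# Sketch: what the intercept and slope estimate under 0/1 vs +.5/-.5 coding.
# Appending an intercept column of 1s and inverting the transpose gives the
# hypothesis matrix: each column holds the weights applied to the group means.
treatment  <- matrix(c(0, 1))       # 0/1 (treatment/dummy) coding
scaled_sum <- matrix(c(-0.5, 0.5))  # +.5/-.5 coding

hypothesis_matrix <- function(contrast_matrix)
  solve(t(cbind(1, contrast_matrix)))

hypothesis_matrix(treatment)   # intercept = group 1 mean; slope = group 2 - group 1
hypothesis_matrix(scaled_sum)  # intercept = grand mean; same slope, interpretable
                               # as a main effect when other predictors are centered
```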
The name for this contrast scheme is not consistent, especially in the 2-level case where the values are +.5/-.5. I’ve seen it called sum coding, simple coding, effects coding, scaled sum coding, helmert coding, difference coding, contrast coding, sum-to-zero coding, and +.5/-.5 coding. See Brehm and Alday (2022) for concerns about transparent descriptions of contrast coding schemes.
While researchers may differ on what they call +.5/-.5, for 2 levels the result is nonetheless the same. But a researcher using “helmert coding” and a researcher using “scaled sum coding” for a factor with 3 or more levels will be using very different contrast matrices, and thus addressing very different research questions about the data. Let’s use some functions from `contrastable` to look at how these contrast matrices differ.
```r
library(contrastable)

helmert_code(2) |> MASS::fractions()
     [,1]
[1,]  1/2
[2,] -1/2

scaled_sum_code(2) |> MASS::fractions()
  [,1]
1  1/2
2 -1/2

helmert_code(3) |> MASS::fractions()
     [,1] [,2]
[1,]  2/3    0
[2,] -1/3  1/2
[3,] -1/3 -1/2

scaled_sum_code(3) |> MASS::fractions()
  [,1] [,2]
1  2/3 -1/3
2 -1/3  2/3
3 -1/3 -1/3

helmert_code(5) |> MASS::fractions()
     [,1] [,2] [,3] [,4]
[1,]  4/5    0    0    0
[2,] -1/5  3/4    0    0
[3,] -1/5 -1/4  2/3    0
[4,] -1/5 -1/4 -1/3  1/2
[5,] -1/5 -1/4 -1/3 -1/2

scaled_sum_code(5) |> MASS::fractions()
  [,1] [,2] [,3] [,4]
1  4/5 -1/5 -1/5 -1/5
2 -1/5  4/5 -1/5 -1/5
3 -1/5 -1/5  4/5 -1/5
4 -1/5 -1/5 -1/5  4/5
5 -1/5 -1/5 -1/5 -1/5
```
Note that I use the term scaled sum coding for the “pairwise comparisons with main effects” contrast scheme. I opt for this term for three reasons.
- First, I see *sum coding* used more frequently in statistics and econometrics to refer to +1/-1; this is also what `contr.sum` in R returns.
- Second, the salient part of going from sum coding to scaled sum coding, especially in the 2-level case, is that there’s some kind of division or scaling operation involved; I frequently see people use `contr.sum(2)/2`, although importantly `contr.sum(3)/3` does not yield the expected result.
- Third, “simple” coding is counterintuitive to me since we’re trying to avoid “simple effects;” “effects coding” and “contrast coding” are largely meaningless, as all coding schemes encode some kind of effect, and setting any contrast matrix is an instance of contrast coding.
So, for the researcher trying to remember “I need to use those contrasts where they’re divided to get the main effects”, it (to me) seems easy to reach for a tool where scaled is in the name and is clearly distinguished from sum coding. 3
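As a quick check of that second point, here’s a small sketch of my own comparing division by the number of levels against the scaled sum pattern:

```r
# contr.sum(2)/2 happens to match scaled_sum_code(2), but the trick breaks
# for 3+ levels: the rows become 1/3 and -1/3 rather than the 2/3 / -1/3
# pattern needed for pairwise comparisons with a grand-mean intercept.
MASS::fractions(contr.sum(2) / 2)
MASS::fractions(contr.sum(3) / 3)
```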
Typical approach to contrast coding
Typically when I see people in Linguistics set contrasts, they do something like the following, using the `palmerpenguins` dataset as an example.
```r
library(dplyr)           # Data wrangling
library(purrr)           # Mapping functions
library(palmerpenguins)  # Dataset

penguins_with_contrasts <- penguins

# Default treatment/dummy coding for a 2 and 3 level factor
contrasts(penguins_with_contrasts$sex)
       male
female    0
male      1

contrasts(penguins_with_contrasts$species)
          Chinstrap Gentoo
Adelie            0      0
Chinstrap         1      0
Gentoo            0      1

# Easy enough for 2 levels, -contr.sum(2)/2 is also used a lot
contrasts(penguins_with_contrasts$sex) <- c(-.5, .5)

# Not so fun for three levels!
contrasts(penguins_with_contrasts$species) <- matrix(c(-1/3, 2/3, -1/3,
                                                       -1/3, -1/3, 2/3),
                                                     nrow = 3)
```
The chance of making a mistake increases when including more and more categorical variables. Catching these mistakes can be very difficult, in part because this workflow erases the labels in the regression output. This means you have to keep track of what `1` and `2` in the regression coefficients correspond to.

While the `dimnames` argument can be used to set the labels, anecdotally I rarely see people use this in their analyses when perusing code on the OSF. Winter (2019, 127) notes that “Using the ‘1’ after the predictor name is a notational convention for representing the slopes of sum-coded predictors in R,” but this is slightly incorrect; in the absence of `dimnames` being set, R will use the numeric indices of the contrast matrix’s columns, no matter what the scheme is.
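If you do set contrasts manually, the labels can be retained by supplying `dimnames`, since the column names of the contrast matrix become the coefficient names. A sketch using the same matrix as above:

```r
# Same scaled sum matrix as before, but with dimnames supplied so the model
# output reads speciesChinstrap-Adelie instead of species1.
contrasts(penguins_with_contrasts$species) <-
  matrix(c(-1/3, 2/3, -1/3,
           -1/3, -1/3, 2/3),
         nrow = 3,
         dimnames = list(levels(penguins_with_contrasts$species),
                         c("Chinstrap-Adelie", "Gentoo-Adelie")))
```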
Below, the two sets of coefficients represent pairwise comparisons to the Adelie
baseline, but the intercepts differ due to how the contrasts are set, with the first using treatment coding and the second using scaled sum coding. I’ll start with a case that only considers the categorical variable, but will include an additional continuous independent variable later on.
```r
# Compare the default treatment coding with the penguins dataset
# with the contrasts we specified in penguins_with_contrasts
treatment_coefs <- coef(lm(bill_length_mm ~ species,
                           data = penguins))
scaledsum_coefs <- coef(lm(bill_length_mm ~ species,
                           data = penguins_with_contrasts))

# I'm using list() to print and caption results side by side, purely aesthetic
list("(Default) Treatment Coding" = treatment_coefs,
     "(Manual) Scaled Sum Coding" = scaledsum_coefs)

$`(Default) Treatment Coding`
     (Intercept) speciesChinstrap    speciesGentoo
       38.791391        10.042433         8.713487

$`(Manual) Scaled Sum Coding`
(Intercept)    species1    species2
  45.043364   10.042433    8.713487
```
The model coefficients for the scaled sum coding show the same pairwise comparisons as the model using treatment coding, but the intercepts differ. We can check what they correspond to manually:
```r
group_means <-
  penguins |>
  dplyr::group_by(species) |>
  dplyr::summarize(mean_length = mean(bill_length_mm, na.rm = TRUE)) |>
  purrr::pluck('mean_length') |>
  `names<-`(c('Adelie', 'Chinstrap', 'Gentoo'))

list("Group means" = group_means,
     "Grand mean" = mean(group_means))

$`Group means`
   Adelie Chinstrap    Gentoo
 38.79139  48.83382  47.50488

$`Grand mean`
[1] 45.04336
```
So the intercept for the treatment coded model is the mean of the `Adelie` group, while the intercept for the scaled sum coded model is the grand mean, or the mean of the group means. But typing in the scaled sum contrast matrix was a bit obnoxious with all the `-1/3`s we typed. If we had made a slight mistake while typing the matrix out, what would have happened to our model? Would our coefficients reflect the averages and differences we were expecting? As an example, let’s see what happens when we change a `2/3` to `1/3`:
```r
# What if we accidentally typed 1/3 instead of 2/3?
contrasts(penguins_with_contrasts$species) <- matrix(c(-1/3, 1/3, -1/3,
                                                       -1/3, -1/3, 2/3),
                                                     nrow = 3)

mistake_coefs <- coef(lm(bill_length_mm ~ species,
                         data = penguins_with_contrasts))

list("(Current) Mistaken Scaled Sum Coding:" = mistake_coefs,
     "(Previous) Correct Scaled Sum Coding:" = scaledsum_coefs)

$`(Current) Mistaken Scaled Sum Coding:`
(Intercept)    species1    species2
  46.717103   15.063649    8.713487

$`(Previous) Correct Scaled Sum Coding:`
(Intercept)    species1    species2
  45.043364   10.042433    8.713487
```
Here we can see that the intercept and the value for `species1` have increased in magnitude. In particular, the new reported effect of `species1` is much larger than it previously was. If we stopped at this point, we would conclude that the difference in bill length between the Chinstrap and Adelie groups is a whopping 15mm (remember we originally calculated it to be about 10). If we were only interested in whether there was a significant positive or negative difference, we’d still reach the same conclusion, but any claims about the magnitude of the effect would be misguided. This problem opens up a related question though: what does this new inflated-in-magnitude coefficient estimate represent?
Diagnosing our mistake
To check what these numbers correspond to, we have to check the hypothesis matrix that corresponds to our contrast matrix. The process of obtaining the hypothesis matrix has been referred to as finding the generalized inverse of the contrast matrix (see Schad et al. 2020 for details).
```r
matrix(c(1, 1, 1,  # Add a column of 1s for the intercept
         -1/3, 1/3, -1/3,
         -1/3, -1/3, 2/3),
       nrow = 3,
       dimnames = list(NULL, c('Intercept', 'species1', 'species2'))) |>
  t() |>
  solve() |>
  MASS::fractions()  # This function just shows numbers as fractions

     Intercept species1 species2
[1,]       1/6     -3/2       -1
[2,]       1/2      3/2        0
[3,]       1/3        0        1
```
Here the intercept is represented by the weighted sum of the group means, where the weights are shown in the intercept column. In most cases, the intercept should reflect the grand mean, or the mean of the group means, and so would usually have equal weights (i.e., `1/3` here) for the levels. In this case, we see that the fractional weights are not the same. We can verify this by calculating the weighted mean ourselves:
```r
list("Grand Mean" = mean(group_means),
     "Weighted mean" = weighted.mean(group_means, c(1/6, 1/2, 1/3)))

$`Grand Mean`
[1] 45.04336

$`Weighted mean`
[1] 46.7171
```
Similarly, the coefficient for `species1` shows the difference between the group means of levels 1 and 2 (i.e., mean of Chinstrap minus mean of Adelie), but multiplied by a factor of `3/2`. Crucially, if our goal is to evaluate the difference between the means of these two levels, then our mistake will give us a larger estimate (~15 vs ~10). Consider a similar setup where a true difference near 0 is inflated to 5 by this kind of scaling; if we were relying on null hypothesis testing, it’s possible we’d get a significant effect when really we shouldn’t have.
```r
list("Mistaken Scaled Sum Coding" = mistake_coefs,
     "Correct Scaled Sum Coding" = scaledsum_coefs,
     "Computed Chinstrap-Adelie Difference with 3/2 scaling" =
       (3/2 * group_means[['Chinstrap']]) - (3/2 * group_means[['Adelie']]),
     "Actual Chinstrap-Adelie Difference" =
       group_means[['Chinstrap']] - group_means[['Adelie']])

$`Mistaken Scaled Sum Coding`
(Intercept)    species1    species2
  46.717103   15.063649    8.713487

$`Correct Scaled Sum Coding`
(Intercept)    species1    species2
  45.043364   10.042433    8.713487

$`Computed Chinstrap-Adelie Difference with 3/2 scaling`
[1] 15.06365

$`Actual Chinstrap-Adelie Difference`
[1] 10.04243
```
Point being: we made an honest mistake of typing `1/3` instead of `2/3`, but this had ramifications for the coefficients in our model output that we use to make inferences. In practice, because we made multiple `contrasts<-` calls, we would likely assume that what we did was correct in the absence of any errors.
Tidy approach to contrasts
Here I’ll show a different approach using the `contrastable` package. This package takes a tidy approach that handles the overhead of labels and reference levels involved in using common contrast coding schemes. Specifically, the package provides a series of functions that use a special formula implementation that assigns specific meanings to each operator. The left-hand side of the formula is the factor column whose contrasts you want to change. The right-hand side consists of (at minimum) a function to generate contrast matrices, such as `contr.treatment` or `treatment_code`. Additional operators provide extra optional functionality:

- `+ x`: Set the reference level to level `x`
- `* x`: Set the intercept to be the mean of `x`
- `- 3:4`: For polynomial contrasts only, drop trends `3` and `4`
- `| c("A-B", "A-C")`: Set the comparison labels to `A-B` and `A-C` (must be the last operator if used)
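As a sketch of how these operators compose, here’s an example of my own with a hypothetical 5-level ordered factor `grade`, assuming the `-` operator works with `contr.poly` as described in the list above:

```r
library(contrastable)

# Hypothetical ordered factor with five levels
df <- data.frame(grade = ordered(rep(1:5, times = 2)))

# Polynomial contrasts, dropping the cubic and quartic trends
df <- set_contrasts(df, grade ~ contr.poly - 3:4)
contrasts(df$grade)  # only the linear and quadratic columns remain
```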
Recall that in many cases researchers want pairwise comparisons while retaining main effects, and so the choice of reference level for the comparisons is very important. By default, R uses the first level alphabetically as the reference level, but sometimes we want to change this ourselves. Here’s an example where we set the `sex` and `species` factors to the two contrast schemes we manually set before. The `set_contrasts` function will show a message if it detects additional factor variables in the dataframe that the user did not provide contrasts for.
```r
# library(contrastable) was loaded earlier
penguins_df <-
  penguins |>
  set_contrasts(sex ~ scaled_sum_code + "male",  # Set reference level with +
                species ~ scaled_sum_code + 'Adelie')

Expect contr.treatment or contr.poly for unset factors: island
```
```r
contrasts(penguins_df$species) |> MASS::fractions()
          Chinstrap Gentoo
Adelie         -1/3   -1/3
Chinstrap       2/3   -1/3
Gentoo         -1/3    2/3

contrasts(penguins_df$sex) |> MASS::fractions()
       female
female    1/2
male     -1/2
```
`penguins_df` now has its contrasts set, and we can run our model as usual. Note that we didn’t have to type out any matrices ourselves, but we got the correct contrasts that we needed.
```r
coef(lm(bill_length_mm ~ species + bill_depth_mm, data = penguins_df))

     (Intercept) speciesChinstrap    speciesGentoo    bill_depth_mm
       20.997115         9.938955        13.403279         1.394011
```
If we wanted to change the labels to better reflect the comparisons being made, we could do that in the formula too with the `|` operator.
```r
penguins_df <-
  penguins_df |>
  set_contrasts(species ~ scaled_sum_code + 'Adelie' |
                  c('Chinstrap-Ad', 'Gentoo-Ad'))

coef(lm(bill_length_mm ~ species, data = penguins_df))

        (Intercept) speciesChinstrap-Ad    speciesGentoo-Ad
          45.043364           10.042433            8.713487
```
Additional functions
Typically when I use this package in my analyses, the `set_contrasts` function is all I really need, but there are other functions that follow the same syntax and provide other information. To avoid retyping things, I’ll usually keep the contrasts in a list assigned to a separate variable and pass that to functions.
The `glimpse_contrasts` function can show information about the factors in a dataset along with the contrast schemes that have been assigned to each factor.
```r
my_contrasts <-
  list(
    sex ~ scaled_sum_code + 'female',
    species ~ helmert_code
  )

glimpse_contrasts(penguins_df, my_contrasts) |> gt::gt()
```
| factor | n_levels | level_names | scheme | reference | intercept | orthogonal | centered | dropped_trends | explicitly_set |
|---|---|---|---|---|---|---|---|---|---|
| sex | 2 | female, male | scaled_sum_code | female | grand mean | NA | TRUE | NA | TRUE |
| species | 3 | Adelie, Chinstrap, Gentoo | helmert_code | Gentoo | grand mean | TRUE | TRUE | NA | TRUE |
| island | 3 | Biscoe, Dream, Torgersen | contr.treatment | Biscoe | mean(Biscoe) | FALSE | FALSE | NA | FALSE |
The `enlist_contrasts` function does the same thing as `set_contrasts`, but returns a list of contrast matrices that can be used in the `contrasts` argument of some model-fitting functions.4 It also provides an easy way to show the contrast matrices in an appendix or supplementary material.
```r
enlist_contrasts(penguins_df, my_contrasts) |> purrr::map(MASS::fractions)

Expect contr.treatment or contr.poly for unset factors: island

$sex
       male
female -1/2
male    1/2

$species
          >Adelie >Chinstrap
Adelie        2/3          0
Chinstrap    -1/3        1/2
Gentoo       -1/3       -1/2
```
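Since the returned list is named by factor, it can be passed straight to the `contrasts` argument of `lm` without modifying the dataframe. A sketch (I haven’t shown this exact call above, but it follows from how `lm`’s `contrasts` argument works):

```r
# Fit on the original penguins data, supplying contrasts per factor directly
model <- lm(bill_length_mm ~ species,
            data = penguins,
            contrasts = enlist_contrasts(penguins, species ~ scaled_sum_code))
coef(model)
```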
Available contrast schemes
Here are the different contrast functions this package currently provides.
```r
# = contr.treatment
treatment_code(5) |> MASS::fractions()
  2 3 4 5
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 0 0 1 0
5 0 0 0 1

# = contr.sum
sum_code(5) |> MASS::fractions()
  [,1] [,2] [,3] [,4]
1    1    0    0    0
2    0    1    0    0
3    0    0    1    0
4    0    0    0    1
5   -1   -1   -1   -1

# NOT = contr.sum, which is unscaled (see the discussion above)
scaled_sum_code(5) |> MASS::fractions()
  [,1] [,2] [,3] [,4]
1  4/5 -1/5 -1/5 -1/5
2 -1/5  4/5 -1/5 -1/5
3 -1/5 -1/5  4/5 -1/5
4 -1/5 -1/5 -1/5  4/5
5 -1/5 -1/5 -1/5 -1/5

# NOT = contr.helmert, which is unscaled
helmert_code(5) |> MASS::fractions()
     [,1] [,2] [,3] [,4]
[1,]  4/5    0    0    0
[2,] -1/5  3/4    0    0
[3,] -1/5 -1/4  2/3    0
[4,] -1/5 -1/4 -1/3  1/2
[5,] -1/5 -1/4 -1/3 -1/2

reverse_helmert_code(5) |> MASS::fractions()
     [,1] [,2] [,3] [,4]
[1,] -1/2 -1/3 -1/4 -1/5
[2,]  1/2 -1/3 -1/4 -1/5
[3,]    0  2/3 -1/4 -1/5
[4,]    0    0  3/4 -1/5
[5,]    0    0    0  4/5

forward_difference_code(5) |> MASS::fractions()
     [,1] [,2] [,3] [,4]
[1,]  4/5  3/5  2/5  1/5
[2,] -1/5  3/5  2/5  1/5
[3,] -1/5 -2/5  2/5  1/5
[4,] -1/5 -2/5 -3/5  1/5
[5,] -1/5 -2/5 -3/5 -4/5

backward_difference_code(5) |> MASS::fractions()
     [,1] [,2] [,3] [,4]
[1,] -4/5 -3/5 -2/5 -1/5
[2,]  1/5 -3/5 -2/5 -1/5
[3,]  1/5  2/5 -2/5 -1/5
[4,]  1/5  2/5  3/5 -1/5
[5,]  1/5  2/5  3/5  4/5

# = contr.poly, poly(1:n, degree = n-1, raw = FALSE)
orth_polynomial_code(5) |> MASS::fractions()
                    .L          .Q                .C             ^4
[1,]          -265/419    929/1738  -2026009/6406803     4018/33617
[2,] -2026009/6406803   -809/3027           191/302  -246481/515552
[3,]                 0 -6263/11717                 0      1042/1453
[4,] 12484830/39480499   -809/3027          -265/419 -246481/515552
[5,]           191/302    929/1738 10458821/33073696     4018/33617

# = poly(1:n, degree = n-1, raw = TRUE)
raw_polynomial_code(5) |> MASS::fractions()
     1  2   3   4
[1,] 1  1   1   1
[2,] 2  4   8  16
[3,] 3  9  27  81
[4,] 4 16  64 256
[5,] 5 25 125 625
```
Other packages and resources
This package is not the first package made for contrast coding, though to my knowledge it is the first to take a “tidy” approach to it.
The `hypr` package (Rabe et al. 2020) takes a different approach, where the focus is on considering the hypothesis matrix and declaring specifically which comparisons you want to make; the package then provides a corresponding contrast matrix. I actually like `hypr` a lot, but I find it a bit tedious when I know what the contrast matrix should look like but still have to type out the comparisons; still better than `matrix` calls though.
The `emmeans` package (Lenth 2022) is extremely useful for making pairwise comparisons, but is capable of a lot more as well. You can see its vignette on contrasts here.
The `multcomp` package (Hothorn, Bretz, and Westfall 2008) is useful for simultaneous inference, which seeks to extend workflows for multiple comparisons.
I haven’t used the `contrasts` package (O’Callaghan 2021) very much, but judging from its vignette it seems to extend the `rms` package’s `contrast` function (Harrell Jr 2022). It seems useful for calculating different comparisons after a model is run, but its usage isn’t very transparent to me at first glance.
While not a package, this page from UCLA pops up a lot when people discuss contrast coding. It’s very useful, and I used it as a starting point for implementing different contrast functions. However, I will note that I don’t follow its naming conventions.
```r
sessionInfo()

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] palmerpenguins_0.1.1 purrr_1.0.0          dplyr_1.1.0
[4] contrastable_0.1.0

loaded via a namespace (and not attached):
 [1] pillar_1.8.1      compiler_4.2.2    tools_4.2.2       digest_0.6.31
 [5] jsonlite_1.8.4    evaluate_0.20     lifecycle_1.0.3   tibble_3.1.8
 [9] gtable_0.3.1      pkgconfig_2.0.3   rlang_1.0.6       cli_3.5.0
[13] rstudioapi_0.14   yaml_2.3.7        xfun_0.37         fastmap_1.1.0
[17] withr_2.5.0       stringr_1.5.0     knitr_1.42        sass_0.4.5
[21] generics_0.1.3    vctrs_0.5.2       htmlwidgets_1.6.1 grid_4.2.2
[25] tidyselect_1.2.0  glue_1.6.2        R6_2.5.1          fansi_1.0.3
[29] rmarkdown_2.20    ggplot2_3.4.0     magrittr_2.0.3    scales_1.2.1
[33] htmltools_0.5.4   MASS_7.3-58.1     gt_0.8.0          colorspace_2.0-3
[37] utf8_1.2.2        stringi_1.7.8     munsell_0.5.0     crayon_1.5.2
```
References
Footnotes
One might also see main effects referred to as marginal effects and simple effects referred to as conditional effects. For the latter, the idea is that, for example, the “effect of syntax” is the effect of syntax given the reference level of the other categorical predictor. The main effect of syntax would be the average of the syntax effect for the L1 group and the syntax effect for the L2 group.↩︎
You can view your default contrasts using `options('contrasts')`, where the unordered default is typically `contr.treatment` and the ordered default is typically `contr.poly`. I do not recommend changing these defaults: even though you may save some time for yourself, if you forget to explicitly mention that you changed the defaults, then your analyses won’t be reproducible by the vast majority of others who have not changed the defaults.↩︎

If it were really up to me, I would call it “centered pairwise coding,” since you get pairwise comparisons with the intercept centered on the grand mean, but it doesn’t quite roll off the tongue and it’s a bit wordy to type out.↩︎
Not all modeling functions use the `contrasts` argument; `brms::brm` is one example. In these cases you must set the contrast matrix on the factor itself, hence `set_contrasts` is typically more useful for the actual modeling part, while `enlist_contrasts` is useful for showing matrices.↩︎
Citation
```bibtex
@online{sostarics2022,
  author = {Sostarics, Thomas},
  title = {Contrastable},
  date = {2022-07-13},
  url = {https://tsostaricsblog.netlify.app/posts/contrastable/},
  langid = {en}
}
```