5 Categorical Variables

While SEM was initially derived to consider only continuous variables (and indeed most applications still do), it’s often the case–especially in ecology–that the observed variables are discrete. For example: binary (yes/no, failure/success, etc.), nominal (site 1, site 2), or ordinal levels (small < medium < large). Newer advances in modeling allow for the incorporation, either directly or in a more roundabout fashion, of categorical variables into SEM.

There are two way that categorical variables could be included: as exogenous (predictors) or endogenous (responses). We will deal with the simpler case of exogenous categorical variables first, as they pose not so much of a computational issue but a conceptual one.

5.1 Introduction to Exogenous Categorical Variables

Recall that a linear regression predicting y has the following standard form:

\[y = \alpha + \beta_{1}*x_{1} + \epsilon\]

where \(\alpha\) is the intercept, \(\beta_{1}\) is the slope of the effect of \(x\) on y, and \(\epsilon\) is the residual error.

When \(x\) is continuous, the intercept \(\alpha\) is intepreted as the value of y when \(x\) = 0. All good.

For categorical factors, the intercept \(\alpha\) has a different interpretation. Consider a value of \(x\) with \(k\) levels. Since the levels of \(x\) are discrete and presumably can never assume a value of 0, \(\alpha\) is instead the mean value of y at the ‘reference’ level of \(x\). (In R, the reference level is the first level alphabetically, although this can be reset manually.) The regression coefficients \(\beta_{k}\) are therefore the effect of each other level relative to the reference level. So for \(k\) levels, there are \(k - 1\) coefficients estimated with the additional \(\alpha\) term reflecting the \(k\)th level.

Another way to think about this phenomenon is using so-called ‘dummy’ variables. Imagine each level was broken into a separate variable with a value of 0 or 1: a two-level factor with levels “a” and “b” would then become two factors “a” and “b” each with the levels 0 or 1. (In R, this would mean transposing rows as columns.)

Now imagine setting all the values of these dummy variables to 0 to estimate the intercept: this would imply the total absence of the factor, which is not a state. Another way of thinking about this is that the dummy variables are linearly dependent: if “a = 1” then by definition “b = 0” as the response variable cannot occupy the two states simultaneously. Hence the need to set one level as the reference, so that the effect of “a” can be interpreted relative to the absence of “b”, and also why you don’t recover as many coefficients as there are levels. I once heard it state that one level had to “fall on the sword” so that we can estimate the other levels.

This behavior presents a challenge for parameterizing path diagrams: there is not a single coefficient for the path from \(x\) -> \(y\), nor are there enough coefficients to populate a separate arrow for each level of \(x\) (because one level must serve as the reference).

There are a few potential solutions:

  • for binary variables, set the values as 0 or 1 and model as numeric, which would yield a single coefficient representing the expected change in \(y\) as \(x\) changes from state “0” to the other state “1.”

  • for ordinal variables, set the values depending on the order of the factor, e.g., small = 1 < medium = 2 < large = 3, and then model as numeric, which would also yield a single coefficient represented the expected change in \(x\) as you climb the ordinal ladder from smallest, to medium, and so on.

  • create dummy variables for each level: this is procedurally the same as above (splitting levels into \(k\) - 1 separate variables that have a state of or/1). The key here is not to create \(k\) variables, to avoid the issue raised above about dependence among levels. This is the default behavior of lavaan.

This approach becomes prohibitive with large number of categories or levels and can greatly increase model complexity. Moreover, each level is treated as an independent variable in the tests of directed separation, and thus will inflate the degrees of freedom in a piecewise application.

  • for suspected interactions with categorical variables, a multigroup analysis is required. In this case, the same model is fit for each level of the factor, with potentially different coefficients (see the following chapter on Multigroup Modeling).

  • test for the effect of the categorical variable using ANOVA, but do not report a coefficient. This approach would indicate whether a factor is important (i.e., whether the levels significantly differ with respect to the response), but omits important information about which levels and the direction and magnitude of change. For example, does a significant treatment effect imply an increase or decrease in the response, and by how much? For this reason, such an approach is valid but not ideal.

A alternate approach draws on this final point and involves testing and reporting the model-estimated or marginal means.

5.2 Exogenous Categorical Variables as Marginal Means

All models can be used for prediction. In multiple regression, the predicted values of one variable are often computed while holding the values of other variables at their mean. Marginal means are the averages of these predictions. In other words, they are the expected average value of one predictor given the other co-variables in the model.

For categorical variables, marginal means are particularly useful because they provide an estimated mean for each level of each factor.

Consider a simple example with a single response and two groups “a” and “b”:

set.seed(111)

dat <- data.frame(y = runif(100), group = letters[1:2])

model <- lm(y ~ group, dat)

summary(model)
## 
## Call:
## lm(formula = y ~ group, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48473 -0.21466 -0.01238  0.19715  0.54995 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.44677    0.03871  11.541   <2e-16 ***
## groupb       0.08551    0.05475   1.562    0.122    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2737 on 98 degrees of freedom
## Multiple R-squared:  0.02429,    Adjusted R-squared:  0.01433 
## F-statistic:  2.44 on 1 and 98 DF,  p-value: 0.1215

Note that the summary output gives a simple coefficient, which is the effect of group “b” on y in the absence of group “a”. As we established above, the intercept is simply the average of y in group “a”:

summary(model)$coefficients[1, 1]
## [1] 0.4467679
mean(subset(dat, group == "a")$y)
## [1] 0.4467679

The marginal means are the expected average value of y in group “a” AND group “b”.

predict(model, data.frame(group = "a"))
##         1 
## 0.4467679
predict(model, data.frame(group = "b"))
##       1 
## 0.53228

Because this is a simple linear regression, these values are simply the means of the two subsets of the data, because they are not controlling for any other covariates:

mean(subset(dat, group == "a")$y)
## [1] 0.4467679
mean(subset(dat, group == "b")$y)
## [1] 0.53228

Let’s see what happens we add a continuous covariate:

dat$x <- runif(100)

model <- update(model, . ~ . + x)

Here, the marginal mean must be evaluated while holding the covariate \(x\) at its mean value:

predict(model, data.frame(group = "a", x = mean(dat$x)))
##         1 
## 0.4450597
mean(subset(dat, group == "a")$y)
## [1] 0.4467679

You’ll note that this value is now different than the mean of the subset of the data because, again, it controls for the effect of \(x\) on \(y\).

This procedure gets increasingly complicated with both the number of factor levels and the number of covariates. The emmeans package provides an easy way to compute marginal means:

library(emmeans)

emmeans(model, specs = "group") # where specs is the variable or list of variables whose means are to be estimated
##  group emmean     SE df lower.CL upper.CL
##  a      0.445 0.0389 97    0.368    0.522
##  b      0.534 0.0389 97    0.457    0.611
## 
## Confidence level used: 0.95

You’ll note that the output value for group “a” gives the same as using the predict function above, but also returns the marginal mean for group “b” while also controlling for \(x\):

predict(model, data.frame(group = "b", x = mean(dat$x)))
##         1 
## 0.5339882

and so is a handy wrapper for complex models.

The emmeans function goes onto to provide lower and upper confidence intervals, which provides an additional level of information, namely whether each mean differs significantly from zero. Coupled with ANOVA test for differences among categories, the marginal means provide key information that is otherwise lacking, namely whether and how the response value changes based on the factor level. It is import

The emmeans package provides additional functionality by conducting post-hoc tests of differences among the means of each factor level:

emmeans(model, list(pairwise ~ group))
## $`emmeans of group`
##  group emmean     SE df lower.CL upper.CL
##  a      0.445 0.0389 97    0.368    0.522
##  b      0.534 0.0389 97    0.457    0.611
## 
## Confidence level used: 0.95 
## 
## $`pairwise differences of group`
##  1     estimate     SE df t.ratio p.value
##  a - b  -0.0889 0.0552 97  -1.611  0.1105

You’ll note a second output which is the pairwise contrast between the means of groups “a” and “b” with an associated significance test.

These pairwise Tukey tests provide the final level of information, which is whether the response in each level varies significantly from the other levels.

To adapt this to SEM, the coefs function in piecewiseSEM adopts a two-tiered approach by first computing the significance of the categorical variable using ANOVA, and then reports the marginal means and post-hoc tests:

library(piecewiseSEM)

coefs(model)
##   Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate    
## 1        y         x   0.0567     0.093 97     0.6094  0.5437       0.0613    
## 2        y     group        -         -  1     2.5946  0.1105            -    
## 3        y group = a   0.4451    0.0389 97    11.4300  0.0000            - ***
## 4        y group = b    0.534    0.0389 97    13.7139  0.0000            - ***

In this output, we retrieve the normal output for the continuous \(x\) including a standardized effect size. The significance test from the ANOVA is reported in the row corresponding to the group effect, and below that are the marginal means for each level of the grouping factor. Note that there are no standardized estimates for either the ANOVA effect or the marginal means, because as we have established, these are not linear coefficients and therefore cannot be standardized as usual.

This solution provides a measure of whether the path between the exogenous categorical variable and the response is significant as well as parameters for each level in the form of the model-estimated marginal means.

5.3 Exogenous Categorical Variables as Marginal Means: A Worked Example

Let’s consider an example from Bowen et al. (2017). In this study, the authors were interested in how different microbiomes of the salt marsh plant Phragmites australis drive ecosystem functioning, and ultimately the production of aboveground biomass. In this case, they considered three microbial communities: those from a native North American lineage, from Gulf Coast lineage, and an introduced lineage. There were additional genotypes within each community type, necessitating the application of random effects to account for intraspecific variation.

We will fit a simplified version of their full path diagram, focusing only on aboveground biomass (although they test the effect on belowground biomass in their study as well).

In this case, the variable “Phragmites status” corresponds to the three community types and can’t be represented using a single coefficient. Thus, the marginal-means approach is ideal to elucidate the effect of each community type on both proximate and ultimate ecosystem properties while testing the overall significance of this path.

Let’s read in the data and construct the model:

bowen <- read.csv("https://raw.githubusercontent.com/jslefche/sem_book/master/data/bowen.csv")

bowen <- na.omit(bowen)

library(nlme)

bowen_sem <- psem(
  lme(observed_otus ~ status, random = ~1|Genotype, data = bowen, method = "ML"),
  lme(RNA.DNA ~ status + observed_otus, random = ~1|Genotype, data = bowen, method = "ML"), 
  lme(below.C ~ observed_otus + status, random = ~1|Genotype, data = bowen, method = "ML"), 
  lme(abovebiomass_g ~ RNA.DNA + observed_otus + belowCN + status, random = ~1|Genotype, data = bowen, method = "ML"),
  data = bowen
)

And let’s retrieve the output:

summary(bowen_sem, .progressBar = FALSE)
## Warning: Categorical or non-linear variables detected. Please refer to
## documentation for interpretation of Estimates!
## 
## Structural Equation Model of bowen_sem 
## 
## Call:
##   observed_otus ~ status
##   RNA.DNA ~ status + observed_otus
##   below.C ~ observed_otus + status
##   abovebiomass_g ~ RNA.DNA + observed_otus + belowCN + status
## 
##     AIC
##  1110.030
## 
## ---
## Tests of directed separation:
## 
##                   Independ.Claim Test.Type DF Crit.Value P.Value   
##    observed_otus ~ belowCN + ...      coef 57     1.4405  0.1552   
##          RNA.DNA ~ belowCN + ...      coef 56     1.3740  0.1749   
##          below.C ~ belowCN + ...      coef 56     2.7225  0.0086 **
##          below.C ~ RNA.DNA + ...      coef 56    -0.1762  0.8607   
##   abovebiomass_g ~ below.C + ...      coef 54    -0.4540  0.6516   
## 
## --
## Global goodness-of-fit:
## 
## Chi-Squared = 11.484 with P-value = 0.043 and on 5 degrees of freedom
## Fisher's C = 17.877 with P-value = 0.057 and on 10 degrees of freedom
## 
## ---
## Coefficients:
## 
##         Response           Predictor  Estimate Std.Error DF Crit.Value P.Value
##    observed_otus              status         -         -  2     5.9589  0.0508
##    observed_otus     status = native 2258.6564  105.8178 12    21.3448  0.0000
##    observed_otus status = introduced   2535.07  131.2216 14    19.3190  0.0000
##    observed_otus   status = invasive 2541.4984   56.6887 12    44.8326  0.0000
##          RNA.DNA       observed_otus         0         0 57     2.1583  0.0351
##          RNA.DNA              status         -         -  2     9.3480  0.0093
##          RNA.DNA   status = invasive    0.7112    0.0118 12    60.4511  0.0000
##          RNA.DNA status = introduced    0.7305    0.0264 14    27.6880  0.0000
##          RNA.DNA     status = native    0.7844    0.0216 12    36.2988  0.0000
##          below.C       observed_otus     9e-04     3e-04 57     2.6546  0.0103
##          below.C              status         -         -  2    17.1995  0.0002
##          below.C status = introduced   42.6366    0.3673 14   116.0774  0.0000
##          below.C   status = invasive   43.4004    0.1594 12   272.2299  0.0000
##          below.C     status = native   44.4975    0.3056 12   145.6071  0.0000
##   abovebiomass_g             RNA.DNA   -1.8517    1.8893 55    -0.9801  0.3313
##   abovebiomass_g       observed_otus    -2e-04     2e-04 55    -0.7521  0.4552
##   abovebiomass_g             belowCN     0.005    0.0041 55     1.2026  0.2343
##   abovebiomass_g              status         -         -  2    12.9519  0.0015
##   abovebiomass_g     status = native    1.7489    0.2206 12     7.9272  0.0000
##   abovebiomass_g   status = invasive    1.8807    0.1041 12    18.0629  0.0000
##   abovebiomass_g status = introduced    2.7024     0.232 14    11.6457  0.0000
##   Std.Estimate    
##              -    
##              - ***
##              - ***
##              - ***
##          0.135   *
##              -  **
##              - ***
##              - ***
##              - ***
##         0.2913   *
##              - ***
##              - ***
##              - ***
##              - ***
##        -0.1428    
##        -0.0905    
##         0.1356    
##              -  **
##              - ***
##              - ***
##              - ***
## 
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05
## 
## ---
## Individual R-squared:
## 
##         Response method Marginal Conditional
##    observed_otus   none     0.10        0.19
##          RNA.DNA   none     0.31        0.81
##          below.C   none     0.26        0.32
##   abovebiomass_g   none     0.22        0.28

In this case, it appears that the model fits the data well enough based on the Fisher’s C statistic (P = 0.057), which is what the original authors used. Note that the likelihood-based \(\chi^2\) statistic actually implies poor fit (P = 0.043). In this case, examining the d-sep tests suggests the inclusion of the path from \(belowCN\) to \(below.C\) might improve the fit. However, we are applying a tool that was not available to the original authors, so we will proceed as they did and assume adequate fit.

The linkage between microbial community type (\(status\)) and richness (\(observed_otus\)) is non-significant, but the other paths are significant. Examination of the marginal means indicates microbial activity (\(RNA.DNA\)) and belowground carbon (\(below.C\)) are generally highest in Phragmites with native microbial communities based on the post-hoc tests. However, none of these properties appear to influence the ultimate production of biomass (\(abovebiomass_g\)). Rather, that property appears to be entirely controlled by the plant microbiome type (\(status\)): those with the introduced microbial community have significantly higher aboveground biomass based on their marginal mean after controlling for microbial activity and soil nutrients.

Thus, despite a multi-level categorical predictor (\(status\)), the two-step procedure of ANOVA and calculation of marginal means allows for a more mechanistic understanding of the drivers of plant biomass in this species.

5.4 Endogenous Categorical Variables

Endogenous categorical variables are far trickier, and at the moment, are not implemented in piecewiseSEM.

In the case of endogenous categorical variables in a piecewise framework, there are really only two solutions:

  • for binary variables, set the values as 0 or 1 and model as numeric, which would yield a single coefficient.

  • for ordinal variables, set the values depending on the order of the factor, e.g., small = 1 < medium = 2 < large = 3, and then model as numeric, which would yield a single coefficient.

Nominal variables (i.e., levels are not ordered) could be modeled using multinomial regression, although this method would have to be executed by hand. An alternative is to use the factor levels to construct a composite variable, the subject of a later chapter. lavaan provides a robust alternative in the form of confirmatory factor analysis (see http://lavaan.ugent.be/tutorial/cat.html). In piecewiseSEM, composites must be constructed by hand, although this procedure is not hugely prohibitive.

5.5 References

Bowen, J. L., Kearns, P. J., Byrnes, J. E., Wigginton, S., Allen, W. J., Greenwood, M., … & Meyerson, L. A. (2017). Lineage overwhelms environmental conditions in determining rhizosphere bacterial community structure in a cosmopolitan invasive plant. Nature communications, 8(1), 433.