Structural equation modeling (SEM) is among the fastest growing statistical techniques in ecology and evolution, and provides a new way to explore and quantify ecological systems. SEM unites multiple variables in a single causal network, thereby allowing simultaneous tests of multiple hypotheses. The idea of causality is central to SEM as the technique implicitly assumes that the relationships among variables represent causal links. Because variables can be both predictors and responses, SEM is also a useful tool for quantifying both direct and indirect (cascading) effects.

Piecewise SEM (or confirmatory path analysis) expands upon traditional SEM by introducing a flexible mathematical framework that can incorporate a wide variety of model structures, distributions, and assumptions. These include: interactions and non-normal responses, random effects and hierarchical models, and alternate correlation structures (including phylogenetic, spatial, and temporal).

This release is version 2.0 of the package and contains substantial updates to both the syntax and the underlying calculations. All functions have been replaced and rewritten from the ground up.

The first part of this vignette will briefly introduce the concepts behind piecewise SEM. The second part will introduce the new syntax using a worked example. The final part will briefly compare the old and new versions of the package.

## 1. An Introduction to Structural Equation Modeling

Broadly, structural equation modeling (SEM) unites a suite of variables in a single network. They are generally presented using box-and-arrow diagrams denoting directed (causal) relationships among variables:

Those variables that exist only as predictors in the network are referred to as exogenous, and those that are predicted (at any point) as endogenous. Exogenous variables therefore only ever have arrows coming out of them, while endogenous arrows have arrows coming into them (which does not preclude them from having arrows come out of them as well). This vocabulary is important when considering some special cases later.

In traditional SEM, the relationships among variables (i.e., their linear coefficients) are estimated simultaneously in a single variance-covariance matrix. This approach is well developed but can be computationally intensive (depending on the sizes of the v-cov matrix) and additionally assumes independence and normality of errors, two assumptions that are generally violated in ecological research.

Piecewise structural equation modeling (SEM), also called confirmatory path analysis, was proposed in the early 2000s by Bill Shipley as an alternate approach to traditional variance-covariance based SEM. In piecewise SEM, each set of relationships is estimated independently (or locally). This proces decomposes the network into the corresponding simple or multiple linear regressions for each response, each of which are evaluated separately, and then combined later to generate inferences about the entire SEM. This approach has two consequences: 1. Increasingly large networks can be estimated with ease compared to a single vcov matrix (because the approach is modularized), and 2. Specific assumptions about the distribution and covariance of the responses can be addressed using typical extensions of linear regression, such as fixed covariance structures, random effects, and other sophisticated modeling techniques.

Unlike traditional SEM, which uses a $$\chi^2$$ test to compare the observed and predicted covariance matrices, the goodness-of-fit of a piecewise structural equation model is obtained using ‘tests of directed separation.’ These tests evaluate the assumption that the specific causal structure reflects the data. This is accomplished by deriving the ‘basis set,’ which is the smallest set of independence claims obtained from the SEM. These claims are relationships that are unspecified in the model, in other words paths that could have been included but were omitted because they were deemed to be biologically or mechanistically insigificant. The tests ask whether these relationships can truly be considered independent (i.e., their association is not statistically significant within some threshold of acceptable error, typically $$\alpha$$=0.05) or whether some causal relationship may exist as indicated by the data.

For instance, the preceding example SEM contains 4 specified paths (solid, black) and 2 unspecified paths (dashed, red), the latter of which constitute the basis set:

In this case, there are two relationships that need to be evaluated: y3 and x1, and y3 and y2. However, there are additional influences on y3, specifically the directed path from y2. Thus, the claims need to be evaluated for ‘conditional independence,’ i.e. that the two variables are independent conditional on the already specified influences on both of them. This also pertains to the predictors of y2, including the potential contributions of x1. So the full claim would be: y2 | y3 (y1, x1), with the claim of interest separated by the | bar and the conditioning variable(s) following in parentheses.

As the network grows more complex, however, the independence claims only consider variables that are immediately ancestral to the primary claim (i.e., the parent nodes). For example, if there was another variable predicting x1, it would not be considered in the independence claim between y3 and y2 since it is >1 node away in the network.

The independence claims are evaluated by fitting a regression between the two variables of interest with any conditioning variables included as covariates. Thus, the claim above y2 | y3 (y1, x1) would be modeled as y3 ~ y2 + y1 + x1 . These regressions are constructed using the same assumptions about y3 as specified in the actual structural equation model. So, for instance, if y3 is a hierarchically sampled variable predicted by y1, then same hierarchical structure would carry over to the test of directed separation of y3 predicted by y2.

The P-values of the conditional independence tests are then combined in a single Fisher’s C statistic using the following equation:

$C = -2\sum_{i=1}^{k}ln(p_{i})$

This statistic is $$\chi^2$$-distributed with 2k degrees of freedom, with k being the number of independence claims in the basis set.

Shipley (2013) also showed that the the C statistic can be used to compute an AIC score for the SEM, so that nested comparisons can be made in a model selection framework:

$AIC = C + 2K$

where K is the likelihood degrees of freedom. A further variant, $$AIC_c$$, can be obtained by adding an additional penalty based on sample size:

$AIC_c = C + 2K\frac{n}{(n - K - 1)}$

The piecewiseSEM package automates the derivation of the basis set and the tests of directed separation, as well as extraction of path coefficients based on the user-specified input.

## 2. An Example using piecewiseSEM

### 2.1 Worked example

Let’s make up some fake data corresponding to the path diagram above:

dat <- data.frame(x1 = runif(50), y1 = runif(50), y2 = runif(50), y3 = runif(50))

And we will use piecewiseSEM to fit the model. The primary function is psem and we supply the regressions corresponding to the relationships specified in the path diagram, separated by commas as in a list:

# Load required libraries
library(piecewiseSEM)
##
##   This is piecewiseSEM version 2.0.
##
##   If you have used the package before, it is strongly recommended you read Section 3 of the vignette('piecewiseSEM') to familiarize yourself with the new syntax.
##
##   Questions or bugs can be addressed to <jlefcheck@bigelow.org>
model <- psem(lm(y1 ~ x1, dat), lm(y1 ~ y2, dat), lm(y2 ~ x1, dat), lm(y3 ~ y1, dat))
## Error: Duplicate responses detected in the model list. Collapse into single multiple regression!

You’ll note that this formulation produces an error because we have incorrectly broken down the component regressions. A common mistake is to list each path separately, but the proper specification is to collapse multiple pathways into a single multiple regression if the response is the same. Thus, the properly specified SEM becomes:

model <- psem(lm(y1 ~ x1 + y2, dat), lm(y2 ~ x1, dat), lm(y3 ~ y1, dat))

To evaluate the model, we call summary on the psem object.

summary(model, .progressBar = F)
## $name ## [1] "model" ## ##$call
## [1] "y1 ~ x1 + y2\n  y2 ~ x1\n  y3 ~ y1"
##
## $dTable ## Independ.Claim Estimate Std.Error DF Crit.Value P.Value ## 1 y3 ~ x1 + ... 0.0220 0.1414 47 0.1557 0.8769 ## 2 y3 ~ y2 + ... -0.2119 0.1431 46 -1.4807 0.1455 ## ##$Cstat
##   Fisher.C df P.Value
## 1    4.118  4    0.39
##
## $IC ## AIC AICc BIC K n ## 1 24.118 30.921 43.238 10 50 ## ##$coefficients
##   Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1       y1        x1  -0.0061    0.1297 47    -0.0473  0.9625      -0.0064
## 2       y1        y2   0.3511    0.1226 47     2.8635  0.0062       0.3854
## 3       y2        x1  -0.0088    0.1527 48    -0.0575  0.9544      -0.0083
## 4       y3        y1   0.2665    0.1452 48     1.8350  0.0727       0.2560
##
## 1
## 2 **
## 3
## 4
##
## $R2 ## Response family link method R.squared ## 1 y1 gaussian identity none 0.15 ## 2 y2 gaussian identity none 0.00 ## 3 y3 gaussian identity none 0.07 ## ## attr(,"class") ## [1] "summary.psem" The output should be familiar to anyone who has evaluated a linear model in R previously. It shows the call of the model (the component equations), the AIC and BIC scores (derived from that C statistic), and then the tests of directed separation. The last column of the table reports the P-values, which are summarized using the above equation to yield the global goodness-of-fit below. The next table reports the path coefficients, including the standardized values (scaled by standardized deviations). Finally, the individual R-squared values of each regression is given, to aid in evaluation of the model fit. ### 2.2 Standardized coefficients Standardization of model coefficients is useful for making comparisons about the relative strengths of predictors in a multiple regression. In a network approach, it allows effects to also be compared across multiple responses. Standardized coefficients are also necessary for the calculation of indirect and total effects (because predictors may occur on wildly different scales). Because of their utility, a variety of approaches to coefficient standardization have been proposed, many of which are implemented here. #### 2.2.1 The null example: no standardization Let’s first create an example dataset: # Create fake data.frame coefs.data <- data.frame( y = runif(100), x1 = runif(100), x2 = runif(100) ) # Evaluate linear model model <- lm(y ~ x1, coefs.data) The function for returning coefficients in piecewiseSEM is coefs. While standardization is useful, it may be unnecessary or unwanted in certain circumstances. In these cases, one can specify standardize = "none" as an argument to coefs, which will return the raw (unstandardized) coefficients. coefs(model, standardize = "none") ## Response Predictor Estimate Std.Error DF Crit.Value P.Value ## 1 y x1 9e-04 0.1054 98 0.009 0.9928 # These coefficients are identical to those returned by summmary summary(model)$coefficients
##                 Estimate Std. Error     t value    Pr(>|t|)
## (Intercept) 0.4876552766 0.05572766 8.750686922 6.19140e-14
## x1          0.0009478366 0.10542595 0.008990544 9.92845e-01

To return the intercepts using coefs, specify the intercepts = TRUE argument:

coefs(model, standardize = "none", intercepts = TRUE)
##   Response   Predictor Estimate Std.Error DF Crit.Value P.Value
## 1        y (Intercept)   0.4877    0.0557 98     8.7507  0.0000 ***
## 2        y          x1   0.0009    0.1054 98     0.0090  0.9928

#### 2.2.2 Standardization: scaling by standard deviations

The most typical implementation of standardization is placing the coefficients in units of standard deviations of the mean. This is accomplished by scaling the coefficients $$\beta$$ by the ratio of the standard deviation of x over the standard deviation of y:

$\beta_{std} = \beta*\left( \frac{sd_x}{sd_y} \right)$ We can do this manually for our example dataset:

# Obtain the raw coefficient from the coefficient table
B <- summary(model)$coefficients[2, 1] # Compute the standard deviation of the independent variable sd.x <- sd(coefs.data$x1)

# Compute the standard deviation of the dependent variable
sd.y <- sd(coefs.data$y) # Scale Beta B.sdscaled <- B * sd.x/sd.y The argument for standardization based on standard deviations is standardize = "scale": coefs(model, standardize = "scale") ## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate ## 1 y x1 9e-04 0.1054 98 0.009 0.9928 9e-04 ## ## 1 # Compare to hand-calculated value B.sdscaled ## [1] 0.0009081817 Note that the coefs function will return both standardized and unstandardized coefficients when specifying any argument other than standardize = "none". #### 2.2.3 Standardization: scaling by relevant ranges Instead of scaling by the ratio of the standard deviations, one can scale by the ‘relevant’ range of the data. The default range standardization considers the full range of variation exhibited by the data. We can again compute the range standardization by hand: # Calculate range for the independent variable range.x <- diff(range(coefs.data$x1))

# Calculate range for the independent variable
range.y <- diff(range(coefs.data$y)) # Scale Beta B.range <- B * range.x/range.y The argument for standardization based on ranges is standardize = "range": coefs(model, standardize = "range") ## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate ## 1 y x1 9e-04 0.1054 98 0.009 0.9928 9e-04 ## ## 1 # Compare to hand-calculated value B.range ## [1] 0.0009250381 If one does not wish to use the full range of the data, and instead restrict the range to a more ‘relevant’ subset of the data, this can be accomplished by providing a named list to the standardize = argument. The names should be x and y and each entry should be a vector of length 2 with the first entry being the minimum and the second entry being the maximum values of the relevant range. # Consider the range 0.1 - 0.8 for x relrange.x <- 0.8-0.1 # Consider the range 0.3-0.6 for y relrange.y <- 0.6-0.3 # Scale Beta B.relrange <- B * relrange.x/relrange.y # Compare to automated calculation coefs(model, standardize = list(x1 = c(0.1, 0.8), y = c(0.3, 0.6))) ## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate ## 1 y x1 9e-04 0.1054 98 0.009 0.9928 0.0022 ## ## 1 B.relrange ## [1] 0.002211619 #### 2.2.4 Standardization for Binary Response Models For generalized linear models, the response variables are inherently non-linear. The common solution is to apply a link function to linearize the response, and derive the parameter estimate $$\beta$$ on this new linear (link) scale. Common link functions include logit and probit for binary responses, and log for Poisson, although many others exist. This transformation, however, means that the modeled response is not actually observed, and thus the true error is not known. Obtaining the standard deviation of the response used in the calculation of standardized coefficients requires further assumptions about the distribution-specific variance. We implement two solutions for obtaining the error variance to produce standardized coefficients for GLM, focusing for the moment on binary response models: the latent theoretic (standardize.type = "latent.linear") and the observation error approach (standardize.type = "Menard.OE"). In the latent theoretic approach, we assume a fixed error variance for the binomial distribution based on the link function: for logit, it is $$\pi$$^2/3 and for the probit, it is 1. This value is added to the predicted fits on the linear (link) scale and then the square-root of the variance of these values is taken to obtain the standard deviation of y. In the observation error approach, a rough estimate of the error variance is obtained through the calculation of a pseudo-R^2, which is simply the correlation between the observed and predicted fits (in this case, in the original units). The error variance is again added to the variance of the predicted fits and the square-root is taken to obtain the standard deviation of y. Let’s work through a simple example where we first compute the standardized coefficient by hand, then compare to the output of piecewiseSEM: # Create fake binomial response coefs.data$y.binom <- rbinom(100, 1, 0.5)

# Fit using GLM
glm.model <- glm(y ~ x1, "binomial", coefs.data)

# Extract linear beta
Beta.glm <- summary(glm.model)$coefficients[2, 1] And apply the latent theoretic (standardize.type = "latent.linear") approach: # Extract predicted values on the link scale preds <- predict(glm.model, type = "link") # Compute sd of error variance using theoretical variances sd.y.LT <- sqrt(var(preds) + pi^2/3) # Compute sd of x sd.x <- sd(coefs.data$x1)

# Compare to automated output
coefs(glm.model, standardize.type = "latent.linear")
##   Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1        y        x1   0.0038    0.7345 98     0.0052  0.9959        6e-04
##
## 1
Beta.glm * sd.x / sd.y.LT
## [1] 0.0005725077

### 2.3 GLMs in pSEM

A problematic case arises when intermediate endogenous variables are non-normally distributed. Consider the following SEM:

In this SEM, there are two independence claims:

1. y3 | x1 (y1, y2)

2. y2 | y1 (x1)

Considering the second independence claim, in a Gaussian world, the significance value is the same whether the test is conducted as y2 | y1 (x1) or y1 | y2 (x1). This is NOT true, however, when one or both of the variables are modeled using a generalized linear model (GLM) fit to a non-normal distribution. This is because the response is now transformed via the link function (see Section 2.2). This transformations means the P-value obtained by regressing y1 against y2 is NOT the same as the one obtained by regressing y2 against y1.

The following example will show this is the case:

# Generate fake data
glmdat <- data.frame(x1 = runif(50), y1 = rpois(50, 10), y2 = rpois(50, 50), y3 = runif(50))

# Extract P-values
summary(lm(y1 ~ y2 + x1, glmdat))$coefficients[2, 4] ## [1] 0.1737639 summary(lm(y2 ~ y1 + x1, glmdat))$coefficients[2, 4]
## [1] 0.1737639
# Repeat but model y1 and y2 and Poisson-distributed
summary(glm(y1 ~ y2 + x1, "poisson", glmdat))$coefficients[2, 4] ## [1] 0.134332 summary(glm(y2 ~ y1 + x1, "poisson", glmdat))$coefficients[2, 4]
## [1] 0.1767188

This effect is problematic because the d-sep tests are wholly dependent on the significance value. If the P-value is biased based on the direction of the test, then the goodness-of-fit of the model can be over- or underestimated.

piecewiseSEM version 2.0 solves this by providing three options to the user.

1. One can specify the directionality of the test if, for instance, it makes greater biological sense to test y1 against y2 instead of the reverse.

2. One can remove that path from the basis set and instead specify it as a correlated error using %~~%.

3. One can conduct both tests and choose the most conservative (i.e., lowest) P-value.

These options are returned by summary in the event the above scenario is identified in the SEM:

# Generate fake data
glmdat <- data.frame(x1 = runif(50), y1 = rpois(50, 10), y2 = rpois(50, 50), y3 = runif(50))

# Construct SEM
glmsem <- psem(
glm(y1 ~ x1, "poisson", glmdat),
glm(y2 ~ x1, "poisson", glmdat),
lm(y3 ~ y1 + y2, glmdat)
)

summary(glmsem)
## Error:
## Non-linearities detected in the basis set where P-values are not symmetrical.
## This can bias the outcome of the tests of directed separation.
##
## Offending independence claims:
##  y2 <- y1 *OR* y2 -> y1
##
## Option 1: Specify directionality using argument 'direction = c()' in 'summary'.
##
## Option 2: Remove path from the basis set by specifying as a correlated error using '%~~%' in 'psem'.
##
## Option 3 (recommended): Use argument 'conserve = TRUE' in 'summary' to compute both tests, and return the most conservative P-value.

In option 1, the directionality can be specified using direction = c() as an additional argument.

summary(glmsem, direction = c("y1 <- y2"), .progressBar = F)$dTable ## Independ.Claim Estimate Std.Error DF Crit.Value P.Value ## 1 y3 ~ x1 + ... -0.0055 0.1479 46 -0.0370 0.9707 ## 2 y2 ~ y1 + ... 0.0119 0.0067 47 1.7728 0.0763 In option 2, the SEM can be updated to remove that test by specifying it as a correlated error (see Section 2.4). summary(update(glmsem, y1 %~~% y2), .progressBar = F) ##$name
## [1] "update(glmsem, y1 %~~% y2)"
##
## $call ## [1] "y1 ~ x1\n y2 ~ x1\n y3 ~ y1 + y2\n y1 ~~ y2" ## ##$dTable
##    Independ.Claim Estimate Std.Error DF Crit.Value P.Value
## 1 y3  ~  x1 + ...  -0.0055    0.1479 46     -0.037  0.9707
##
## $Cstat ## Fisher.C df P.Value ## 1 0.06 2 0.971 ## ##$IC
##     AIC   AICc    BIC K  n
## 1 16.06 19.585 31.356 8 50
##
## $coefficients ## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate ## 1 y1 x1 -0.2505 0.1538 48 -1.6285 0.1034 NA ## 2 y2 x1 -0.0071 0.0700 48 -0.1009 0.9196 NA ## 3 y3 y1 -0.0238 0.0138 47 -1.7304 0.0901 -0.2482 ## 4 y3 y2 -0.0054 0.0061 47 -0.8883 0.3789 -0.1274 ## 5 ~~y1 ~~y2 0.2714 NA 47 1.9331 0.0296 0.2714 ## ## 1 ## 2 ## 3 ## 4 ## 5 * ## ##$R2
##   Response   family     link     method R.squared
## 1       y1  poisson      log nagelkerke      0.09
## 2       y2  poisson      log nagelkerke      0.00
## 3       y3 gaussian identity       none      0.09
##
## attr(,"class")
## [1] "summary.psem"

Note that the claim no longer appears in the section for the tests of directed separation.

Finally, option 3 can be invoked by specifying conserve = T as an additional argument

summary(glmsem, conserve = T, .progressBar = F)$dTable ## Independ.Claim Estimate Std.Error DF Crit.Value P.Value ## 1 y3 ~ x1 + ... -0.0055 0.1479 46 -0.0370 0.9707 ## 2 y2 ~ y1 + ... 0.0119 0.0067 47 1.7728 0.0763 The user should be vigilant for these kinds of situations and ensure that both the specified paths AND the independence claims all make biological sense. In the case where the underlying assumptions of the d-sep tests can bias the goodness-of-fit statistic, piecewiseSEM should automatically alert the user. ### 2.4 Correlated errors Correlated errors reflect the situation where the relationship among the two variables is not presumed to be causal and unidirectional, but rather that both are being driven by some underlying driver and are therefore appear correlated. Such a relationship is denoted by using a double-headed arrow: This behavior is specified in piecewiseSEM using the new operator %~~% in the psem function. We can fit the above SEM: cordat <- data.frame(x1 = runif(50), y1 = runif(50), y2 = runif(50), y3 = runif(50)) corsem <- psem( lm(y1 ~ x1, cordat), lm(y2 ~ x1, cordat), y1 %~~% y2, lm(y3 ~ y1 + y2, cordat) ) summary(corsem, .progressBar = F) ##$name
## [1] "corsem"
##
## $call ## [1] "y1 ~ x1\n y2 ~ x1\n y1 ~~ y2\n y3 ~ y1 + y2" ## ##$dTable
##    Independ.Claim Estimate Std.Error DF Crit.Value P.Value
## 1 y3  ~  x1 + ...   0.0476    0.1343 46     0.3543  0.7248
##
## $Cstat ## Fisher.C df P.Value ## 1 0.644 2 0.725 ## ##$IC
##      AIC   AICc    BIC  K  n
## 1 20.644 26.467 39.764 10 50
##
## $coefficients ## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate ## 1 y1 x1 0.2223 0.1372 48 1.6202 0.1117 0.2277 ## 2 y2 x1 0.0773 0.1244 48 0.6210 0.5375 0.0893 ## 3 ~~y1 ~~y2 -0.1824 NA 47 -1.2715 0.8951 -0.1824 ## 4 y3 y1 -0.1690 0.1332 47 -1.2687 0.2108 -0.1839 ## 5 y3 y2 -0.0902 0.1503 47 -0.6000 0.5514 -0.0870 ## ## 1 ## 2 ## 3 ## 4 ## 5 ## ##$R2
##   Response   family     link method R.squared
## 1       y1 gaussian identity   none      0.05
## 2       y2 gaussian identity   none      0.01
## 3       y3 gaussian identity   none      0.04
##
## attr(,"class")
## [1] "summary.psem"

In the case where the correlated error occurs between two exogenous variables, it is simply the raw bivariate correlation whose P-value is determined using modifications to the function cor.test. In the event the correlated error includes an endogenous variable, it is the partial correlation that removes the effect of any covariates.

In the above example, the correlated error removes the influence of x1 on both y1 and y2 before computing their correlation.

cor(resid(lm(y1 ~ x1, cordat)), resid(lm(y2 ~ x1, cordat)))
## [1] -0.1823569
cerror(y1 %~~% y2, corsem)
##   Response Predictor Estimate Std.Error DF Crit.Value P.Value
## 1     ~~y1      ~~y2  -0.1824        NA 47    -1.2715  0.8951

### 2.5 Nested models and AIC

As noted, Shipley (2013) used the Fisher’s C statistic to construct an AIC score to facilitate model comparison and selection. This can be accomplished in the piecewiseSEM package with one important distinction.

Let’s consider comparing the following models for the mediating role of y1:

One might think that the models could be coded like this, and then compared:

AICdat <- data.frame(x1 = runif(50), y1 = runif(50), y2 = runif(50), y3 = runif(50))

sem1 <- psem(
lm(y1 ~ x1, AICdat),
lm(y2 ~ y1, AICdat),
lm(y3 ~ y2, AICdat)
)

sem2 <- psem(
lm(y1 ~ x1, AICdat),
lm(y2 ~ y1, AICdat)
)

AIC(sem1, sem2)

However, this does not account for the potential missing relationships with y3 to be in the model, which is critical as AIC incorporates Fisher’s C, which is determined by the d-sep tests. y3 must be present in the d-sep tests to make the comparison fair (i.e., the models must be nested).

To do so, we can use the following syntax:

sem2new <- update(sem2, y3 ~ 1)

AIC(sem1, sem2new)
##   df    AIC
## x  9 25.459
## y  6 24.450

Now the comparison is fair and the model selection procedure is robust. In this case, the models have equivalent support (i.e., $$\delta$$AIC < 2).

Comparison of saturated models–ones that have no missing paths–and unsaturated ones using AIC is currently not possible, although we are looking into possible likelihood formulations in the absence of a C statistic.

## 3. Comparing Package Versions

The new version 2.0 of piecewiseSEM replaces all of the old functions from version 1.x. This section will walk the user through the same worked example included in version 1.x from Shipley (2009), emphasizing the new syntax.

### 3.1 Introduction to Shipley (2009)

Shipley included an example dataset in his 2009 paper ‘Confirmatory path analysis in a generalized multilevel context.’ The data are included in this version of the package and alternately hosted in Ecological Archives E090-028-S1 (DOI: 10.1890/08-1034.1). While not actual observations (the data are randomly generated), the hypotheses correspond to actual ecological phenomenon.

Briefly, the data concern a forest survey where 5 individual trees of a particular species are followed at 20 sites every year beginning in 1970. For each individual, there is a measurement of: the latitude of the site (lat) the degree days until break (DD) the Julian date (date of year) of bud break (Date) the increase in stem diameter per tree (Growth) *a binary variable indicating whether the tree is alive (1) or dead (0) (Survival)

Shipley notes that this model would be difficult to test using traditional SEM because it represents variation occurring at multiple levels (between sites, between individuals within sites, and between years within individuals within sites), and one response, survival, is not normally distributed.

Shipley hypothesizes that the data adhere to the following hypothesized causal structure:

This example was included as the primary worked dataset in version 1.x of piecewiseSEM.

### 3.2 Comparing versions in evaluating the Shipley’s SEM

In the previous version 1.x of the package, the model is constructed using a list and then supplied to various additional functions to measure fit, extract coefficients, and so on. Version 2.0 of the package uses the new function psem and uses summary to extract all that information at once.

Next, we will walk through the analysis of the Shipley data using version 1.x and version 2.0, and compare the output.

# Load required packages
library(nlme)
library(lme4)

data(shipley)

# Create list of structural equations
shipley.list <- list(

lme(DD ~ lat, random = ~ 1 | site / tree, na.action = na.omit,
data = shipley),

lme(Date ~ DD, random = ~ 1 | site / tree, na.action = na.omit,
data = shipley),

lme(Growth ~ Date, random = ~ 1 | site / tree, na.action = na.omit,
data = shipley),

glmer(Live ~ Growth + (1 | site) + (1 | tree),
family = binomial(link = "logit"), data = shipley)

)

Note the use of mixed effects models to account for the nested non-independence of replicates as well as the use of a generalized linear mixed effects model to account for the binomial distribution of survival.

Next, we extract the goodness-of-fit and path coefficients.

(old.fit <- sem.fit(shipley.list, shipley, .progressBar = F))
## Warning: sem.fit has been replaced. Use psem instead of list, and
## then call summary on that object
## Conditional variables have been omitted from output table for clarity (or use argument conditional = T)
## $missing.paths ## missing.path estimate std.error df crit.value p.value ## 1 Date ~ lat + ... -0.0091 0.1135 18 -0.0798 0.9373 ## 2 Growth ~ lat + ... -0.0989 0.1107 18 -0.8929 0.3837 ## 3 Live ~ lat + ... 0.0305 0.0297 NA 1.0281 0.3039 ## 4 Growth ~ DD + ... -0.0106 0.0358 1329 -0.2967 0.7667 ## 5 Live ~ DD + ... 0.0272 0.0271 NA 1.0042 0.3153 ## 6 Live ~ Date + ... -0.0466 0.0298 NA -1.5613 0.1184 ## ##$Fisher.C
##   fisher.c df p.value
## 1    11.53 12   0.484
##
## $AIC ## AIC AICc K n ## 1 49.53 50.069 19 1431 (old.coefs <- sem.coefs(shipley.list)) ## Warning: sem.coefs has been replaced. Use psem instead of list, and ## then call summary or coefs on that object ## response predictor estimate std.error p.value ## 1 DD lat -0.8354736 0.119422385 0 *** ## 2 Date DD -0.4976475 0.004933274 0 *** ## 3 Growth Date 0.3007147 0.026631405 0 *** ## 4 Live Growth 0.3478536 0.058407891 0 *** Note the warnings produced by calling the old functions. Now, we repeat the exercise using the new functions. as.psem converts the list to psem object, but you could also run the code using psem instead of list. shipley.psem <- as.psem(shipley.list) ### NOT RUN # shipley.psem <- psem( # # lme(DD ~ lat, random = ~ 1 | site / tree, na.action = na.omit, # data = shipley), # # lme(Date ~ DD, random = ~ 1 | site / tree, na.action = na.omit, # data = shipley), # # lme(Growth ~ Date, random = ~ 1 | site / tree, na.action = na.omit, # data = shipley), # # glmer(Live ~ Growth + (1 | site) + (1 | tree), # family = binomial(link = "logit"), data = shipley) # # ) Now we extract a summary of the object. (new.summary <- summary(shipley.psem, .progressBar = F)) ##$name
## [1] "shipley.psem"
##
## $call ## [1] "DD ~ lat\n Date ~ DD\n Growth ~ Date\n Live ~ Growth" ## ##$dTable
##         Independ.Claim Estimate Std.Error   DF Crit.Value P.Value
## 1   Date  ~  lat + ...  -0.0091    0.1135   18    -0.0798  0.9373
## 2 Growth  ~  lat + ...  -0.0989    0.1107   18    -0.8929  0.3837
## 3   Live  ~  lat + ...   0.0305    0.0297   NA     1.0281  0.3039
## 4  Growth  ~  DD + ...  -0.0106    0.0358 1329    -0.2967  0.7667
## 5    Live  ~  DD + ...   0.0272    0.0271   NA     1.0035  0.3156
## 6  Live  ~  Date + ...  -0.0466    0.0298   NA    -1.5622  0.1183
##
## $Cstat ## Fisher.C df P.Value ## 1 11.535 12 0.484 ## ##$IC
##      AIC   AICc     BIC  K    n
## 1 49.535 50.237 149.591 19 1431
##
## $coefficients ## Response Predictor Estimate Std.Error DF Crit.Value P.Value ## 1 DD (Intercept) 196.6524 7.6605 1331 25.6708 0.0000 ## 2 DD lat -0.8355 0.1194 18 -6.9960 0.0000 ## 3 Date (Intercept) 198.3382 1.1842 1330 167.4825 0.0000 ## 4 Date DD -0.4976 0.0049 1330 -100.8757 0.0000 ## 5 Growth (Intercept) 10.7892 3.5189 1330 3.0661 0.0022 ## 6 Growth Date 0.3007 0.0266 1330 11.2917 0.0000 ## 7 Live Growth 0.3479 0.0584 1431 5.9556 0.0000 ## Std.Estimate ## 1 161.8746 *** ## 2 -0.6877 *** ## 3 250.3448 *** ## 4 -0.6281 *** ## 5 13.7208 ** ## 6 0.3824 *** ## 7 NA *** ## ##$R2
##   Response   family     link method Marginal Conditional
## 1       DD gaussian identity   none     0.49        0.70
## 2     Date gaussian identity   none     0.41        0.98
## 3   Growth gaussian identity   none     0.11        0.84
## 4     Live binomial    logit  delta     0.16        0.18
##
## attr(,"class")
## [1] "summary.psem"

The output will look familiar to anyone who has run a regression in R.

First, we have the call, which represents the structural questions.

Next are the AIC and new BIC values, formerly accessible using sem.aic. The two produce identical values.

# Old function
sem.aic(shipley.list, shipley, .progressBar = F)$AIC ## [1] 49.53 # Extract from new summary object new.summary$IC$AIC ## [1] 49.535 ### NOT RUN # Alternately, one could call AIC() on the psem object # AIC(shipley.psem) Note that because the values are stored in the summary output, they are much quicker to access than having to recompute the d-sep tests using sem.aic. Following the information criterion values, we have the tests of directed separation. As in the previous version, the independence claims are truncated to remove the conditioning variables (they can be shown using the argument conditional = TRUE). The d-sep tests are, once again, identical between the two versions, but are faster to retrieve from summary than recomputing them from scratch. # Old function sem.missing.paths(shipley.list, shipley, .progressBar = F) ## Conditional variables have been omitted from output table for clarity (or use argument conditional = T) ## missing.path estimate std.error df crit.value p.value ## 1 Date ~ lat + ... -0.009051378 0.11347661 18 -0.07976426 0.9373049 ## 2 Growth ~ lat + ... -0.098862826 0.11072020 18 -0.89290690 0.3836896 ## 3 Live ~ lat + ... 0.030496786 0.02966350 NA 1.02809131 0.3039069 ## 4 Growth ~ DD + ... -0.010613707 0.03576722 1329 -0.29674399 0.7667083 ## 5 Live ~ DD + ... 0.027190372 0.02707564 NA 1.00423751 0.3152641 ## 6 Live ~ Date + ... -0.046565402 0.02982419 NA -1.56133000 0.1184459 # Extract from new summary object new.summary$dTable
##         Independ.Claim Estimate Std.Error   DF Crit.Value P.Value
## 1   Date  ~  lat + ...  -0.0091    0.1135   18    -0.0798  0.9373
## 2 Growth  ~  lat + ...  -0.0989    0.1107   18    -0.8929  0.3837
## 3   Live  ~  lat + ...   0.0305    0.0297   NA     1.0281  0.3039
## 4  Growth  ~  DD + ...  -0.0106    0.0358 1329    -0.2967  0.7667
## 5    Live  ~  DD + ...   0.0272    0.0271   NA     1.0035  0.3156
## 6  Live  ~  Date + ...  -0.0466    0.0298   NA    -1.5622  0.1183
### NOT RUN
# Alternately, one could call dSep() on the psem object
# dSep(shipley.psem)

After the d-sep table is the Fisher’s C statistic, and the results from the $$\chi^2$$ test. This information was formerly obtained from sem.fit.

# Old function
sem.fit(shipley.list, shipley, .progressBar = F)$Fisher.C ## Warning: sem.fit has been replaced. Use psem instead of list, and ## then call summary on that object ## Conditional variables have been omitted from output table for clarity (or use argument conditional = T) ## fisher.c df p.value ## 1 11.53 12 0.484 # Extract from summary object new.summary$Cstat
##   Fisher.C df P.Value
## 1   11.535 12   0.484

As with all other functions, the values are exactly the same although the underlying functions have been rewritten to be more efficient.

Next, we have the path coefficients and the standardized estimates.

# Old function
sem.coefs(shipley.list, shipley)
## Warning: sem.coefs has been replaced. Use psem instead of list, and
## then call summary or coefs on that object
##   response predictor   estimate   std.error p.value
## 1       DD       lat -0.8354736 0.119422385       0 ***
## 2     Date        DD -0.4976475 0.004933274       0 ***
## 3   Growth      Date  0.3007147 0.026631405       0 ***
## 4     Live    Growth  0.3478536 0.058407891       0 ***
sem.coefs(shipley.list, shipley, standardize = "scale")
## Warning: sem.coefs has been replaced. Use psem instead of list, and
## then call summary or coefs on that object
## Warning in get.scaled.data(modelList, data, standardize): One or more
## responses not modeled to a normal distribution: keeping response(s) on
## original scale!
##   response predictor   estimate   std.error p.value
## 1       DD       lat -0.7014051 0.100258794       0 ***
## 2     Date        DD -0.6281367 0.006226838       0 ***
## 3   Growth      Date  0.3824224 0.033867469       0 ***
## 4     Live    Growth  2.2268596 0.373932623       0 ***
# Extract from new summary object
new.summary\$coefficients
##   Response   Predictor Estimate Std.Error   DF Crit.Value P.Value
## 1       DD (Intercept) 196.6524    7.6605 1331    25.6708  0.0000
## 2       DD         lat  -0.8355    0.1194   18    -6.9960  0.0000
## 3     Date (Intercept) 198.3382    1.1842 1330   167.4825  0.0000
## 4     Date          DD  -0.4976    0.0049 1330  -100.8757  0.0000
## 5   Growth (Intercept)  10.7892    3.5189 1330     3.0661  0.0022
## 6   Growth        Date   0.3007    0.0266 1330    11.2917  0.0000
## 7     Live      Growth   0.3479    0.0584 1431     5.9556  0.0000
##   Std.Estimate
## 1     161.8746 ***
## 2      -0.6877 ***
## 3     250.3448 ***
## 4      -0.6281 ***
## 5      13.7208  **
## 6       0.3824 ***
## 7           NA ***
# The new coefs() function is much faster too
coefs(shipley.psem)
##   Response   Predictor Estimate Std.Error   DF Crit.Value P.Value
## 1       DD (Intercept) 196.6524    7.6605 1331    25.6708  0.0000
## 2       DD         lat  -0.8355    0.1194   18    -6.9960  0.0000
## 3     Date (Intercept) 198.3382    1.1842 1330   167.4825  0.0000
## 4     Date          DD  -0.4976    0.0049 1330  -100.8757  0.0000
## 5   Growth (Intercept)  10.7892    3.5189 1330     3.0661  0.0022
## 6   Growth        Date   0.3007    0.0266 1330    11.2917  0.0000
## 7     Live      Growth   0.3479    0.0584 1431     5.9556  0.0000
##   Std.Estimate
## 1     161.8746 ***
## 2      -0.6877 ***
## 3     250.3448 ***
## 4      -0.6281 ***
## 5      13.7208  **
## 6       0.3824 ***
## 7           NA ***

Note two items: the standardized estimates and their standard errors are now reported by default at the end of the coefficients table (column Std.Estimate). the standardized estimate for the logistic regression is different, based on new calculations (see Section 2.2)

For the moment, version 2.0 only reports the scaled estimates (based on the standard deviations of the response and predictor). Future versions will include a range standardization.

Finally, summary reports the individual model R2 values. These were formerly obtained using sem.model.fits which was confusing. I have not included the old calculations in the package as the new rsquared function includes new calculations. However, this information is useful and therefore is now reported alongside the d-sep tests, Fisher’s C, and path coefficients.

The function partial.resid has been replaced with partialResid although the syntax is the same. The plotting output, however, has been eliminated.

partialResid extracts the partial effects of y ~ x in a multiple regression y ~ x + Z. For example:

# Generate data
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Build model
model <- lm(y ~ x1 + x2, dat)

# Compute partial residuals of y ~ x1
yresid <- resid(lm(y ~ x2, dat))

xresid <- resid(lm(x1 ~ x2, dat))

# Use partialResid
presid <- partialResid(y ~ x1, model)

par(mfrow = c(1, 2))

plot(yresid, xresid)

plot(presid) # identical plot!

sem.lavaan has not yet been ported to version 2.0, and it may not, as there was some confusion how multi-level models were translated to the variance-covariance framework (hint: they weren’t, only the formulae were transferred).

sem.plot has also not yet been ported to version 2.0, since it was of limited utility. It may make an appearance in the future.

## 4. References

Shipley, Bill. “A new inferential test for path models based on directed acyclic graphs.” Structural Equation Modeling 7.2 (2000): 206-218.

Shipley, Bill. Cause and correlation in biology: a user’s guide to path analysis, structural equations and causal inference. Cambridge University Press, 2002.

Shipley, Bill. “Confirmatory path analysis in a generalized multilevel context.” Ecology 90.2 (2009): 363-368.

Shipley, Bill. “The AIC model selection method applied to path analytic models compared using a d-separation test.” Ecology 94.3 (2013): 560-564.