Striving for Sparsity: On Exact and Approximate Solutions in Regularized Structural Equation Models

Abstract Regularized structural equation models have gained considerable traction in the social sciences. They promise to reduce overfitting by focusing on out-of-sample predictions and sparsity. To this end, a set of increasingly constrained models is fitted to the data. Subsequently, one of the models is selected, usually by means of information criteria. Current implementations of regularized structural equation models differ in their optimizers: Some use general purpose optimizers whereas others use specialized optimization routines. While both approaches often perform similarly, we show that they can produce very different results. We argue that in particular, the interaction between optimizer and selection criterion (e.g., BIC) contributes to these differences. We substantiate our arguments with an empirical demonstration and a simulation study. Based on these findings, we conclude that researchers should consider specialized optimizers whenever possible. To facilitate the implementation of such optimizers, we provide the R package lessSEM.

Structural equation modeling (SEM) is a confirmatory framework, where a model is first specified and then fitted to the data set. To this end, knowledge about each and every effect, (co-)variance, and intercept included in the model is required (e.g., based on theoretical considerations and prior analyses). Often, however, little is known about the underlying mechanisms and structure of the psychological constructs under investigation. Even in well-established fields, purely confirmatory modeling can fail to adequately reflect the data due to the rigidity of the underlying framework (e.g., Marsh et al., 2010).
Regularized SEM has been proposed as a compromise between confirmatory SEM and exploratory model search strategies (Huang et al., 2017; Jacobucci et al., 2016). This "semi-confirmatory approach" (Huang et al., 2017, p. 330) allows for more flexibility during data analysis, while still operating in a framework familiar to many social scientists. Regularized SEM has a strong focus on the predictive performance and sparsity of the model (Huang et al., 2017; Jacobucci et al., 2016). To this end, selected parameter estimates are shrunken toward zero (Tibshirani, 1996). This shrinkage restricts the degree to which a model is allowed to adapt to the data set and can improve its predictive performance if applied carefully (e.g., Huang et al., 2017; Jacobucci et al., 2016; Tibshirani, 1996). Additionally, many regularization techniques can set parameter estimates to zero, resulting in sparser models which may simplify the interpretation considerably (e.g., Fan & Li, 2001; Tibshirani, 1996; Zhang, 2010; Zou, 2006). The aforementioned properties make a compelling argument for the use of regularized SEM, which has consequently gained a foothold in the social sciences (e.g., Davis et al., 2019; Epskamp et al., 2017; Finch & Miller, 2021; Friemelt et al., 2022; Góngora et al., 2020; Jacobucci et al., 2019; Jacobucci & Grimm, 2018; Koncz et al., 2022; Scharf & Nestler, 2019; Ye et al., 2021).
It is important to understand the roots of such differences, both theoretically (e.g., when implementing regularization methods for other models) and practically (e.g., when choosing between different implementations of the same methods). Huang (2018) attributes some differences to the optimizers used by both packages. While the default setting in regsem is to rely on a general purpose optimizer that is not adapted to the specific penalty functions, Huang (2020a) implemented specialized optimizers (regsem optionally provides a specialized optimizer as well). For lasso regularization, Huang (2018) states that the method by Jacobucci (2017) "cannot efficiently find a local maximizer, especially when the model is relatively complex" (Huang, 2018, p. 500). Likewise, Orzek and Voelkle (2023) observed differences between general purpose optimizers and specialized optimizers in lasso regularized continuous time SEM and therefore used specialized optimizers. In contrast, Ridler (2022) used the general purpose optimizer after observing differences between both types of optimizers implemented in regsem, based on the recommendation provided in the package. Furthermore, Belzak and Bauer (2020) used general purpose optimizers to assess differential item functioning in item response theory models and found that their implementation performed similarly to specialized optimizers (see also Belzak, 2021). Finally, Battauz (2020) found that a general purpose optimizer resulted in a better fit than specialized optimizers in regularized nominal response models. It therefore remains unclear if the finding by Orzek and Voelkle (2023) generalizes to the "standard" regularized SEM, where general purpose optimizers are already routinely used, a question that we set out to answer in this work.
In the following, we will first provide an introduction to regularized SEM, focusing specifically on the issue of optimization. Building on Orzek and Voelkle (2023), we will demonstrate the consequences of using general purpose optimizers with an empirical example and provide new insights with a simulation study. Our results indicate that differences in the optimization procedures could partially explain the performance discrepancies observed between regsem and lslx in the literature (e.g., Finch & Miller, 2021; Friemelt et al., 2022; Geminiani et al., 2021; Huang, 2020a). We conclude that researchers using regularized SEM should prefer specialized optimizers where available. To facilitate the development of regularized SEM, we provide the new R package lessSEM.

The Two Steps of Regularized Structural Equation Modeling
At its core, regularized SEM is a two-step procedure: In the first step, a set of increasingly sparse models is estimated. This is followed by the second step, where one of the models is selected (e.g., Huang et al., 2017; Jacobucci et al., 2016; Tibshirani, 1996). Building on an extensive background of research on regularization techniques in machine learning (e.g., Fan & Li, 2001; Tibshirani, 1996; Zhang, 2010; Zou, 2006), different strategies for both steps, estimating increasingly sparse models and selecting one of these models, have been proposed in the regularized SEM framework (e.g., Huang et al., 2017; Jacobucci et al., 2016; Li & Jacobucci, 2021).

Step One: Estimating Increasingly Sparse Models
In regularized SEM, a set of increasingly sparse models is estimated by slightly altering the fitting function that is minimized when estimating the parameters θ. We will restrict the discussion to the commonly used −2 log-likelihood fitting function F_ML(θ). Furthermore, we will assume that the likelihood itself is given by the multivariate normal density function, the default in most SEM applications. In regularized SEM, F_ML(θ) is combined with a penalty term p(θ) (Huang et al., 2017; Jacobucci et al., 2016):

F_reg,λ(θ) = F_ML(θ) + N λ p(θ). (1)

Finding the parameter estimates θ̂_λ, which minimize F_reg,λ(θ), requires balancing the two forces in the new fitting function. This can be thought of as a tug-of-war, where F_ML(θ) tries to pull θ to the maximum likelihood estimates θ̂_ML and p(θ) pulls selected parameters toward zero. The tuning parameter λ ≥ 0 allows for shifting the balance between these two forces: If λ is close to zero, the maximum likelihood part F_ML(θ) will dominate the fitting function and parameters will be close to θ̂_ML. In contrast, if λ is very large, the penalty term p(θ) will take over the fitting function and parameters will be close to zero (this is demonstrated in Figure 1 for different penalty functions p(θ)). We can conceive of λ as the ratio of players per team in our tug-of-war game, which allows us to give one team an advantage over the other. Finally, the penalty term in Equation (1) is multiplied with the sample size N to stay consistent with current implementations of regularized SEM in regsem and lslx. Importantly, the parameter λ itself is not estimated alongside the other parameters in θ. Instead, researchers must define a set of λ values (e.g., λ ∈ {0, .01, …, 1}) and fit a separate model for each of them (e.g., Jacobucci et al., 2016). Thus, the number of λ values dictates how many models are estimated (see Huang et al., 2017; Jacobucci et al., 2016, for more details).
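As a hedged sketch, the tug-of-war in Equation (1) can be written out as follows; `f_ml` is only a quadratic stand-in for the −2 log-likelihood of an actual SEM, and all function names are our own:

```python
# Sketch of Equation (1): F_reg(theta) = F_ML(theta) + N * lam * p(theta).
# f_ml is a quadratic placeholder for the -2 log-likelihood, minimized at
# the (hypothetical) maximum likelihood estimates theta_ml.

def f_ml(theta, theta_ml):
    # placeholder maximum-likelihood part of the fit
    return sum((t - m) ** 2 for t, m in zip(theta, theta_ml))

def lasso_penalty(theta, regularized):
    # p(theta) = sum_j r_j * |theta_j| with selection operators r_j in {0, 1}
    return sum(r * abs(t) for t, r in zip(theta, regularized))

def f_reg(theta, theta_ml, n, lam, regularized):
    # combined fitting function: ML part plus N-scaled penalty
    return f_ml(theta, theta_ml) + n * lam * lasso_penalty(theta, regularized)
```

For λ = 0 the objective reduces to the maximum likelihood part; increasing λ makes non-zero regularized parameters more costly.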
Many different penalty functions p(θ) have been proposed, such as the ridge (Hoerl & Kennard, 1970), lasso (Tibshirani, 1996), adaptive lasso (Zou, 2006), elastic net (Zou & Hastie, 2005), scad (Fan & Li, 2001), lsp (Candès et al., 2008), cappedL1 (Zhang, 2010), and mcp (Zhang, 2010). Most importantly, the different penalty functions exhibit distinct properties. For instance, ridge regularization can only shrink parameters toward zero, while the other penalties mentioned above can shrink parameters and set them to zero; that is, they perform parameter selection (e.g., Tibshirani, 1996; see Figure 1 for a visualization of different penalty functions). In SEM this can, for instance, be used to identify non-zero cross-loadings (Scharf & Nestler, 2019) or to simplify multiple-indicator-multiple-causes models (Jacobucci et al., 2019). Most of the penalty functions mentioned above are implemented in the R packages regsem (Jacobucci, 2017) or lslx (Huang, 2020a). For simplicity, we will focus on lasso regularization in the following. However, we expect the results to be similar for all other penalties which use the absolute value function (e.g., adaptive lasso, elastic net, scad, lsp, cappedL1, and mcp). The lasso is defined as

p(θ) = Σ_{j=1}^{J} r_j |θ_j|. (2)

Here, J is the number of parameters in θ and r_j is a selection operator. If r_j = 1, the parameter θ_j is regularized and if r_j = 0 the parameter θ_j remains unregularized. This allows for regularizing only a subset of the parameters, for instance, only the loadings. Finally, |·| denotes the absolute value function.

Step Two: Selecting the Best Model
Different values of λ can result in vastly different parameter estimates (e.g., see Figure 1). This property was used in step one to estimate a set of increasingly sparse models. In step two, one of these models must be selected (i.e., an appropriate value for λ must be found). Too little regularization may result in an overly complex model that still overfits the data and generalizes poorly. Too much regularization, in contrast, may result in an overly sparse model that underfits the data. Different procedures for selecting λ have been proposed (e.g., Tibshirani, 1996). The most prominent in regularized SEM are cross-validation and information criteria (Huang et al., 2017; Jacobucci et al., 2016; Li et al., 2021).
Cross-validation provides a direct way to measure how well a model generalizes to data sets that come from the same population but have not been used to estimate the model parameters (e.g., Browne, 2000). To this end, the models are fitted on one part of the data set, using the remaining part to evaluate the out-of-sample fit. In k-fold cross-validation, the data set is split into k subsets. Subsequently, each subset serves once as a test set, while the model is fitted on the remaining k − 1 subsets (a general introduction to cross-validation is, for instance, provided by James et al., 2013). The main drawbacks of cross-validation are that (1) refitting the model on subsets of the data may result in non-convergence and worse parameter estimates due to the smaller sample size, and (2) depending on the complexity of the model, refitting the model may take a relatively long time. Therefore, cross-validated regularized SEM is rarely used in practice.
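As a minimal sketch of the k-fold procedure (a toy example in which the "model" is simply the training mean, not the routine implemented in any of the packages discussed here):

```python
import random

def kfold_mse(y, k=5, seed=0):
    # split row indices into k roughly equal folds
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for test in folds:
        # fit on the remaining k-1 folds: here, the "model" is the mean
        train = [i for i in indices if i not in test]
        prediction = sum(y[i] for i in train) / len(train)
        # evaluate on the held-out fold
        errors.append(sum((y[i] - prediction) ** 2 for i in test) / len(test))
    return sum(errors) / k  # average out-of-sample error
```

Each fold is used exactly once for testing, so every observation contributes to the pooled out-of-sample error.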
Information criteria are an alternative to cross-validation that do not require any refitting (asymptotically, information criteria and cross-validation are closely related; Shao, 1997; Stone, 1977). The Bayesian information criterion (BIC; Schwarz, 1978), for example, is defined as

BIC = F_ML(θ̂) + log(N) t. (3)

Smaller BIC values indicate better fit and can be achieved by either reducing F_ML(θ) or by reducing the number of model parameters t. The latter can be seen as the BIC rewarding model sparsity. In regularized SEM, the BIC and similar information criteria are typically used to select the final parameters if the penalty can set them to zero and therefore reduce t (Huang et al., 2017; Jacobucci et al., 2016). Importantly, information criteria are currently the standard model selection procedure in regsem and lslx and are used in many recent publications (e.g., Brandmaier & Jacobucci, in press; Davis et al., 2019; Finch & Miller, 2021; Góngora et al., 2020; Huang et al., 2017; Jacobucci et al., 2019; Koncz et al., 2022; Li et al., 2021; Scharf & Nestler, 2019; Ye et al., 2021). However, as we will discuss in detail in the next section, combining information criteria with general purpose optimizers requires two additional parameters that determine which model gets selected (see also Orzek & Voelkle, 2023).
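A hedged sketch of this definition; the helper name and toy values are our own:

```python
import math

# BIC = F_ML + log(N) * t, with F_ML the -2 log-likelihood of the fitted
# model and t the number of freely estimated parameters (cf. Equation 3)
def bic(f_ml_value, n_free_parameters, n):
    return f_ml_value + math.log(n) * n_free_parameters

# at equal fit, the sparser model (t = 5) beats the denser one (t = 6)
dense = bic(100.0, 6, 500)
sparse = bic(100.0, 5, 500)
```

This makes the sparsity reward explicit: removing one free parameter lowers the BIC by log(N) as long as the fit does not deteriorate.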

General Purpose and Specialized Optimizers
While adding the penalty function p(θ) to F_ML(θ) seems innocuous, it can have severe consequences for parameter estimation. Uncovering these consequences requires a closer look at optimization algorithms (see also Belzak, 2021; Orzek & Voelkle, 2023).
To estimate the model parameters θ̂_λ, F_reg,λ(θ) must be minimized. What sounds like a simple procedure is, in fact, a tedious process with numerous pitfalls. For the models under consideration here, there is (in general) no direct (i.e., closed form) solution to minimizing F_reg,λ(θ). Instead, an iterative procedure is used which replaces the difficult to answer question "What is the minimum of F_reg,λ(θ)?" with a simpler one: "Is there a parameter vector θ_{k+1} close to the current point θ_k that results in a smaller F_reg,λ(θ)?" This question is asked repeatedly until no further improvement to θ_k can be made. Many different procedures that formalize and answer this surrogate question have been proposed, which differ in the functions they can minimize (see Nocedal & Wright, 2006, for an introduction). To better understand why F_reg,λ(θ) is a special problem, we will concentrate on so-called Newton-type optimization algorithms. Here, in each iteration k, one tries to take a step of length s_k in a downward direction d_k. The new parameter vector θ_{k+1} is then given by:

θ_{k+1} = θ_k + s_k d_k.

A visual representation of this approach for a single parameter θ and fitting function f(θ) is provided in Figure 2. Note that we get closer to the minimum of the fitting function with each step. Newton-type optimizers typically use the gradients and the Hessian of the fitting function to determine the step direction d_k (e.g., Nocedal & Wright, 2006).² Therefore, they assume that both the gradients and the Hessian can be computed (this assumption is often stated as "f : ℝⁿ → ℝ is a smooth function" or "f : ℝⁿ → ℝ is continuously differentiable"; Nocedal & Wright, 2006, p. 10 and p. 14). For F_reg,λ(θ) this leaves us with a problem: Whereas gradients and the Hessian for F_ML(θ) are typically well defined (e.g., von Oertzen & Brick, 2013), this is not the case for any of the penalty functions introduced above, with the exception of ridge, which does not set parameters to zero.
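The Newton-type update can be sketched for a smooth toy function; the function f(θ) = (θ − 2)², the fixed step length, and all names are our own illustration, not an actual SEM fitting function:

```python
# minimal Newton-type iteration theta_{k+1} = theta_k + s_k * d_k
# for the smooth toy function f(theta) = (theta - 2)^2
def newton_minimize(theta, n_steps=10):
    for _ in range(n_steps):
        gradient = 2.0 * (theta - 2.0)  # f'(theta)
        hessian = 2.0                   # f''(theta)
        d = -gradient / hessian         # downward direction d_k
        s = 1.0                         # step length s_k (fixed here)
        theta = theta + s * d           # the Newton-type update
    return theta
```

For this quadratic toy function, a full Newton step lands on the minimum immediately; real fitting functions require repeated steps and a line search for s_k.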
Figure 3 exemplarily shows the ridge and lasso penalty values for a single parameter. Note that the ridge penalty is smooth (or differentiable) whereas the lasso is non-smooth (or non-differentiable) at θ = 0. To demonstrate the non-differentiability of the lasso mathematically, we can rewrite the absolute value as |θ_j| = √(θ_j²), where √· is the positive square root. Differentiating with respect to θ_j, we get ∂|θ_j|/∂θ_j = θ_j / √(θ_j²), which evaluates to −1 if θ_j < 0 and to 1 if θ_j > 0, but is undefined at θ_j = 0, where it would require dividing by zero. The non-differentiability is essential for the parameter selection property of the lasso and related regularization procedures (e.g., Fan & Li, 2001): The steepness of the curves shown in Figure 3 corresponds to the force with which they push parameters to zero (Andrew & Gao, 2007). For the ridge penalty, the steepness decreases as θ approaches zero. As a consequence, any small force pushing θ away from zero (F_ML(θ) in our case) will overrule the ridge penalty and parameters will stay non-zero. Contrast this with the lasso penalty, which retains its steepness. This allows the lasso to relentlessly push θ to zero, overruling any opposing force by F_ML(θ) (Andrew & Gao, 2007).
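This difference in "pull" can be made concrete with a small sketch (our own illustration, assuming a ridge penalty of θ² whose derivative is 2θ):

```python
# the "pull toward zero" of a penalty is the magnitude of its derivative:
# for the ridge (theta^2) the pull 2|theta| fades near zero, while the
# lasso pull stays at 1 for every theta != 0
def ridge_pull(theta):
    return abs(2.0 * theta)

def lasso_pull(theta):
    if theta == 0.0:
        raise ValueError("derivative of |theta| is undefined at 0")
    return 1.0
```

Near zero, the ridge pull is easily overruled by the likelihood, whereas the lasso pull never weakens, which is why only the lasso can zero parameters exactly.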
As outlined above, the non-differentiability of the lasso penalty violates the assumptions underlying many common optimizers. In the following, we will present two ways of dealing with this issue: First, by using general purpose optimizers, and second, by using specialized optimizers (e.g., Belzak, 2021; Orzek & Voelkle, 2023; Schmidt et al., 2009).

General Purpose Optimizers
One way of dealing with the non-differentiability of the lasso is to ignore it. As the lasso is differentiable almost everywhere (with θ_j = 0 being the only exception), optimizers that require differentiability will often converge to a solution close to the true minimum. Formally, however, this approach violates the assumptions of the optimizer: the ridge is differentiable, whereas the lasso is not.

² The gradient is a vector of first derivatives of the fitting function with respect to the parameters and the Hessian is a matrix of second derivatives.
A second way of dealing with the non-differentiability is to smooth the lasso penalty: the absolute value |θ_j| is replaced with the approximation √(θ_j² + ε), where ε > 0 is a small smoothing parameter. Mathematically, the derivative is now given by θ_j / √(θ_j² + ε) and the division by zero is avoided. Therefore, the smoothness assumption underlying many general purpose optimizers is met.
From an optimization perspective, the smoothed approximation has one major advantage: Many optimizers rely on the gradients to judge convergence (if all gradients are close enough to zero, the optimization stops). The gradients of the lasso penalty, however, are either −1, 1, or undefined. The optimizer may therefore miss the minimum, which results in long run times (a small ε can also result in an ill-conditioned Hessian; Schmidt et al., 2009). In the smoothed lasso, in contrast, the gradient is zero at θ = 0, allowing the standard convergence criteria to work well (assuming ε is large enough).
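To illustrate, a minimal sketch of the smoothed approximation (assuming the √(θ² + ε) form described above; function names are our own):

```python
import math

# smoothed absolute value sqrt(theta^2 + eps) and its derivative
# theta / sqrt(theta^2 + eps); the gradient is exactly zero at theta = 0
def smoothed_abs(theta, eps):
    return math.sqrt(theta ** 2 + eps)

def smoothed_abs_gradient(theta, eps):
    return theta / math.sqrt(theta ** 2 + eps)
```

For parameters far from zero the gradient is close to ±1, matching the lasso; at θ = 0 it is exactly zero, so gradient-based convergence checks behave well.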
The beauty of either ignoring the non-differentiability or smoothing the lasso penalty function lies in its simplicity. Building on existing optimizers implies that the algorithms used to find the parameter estimates are well-established and reliable. A downside, however, is that none of the parameter estimates will be zeroed (e.g., Chamroukhi & Huynh, 2019; Muggeo, 2010; Orzek & Voelkle, 2023): When ignoring the non-differentiability, this is just a side-product of using an optimizer which is not well suited for the optimization problem. In contrast, a smoothed lasso will no longer have the force to push parameters all the way to zero given any opposing force; this is similar to the ridge penalty. To achieve sparsity, a threshold parameter τ can be defined (e.g., τ = 0.001). If the absolute value of a regularized parameter falls below the threshold, this parameter is treated as zeroed (e.g., Belzak & Bauer, 2020; Chamroukhi & Huynh, 2019; Epskamp et al., 2017; Orzek & Voelkle, 2023). Because none of the parameters can be zeroed, we will refer to the parameter estimates provided by both approaches outlined above as approximate solutions.
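The thresholding step can be sketched as follows (function names are our own):

```python
# treat regularized parameters with |theta_j| < tau as zeroed
def apply_threshold(theta, tau):
    return [0.0 if abs(t) < tau else t for t in theta]

# count the parameters that remain non-zero after thresholding
def count_nonzero(theta, tau):
    return sum(1 for t in theta if abs(t) >= tau)
```

Note that the choice of τ is consequential: a parameter of 0.0005 is "zero" under τ = 0.001 but "non-zero" under τ = 0.0001.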

Specialized Optimizers
Many specialized optimizers have been developed specifically for penalty functions such as the lasso (e.g., Andrew & Gao, 2007; Friedman et al., 2010; Gong et al., 2013; Huang et al., 2017; Yuan et al., 2012). Thorough introductions are, for instance, provided by Goldstein et al. (2016), Parikh and Boyd (2013), and Yuan et al. (2010). For our objectives, the most important difference between the general purpose optimizers presented in the previous section and the specialized optimizers discussed here is that specialized optimizers can account for the non-differentiability of the penalty function. As a result, they can push parameters all the way to zero; that is, they will provide what we will call exact solutions to the optimization problem.
To demonstrate how an exact solution can be found, we will make use of a simplified objective function. Assume that we want to minimize the lasso regularized function f(θ) = 0.3θ + 0.5θ² + λ|θ|. We first take the derivative with respect to θ and set it to zero:

∂f(θ)/∂θ = 0.3 + θ + λ ∂|θ|/∂θ, which we set to 0.

As outlined above, the derivative ∂|θ|/∂θ may be undefined. Therefore, one typically resorts to subderivatives, a generalization of derivatives (e.g., Hastie et al., 2015). For our purposes, however, we can use the fact that (1) the function we try to minimize has one and only one minimum (this follows from the function being strictly convex; see also Nocedal & Wright, 2006) and (2) as derived above, ∂|θ|/∂θ = −1 if θ < 0 and ∂|θ|/∂θ = 1 if θ > 0. With this in mind, we are ready to readdress our original problem. Let's assume that θ is smaller than zero. In this case, we know that ∂|θ|/∂θ = −1 and it follows that θ = −0.3 + λ minimizes f(θ). Note that we can rewrite our assumption θ < 0 as θ = −0.3 + λ < 0. Next, assume that θ > 0. It follows that ∂|θ|/∂θ = 1 and the θ minimizing f(θ) is given by θ = −0.3 − λ. Again, we can rewrite the assumption θ > 0 as θ = −0.3 − λ > 0. Combined, we get

θ̂ = −0.3 + λ, if −0.3 + λ < 0,
θ̂ = −0.3 − λ, if −0.3 − λ > 0,
θ̂ = 0, otherwise.

The last case follows from statement (1) above: The function has one and only one minimum. If this minimum is neither in θ < 0 nor in θ > 0, this just leaves θ = 0 as the minimum. This last case is crucial because it lets us set θ to exactly zero. With this, we have developed a specialized optimizer that provides an exact solution to our simplified objective function. Optimizers for more complex models can be developed with a similar procedure.
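A small numeric check of this closed-form solution (our own illustration; the grid search is only a brute-force sanity check, not a practical optimizer):

```python
# the simplified lasso regularized objective from the text
def f(theta, lam):
    return 0.3 * theta + 0.5 * theta ** 2 + lam * abs(theta)

# closed-form minimizer derived above; for lam >= 0 the theta > 0 branch
# (-0.3 - lam > 0) can never occur, leaving two cases
def exact_minimizer(lam):
    return -0.3 + lam if -0.3 + lam < 0 else 0.0

# brute-force check: evaluate f on a fine grid over [-1, 1]
def grid_minimizer(lam, step=1e-4):
    grid = [i * step for i in range(-10000, 10001)]
    return min(grid, key=lambda t: f(t, lam))
```

For λ below 0.3 the minimizer sits at −0.3 + λ; once λ reaches 0.3, the exact solution is θ = 0, which a grid search can only approximate.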
The main disadvantage of specialized optimizers is that they have to be adapted to each and every penalty function. The solution for our simplified objective function above, for example, only works for the lasso penalty. Deriving the optimization steps for a new penalty function can be complicated (e.g., Gong et al., 2013), considerably increasing the effort required to get exact solutions. Thus, one may rightfully wonder if specialized optimizers are, in fact, necessary. We will return to this important question later on.

Information Criteria for Model Selection: Close to Zero Is Not Zero
As discussed above, the BIC (and similar information criteria) rewards model sparsity by penalizing with respect to the number of parameters t in the model. As apparent from Equation (3), the only way for regularization to improve the BIC (as currently used in lasso regularized SEM) is by zeroing parameters. Importantly, a parameter value of 10⁻¹⁰⁰ is seen as qualitatively different from a parameter value of 0 because only in the latter case, the number of free parameters t is reduced by one (see Equation 3). This results in the "ragged" BIC pattern shown in Figure 5. Issues arise, however, when using approximate solutions generated by general purpose optimizers, as none of the parameters is exactly zero. To address this shortcoming, the threshold parameter τ can also be applied for model selection: Any regularized parameter with an absolute value below the threshold τ is counted as zeroed in the BIC (e.g., Belzak & Bauer, 2020; Epskamp et al., 2017; Muggeo, 2010; Orzek & Voelkle, 2023). Consequently, the threshold parameter τ directly affects model sparsity (e.g., Chamroukhi & Huynh, 2019; see also Zou & Li, 2008), information criteria, and, therefore, also model selection (e.g., Orzek & Voelkle, 2023). With this in mind, we can readdress the findings of Orzek and Voelkle (2023) and Belzak and Bauer (2020) mentioned in the introduction: Orzek and Voelkle (2023) found that regularized continuous time SEMs estimated with the general purpose optimizer implemented in Rsolnp (Ghalanos & Theussl, 2015; Ye, 1987; the same optimizer is used in regsem by default) can differ from those estimated with the specialized optimizer GIST (Gong et al., 2013). This was due to the Akaike information criterion being highly affected by the threshold parameter τ, resulting in different models being selected when the general purpose optimizer was used (the smoothing parameter ε was set to 0.001 for all analyses). Belzak and Bauer (2020) combined a general purpose optimizer with a thresholding parameter of τ = 0.00001 to investigate differential item functioning in item response theory models and report that their method performed similarly to existing procedures using specialized optimizers. This was studied in more detail by Belzak (2021), who found that a general purpose and a specialized optimizer produced mostly identical results, with some differences in the number of zeroed parameters. However, to the best of our understanding, they did not vary the value of the smoothing parameter ε or the threshold parameter τ.
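The interaction between the threshold τ and the BIC can be sketched with toy numbers (all values below are our own assumptions, not results from any of the cited studies):

```python
import math

# BIC of Equation (3) with tau-based parameter counting: a regularized
# parameter is counted as free only if its absolute value reaches tau
def bic_thresholded(f_ml_value, theta, n, tau):
    t = sum(1 for v in theta if abs(v) >= tau)
    return f_ml_value + math.log(n) * t

# an approximate solution: one parameter shrunken to 1e-4, not exactly 0
theta_approx = [0.8, 1e-4]
strict = bic_thresholded(100.0, theta_approx, 500, 1e-5)   # counts 2 parameters
lenient = bic_thresholded(100.0, theta_approx, 500, 1e-3)  # counts 1 parameter
```

The two τ values assign the same approximate solution different BIC values (the gap is exactly log(N)), which is how the threshold can flip which model gets selected.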
Given that general purpose optimizers are already routinely used in regularized SEM, it is important to test the degree to which these models may be affected as well. To this end, we will build on and extend the findings of Orzek and Voelkle (2023) to regularized SEM by (1) using an empirical example and a simulation study, (2) comparing two very similar optimizers, and (3) considering different values for both τ and ε.

lessSEM: A Flexible Regularized SEM Package
At the time of writing, the two most prominent R packages implementing regularized SEM are regsem (Jacobucci, 2017) and lslx (Huang, 2020a), with additional packages (e.g., regCtsem; Orzek & Voelkle, 2023; and lvnet; Epskamp et al., 2017) available for more specialized applications. By default, regsem relies on general purpose optimization with Rsolnp (Ghalanos & Theussl, 2015; Ye, 1987), setting the smoothing parameter ε to zero and the threshold parameter τ to 10⁻³.³ In addition, a specialized optimizer using coordinate descent is optionally available. Other software implementations, such as lvnet (interfacing to OpenMx; Neale et al., 2016) and regCtsem, also offer general purpose optimization. In contrast, lslx offers a specialized quasi-Newton optimizer that takes the non-differentiability into account (Huang, 2020a). This optimizer was originally developed by Friedman et al. (2010) and Yuan et al. (2012); a variant thereof is also implemented in the prominent glmnet package (Friedman et al., 2010) and in the regCtsem package (Orzek & Voelkle, 2023), alongside a proximal operator based procedure (see Gong et al., 2013).
Given the many differences between the packages mentioned above, a direct comparison of general purpose optimizers and specialized optimizers is difficult. Comparing regsem and lslx, for instance, would conflate differences between optimizers and differences in the implementation of the underlying model. Using the general purpose and specialized optimizers in regsem would base the comparison on two fairly different optimization routines. To provide a fair comparison, we created a new R package, lessSEM, using two quasi-Newton procedures.⁴ The general purpose optimizer is based on the so-called BFGS procedure, which we describe in more detail in Appendix A (this optimizer is similar to the BFGS procedure in optim; R Core Team, 2022). As specialized optimizer, we implemented the GLMNET procedure (Friedman et al., 2010; Huang, 2020a; Yuan et al., 2012). We compare our implementation with lslx in the osf repository at https://osf.io/kh9tr/.
Importantly, the objectives of lessSEM are not to replace the existing, more mature packages regsem and lslx but to (1) provide a more level playing field for the comparison of general purpose and specialized optimizers in regularized SEM and (2) make specialized optimizers for exact solutions more easily available. To this end, all optimizers implemented in lessSEM can be accessed both from R and C++ using Rcpp (Eddelbuettel & François, 2011) and RcppArmadillo (Eddelbuettel & Sanderson, 2014), simplifying the development of new regularized models (see Appendix C for an example).⁶ For instance, lessSEM could also be used by regsem, which currently does not support the GLMNET procedure. In addition, (3) lessSEM provides automatic cross-validation, full information maximum likelihood estimation in case of missing data, and multi-group models (see also Huang, 2018). Furthermore, lessSEM extends the functionality of existing packages, for example, by providing definition variables, parameter transformations inspired by OpenMx (Neale et al., 2016), and combinations of different penalties (e.g., for use in multi-group models; see Geminiani et al., 2021). Therefore, method developers can use lessSEM as a basis for creating new regularized SEM. For instance, regularized continuous time SEM can also be implemented in lessSEM despite the fairly involved transformations required to estimate such models (see Voelkle et al., 2012, for more details).

Empirical Demonstration
To demonstrate differences that can occur between general purpose optimization and specialized optimization, we used the bfi data set provided in the psychTools package (Revelle, 2022; Revelle et al., 2010). This data set consists of 2,800 subjects who answered 25 items measuring five personality factors: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, and was also analyzed by Huang (2020b). Following Huang (2020b), we used only complete cases (N = 2,436) and estimated a semi-confirmatory factor analysis, where all cross-loadings were regularized (see Figure 6). We used lasso regularization with 200 evenly spaced λ values between 0 and the first λ that sets all regularized parameters to zero (this value was computed with the procedure outlined in Orzek & Voelkle, 2023). The convergence threshold was set to 10⁻¹² for both the general purpose and the specialized optimizer. For the general purpose optimizer, we set ε to 10⁻⁸ as this value performed well in the simulation study (see next section). Finally, because regularized SEM is often proposed for situations where the model is relatively complex but N is small (e.g., Jacobucci et al., 2019), we repeated the analysis using randomly selected subsets of size N ∈ {250, 500, 1000, 2000, 2436}. We standardized all data sets prior to the analysis (see Jacobucci et al., 2016, for more details on standardization). In the following, we will only report the results for N = 500. The remaining results are reported in the osf repository at https://osf.io/kh9tr/.
Figure 7 shows the BIC as well as the number of parameters remaining in the model. Note that, consistent with findings by Orzek and Voelkle (2023) in continuous time SEM, the model selection is highly susceptible to the thresholding parameter τ. That is, depending on the optimization routine and threshold parameter setting used, researchers may come to different conclusions regarding the number of non-zero parameters. In Appendix D, we additionally provide parameter estimates for the loadings matrices of the specialized optimizer and three selected approximate solutions of the general purpose optimizer. Therein, we also present a comparison to the approximate solutions returned by the current version of regsem (package version 1.9.3) using the default general purpose optimizer. Similar to the results shown in Figure 7, Figure D3 in Appendix D shows that the threshold parameter τ has a considerable influence on the parameters returned by the regsem package.
Our empirical example provided first evidence that the threshold parameter τ can affect the final parameter estimates in regularized SEM, both when using the newly developed package lessSEM and when using the established package regsem. In the next section, we will systematize these findings in a simulation study.

Simulation Study
To systematically compare general purpose optimizers and specialized optimizers, we ran a simulation study where we used lasso regularization to identify non-zero cross-loadings in a confirmatory factor analysis (see also Huang et al., 2017; Scharf & Nestler, 2019). We used a three-factor CFA with residual variances

Θ_ε = diag(0.115, 0.464, 0.572, 0.345, 0.411, 0.122, 0.536, 0.68, 0.383, 0.627, 0.123, 0.491, 0.546, 0.518, 0.598). (6)

Equation (4) shows the measurement model, where we assumed that the core of the three latent factors (ξ₁, ξ₂, ξ₃) is reflected by five items each. Loadings on these items are underlined in Equation (4) and will be referred to as core-loadings in the following. Among these core-loadings, one was set to 1; this loading was used for scaling the latent variables in the analysis.⁸ All other core-loadings were drawn from the uniform distribution U[0.7, 0.9]. Additionally, we allowed for three non-zero cross-loadings per latent factor. The location of these cross-loadings was randomly selected and their values drawn from U[0.1, 0.3]. The variances of the three factors were set to one (see Equation 5; the diag(·) operator populates the diagonal of a square zero matrix with the embraced values) and the variances of the residuals were set such that the overall variance of each manifest variable was 1.2 (see Equation 6). Because all items have the same expected variance, no further standardization is required.
We simulated five hundred data sets with a sample size of N = 250 and fitted unregularized and regularized models with R (R Core Team, 2022) using lavaan (Rosseel, 2012) and lessSEM. We estimated both core- and cross-loadings freely, with the exception of the first core-loading of each latent variable, which we set to 1 for scaling. All cross-loadings were regularized with the lasso penalty. For the tuning parameter λ, we generated 200 evenly spaced values between 0 and the first λ that zeroed all regularized parameters. In case of the general purpose optimizer, we either ignored the non-differentiability or smoothed the lasso penalty with ε ∈ {10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻²} and estimated the parameters with the BFGS procedure outlined in Appendix A. We stopped the optimization if the GLMNET criterion outlined in Yuan et al. (2012, see Equation 25) was met for both general purpose and specialized optimization. The only exception was the combination of the general purpose optimizer and ignored non-differentiability, where we treated the model as converged if the change in fit from one iteration to the next fell below the threshold 10⁻⁶. This exception was made due to the long runtimes when using the gradient-based GLMNET criterion. Average run times were between 4.4 s (general purpose optimization with ε = 10⁻²) and 11.8 s (general purpose optimization when ignoring the non-differentiability). In the following, we will compare the results of the general purpose optimizer and the specialized optimizer for the two steps of estimating a regularized SEM separately. This will shed light on the roots of the differences observed in the empirical example.
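The smoothing mentioned above can be illustrated with one common smooth approximation of the lasso penalty, which replaces |θ| with sqrt(θ² + ε). We do not claim this is the exact form implemented in lessSEM; the sketch merely conveys why smaller ε values track the exact penalty more closely, while any ε > 0 keeps the function differentiable at zero.

```python
import numpy as np

def lasso_penalty(theta, lam):
    """Exact lasso penalty; non-differentiable at theta_j = 0."""
    return lam * np.sum(np.abs(theta))

def smoothed_lasso_penalty(theta, lam, eps):
    """Smooth approximation: |x| is replaced by sqrt(x^2 + eps)."""
    return lam * np.sum(np.sqrt(theta ** 2 + eps))

def smoothed_lasso_gradient(theta, lam, eps):
    """Gradient of the smoothed penalty; defined everywhere, unlike d|x|/dx."""
    return lam * theta / np.sqrt(theta ** 2 + eps)

theta = np.array([0.0, 0.01, -0.5])
exact = lasso_penalty(theta, 1.0)                      # = 0.51
approx_fine = smoothed_lasso_penalty(theta, 1.0, 1e-8)
approx_coarse = smoothed_lasso_penalty(theta, 1.0, 1e-2)
```

Because sqrt(θ² + ε) > |θ| for ε > 0, the smoothed penalty always overestimates the exact one, and the gap grows with ε; this is one reason approximate solutions shrink more slowly toward zero for large ε.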

Differences in Step One: Estimating Increasingly Sparse Models
We first checked if both optimization routines (the general purpose optimizer and the specialized optimizer) performed similarly in terms of minimizing F_reg,λ(θ). To this end, we defined the relative fit (RF) for repetition r ∈ {1, 2, ..., 500} as

RF_λ,r,ε = F_reg,λ,r(θ̂ᵃ_λ,r,ε) / F_reg,λ,r(θ̂ᵉ_λ,r), (7)

where F_reg,λ,r(·) is the value of the regularized fitting function in repetition r for a given λ. The vector θ̂ᵃ_λ,r,ε holds the approximate parameter estimates returned by the general purpose optimizer for a given smoothing parameter ε. Similarly, θ̂ᵉ_λ,r is a vector with the exact parameter estimates produced by the specialized optimizer. A value of 1.02, for example, indicates that the approximate solution of the general purpose optimizer resulted in a 2% larger (i.e., worse) fitting function value than the exact solution of the specialized optimizer. Values below 1, in contrast, indicate that the general purpose optimizer outperformed the specialized optimizer.

Footnote 7: Files to replicate the simulations and the empirical example are provided in the osf repository at https://osf.io/kh9tr/. All analyses were performed using a single core of an Intel Xeon E5-2670 processor. Footnote 8: We chose this way of scaling because scaling by setting the variances to 1 would allow for multiple equivalent solutions with reversed signs of loadings.
Additionally, we computed the sum of the squared differences (SSD) between the parameter estimates of the general purpose optimizer and the specialized optimizer. The SSD was defined as:

SSD_λ,r,ε = ‖θ̂ᵃ_λ,r,ε − θ̂ᵉ_λ,r‖². (8)

Larger values indicate larger differences between the general purpose optimizer and the specialized optimizer.
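Both comparison measures are straightforward to compute. A minimal Python sketch (the function names are ours):

```python
import numpy as np

def relative_fit(f_approx, f_exact):
    """RF > 1: the approximate solution fits worse than the exact one."""
    return f_approx / f_exact

def ssd(theta_approx, theta_exact):
    """Sum of squared differences between two parameter vectors."""
    diff = np.asarray(theta_approx) - np.asarray(theta_exact)
    return float(np.sum(diff ** 2))

rf = relative_fit(1.02, 1.00)               # 2% worse fit of the approximation
d = ssd([0.3, 0.0, 0.1], [0.28, 0.0, 0.12])  # = 0.02^2 + 0.02^2 = 0.0008
```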
Figure 8 shows the relative fit of the approximate solutions with the general purpose optimizer as compared to the exact solutions of the specialized optimizer. Note that the specialized optimizer almost always resulted in a better fit than the general purpose optimizer. However, the differences were negligible in many cases. Ignoring the non-differentiability resulted in a worse fit, which may have been due to issues with the gradient computation. Furthermore, smoothing the penalty function worked better with smaller ε values.
Figure 9 shows the SSD. Notably, the differences in the fitting function values are also reflected in differences in the parameter estimates. To provide additional insights into the roots of the differences observed in Figure 9, we plotted the regularized parameter estimates returned by the general purpose optimizer and the specialized optimizer for the first of the 500 repetitions in Figure 10. Note that (1) larger ε values result in slower shrinkage toward zero and (2) ignoring the non-differentiability results in more "ragged" paths, indicating that the optimizer may have had difficulties finding a minimum.
Overall, our results suggest that approximate solutions based on general purpose optimizers with smoothed penalty functions and exact solutions using a specialized optimizer perform similarly in terms of model fit and SSD if ε is set carefully. Importantly, however, model selection is not yet taken into account in these comparisons. In practice, only the parameter estimates of the final model (i.e., a specific λ value) are of relevance. For this reason, we now turn to model selection and repeat the comparison for the final model as selected by the BIC.

Differences in Step Two: Selecting the Best Model
To test the model selection with the BIC, we set the threshold parameter s to s ∈ {10⁻⁸, 10⁻⁶, 10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, 10⁻³, 10⁻²}. Additionally, we also used 5-fold cross-validation instead of the BIC to select the final model. Because cross-validation does not require zeroed parameters to improve, no threshold parameter s is needed (s was introduced because the BIC only improves if some parameters are set to exactly zero).
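The role of the threshold s can be sketched as follows: when counting the parameters entering the BIC, approximate estimates whose absolute values fall below s are treated as zero. This is an illustrative Python sketch using the standard BIC formula (-2 log L + log(N) · t); the internal computations of lessSEM or regsem may differ in detail.

```python
import numpy as np

def count_parameters(theta, s):
    """Treat estimates with |theta_j| <= s as zero when counting parameters."""
    return int(np.sum(np.abs(np.asarray(theta)) > s))

def bic(minus2_log_lik, theta, s, n):
    """BIC = -2 log L + log(N) * t, where t depends on the threshold s."""
    return minus2_log_lik + np.log(n) * count_parameters(theta, s)

# Approximate solutions are rarely exactly zero. With s too small, shrunken
# parameters still count, so the BIC penalty term never decreases.
theta_hat = [1.0, 0.8, 3e-4, 1e-6]
t_tiny = count_parameters(theta_hat, s=1e-8)  # counts all four parameters
t_good = count_parameters(theta_hat, s=5e-4)  # drops the two near-zero ones
```

This is exactly why a very small s tends to favor the unregularized maximum likelihood solution: no parameter is ever counted as removed, so sparser models gain nothing in the BIC.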
The selected models were again compared to the exact solution provided by the specialized optimizer using the SSD defined in Equation (9):

SSD_r,ε,s = ‖θ̂ᵃ_r,ε,s − θ̂ᵉ_r‖². (9)

Here, θ̂ᵃ_r,ε,s and θ̂ᵉ_r are the final parameter estimates for the general purpose and the specialized optimizer, respectively. In case of the general purpose optimizer, the final parameter estimates depend on both the smoothing parameter ε and the threshold parameter s. Again, a larger SSD indicates larger differences between the approximate optimizer and the specialized optimizer.
Because model sparsity is often a motivation for using lasso regularization, we also investigated the differences in the number of parameters t remaining in the model:

Δt_r,ε,s = Σ_{j=1}^{J} 1(θ̂ᵃ_r,ε,s,j ≠ 0) − Σ_{j=1}^{J} 1(θ̂ᵉ_r,j ≠ 0). (10)

Here, J is the length of the parameter vector θ, 1(·) is an indicator function that returns one if the embraced condition is met and zero otherwise, and θ̂ᵃ_r,ε,s,j is the j-th element of the vector θ̂ᵃ_r,ε,s. A Δt_r,ε,s of one, for instance, indicates that the approximate solution of the general purpose optimizer has one parameter more than the exact solution of the specialized optimizer. That is, researchers may come to different conclusions depending on which of the two optimization procedures they used. The difference Δt_r,ε,s was not computed in case of cross-validation because cross-validation does not use s and therefore none of the parameters will be set to zero in case of general purpose optimization.
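The difference in the number of remaining parameters follows directly from the definition above. A small Python sketch (the function name is ours); the approximate estimates are zeroed via the threshold s, while the exact estimates from the specialized optimizer are assumed to contain true zeros already:

```python
import numpy as np

def delta_t(theta_approx, theta_exact, s):
    """Difference in the number of non-zero parameters (approximate - exact)."""
    t_approx = int(np.sum(np.abs(np.asarray(theta_approx)) > s))
    t_exact = int(np.sum(np.asarray(theta_exact) != 0))
    return t_approx - t_exact

# The shrunken-but-not-zeroed estimates 0.02 and 0.003 survive the threshold,
# so the approximate model appears to have two parameters more.
d1 = delta_t([0.5, 0.02, 0.003], [0.5, 0.0, 0.0], s=1e-3)
# With estimates clearly below s, both solutions agree on the parameter count.
d2 = delta_t([0.5, 1e-9, 0.0], [0.5, 0.0, 0.0], s=1e-3)
```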
Figure 11 shows the SSD of the final parameter estimates of the approximate solutions provided by the general purpose optimizer compared to the exact solutions provided by the specialized optimizer across all 500 simulation runs when using the BIC to select the final models. As expected, s clearly affected the model selection. If s was very small, the selected approximate solution was often the maximum likelihood estimate θ̂_ML. This is because the regularized estimates rarely fell below s, and therefore no improvement in the BIC was observed. Moreover, s also interacted with ε in the smoothed penalty functions such that, for instance, s = 5 × 10⁻⁴ worked well when ε = 10⁻⁸, but not when ε = 10⁻⁴. Smaller ε values typically resulted in regularized estimates which are closer to zero and thus more likely to fall below a small threshold s (this follows from Figure 10, where larger ε resulted in slower convergence to zero). Ignoring the non-differentiability performed worse than smoothing the penalty function with a small ε.
Figure 12 shows the SSD when using 5-fold cross-validation to select the final model. The model selection was considerably more consistent between general purpose and specialized optimization. However, none of the parameters is exactly zero in case of general purpose optimization, which may impact the interpretation of the model (e.g., Orzek & Voelkle, 2023).
Figure 13 shows the difference in the number of parameters remaining in the model (Δt_r,ε,s) when using the BIC to select the final model. Importantly, general purpose optimization could result in a very different number of parameters, even in the best settings tested here. As non-zeroed parameter estimates are considered to be important (e.g., Jacobucci et al., 2019), this may also affect the conclusions drawn from the final model.
Overall, our results show that relatively small differences in the two additional tuning parameters, ε and s, can alter the results substantially. For us, the only way to tell which combination works best was to implement both a general purpose optimizer and a specialized optimizer. For the BFGS procedure used here, the combination of ε = 10⁻⁸ and s = 5 × 10⁻⁴ performed fairly well.

Discussion
The objective of this work was to systematically test the differences between general purpose optimizers and specialized optimizers as currently used in regularized SEM in order to shed light on differences observed between regsem and lslx. To this end, we compared both the fit and the parameter estimates of regularized SEM with two similar optimization strategies: a quasi-Newton optimizer for smooth fitting functions (BFGS) and a quasi-Newton optimizer for non-smooth fitting functions (GLMNET). Ignoring model selection, we found that (1) differences in the fit and the parameter estimates between both optimization procedures are often negligible. However, (2) when combined with information criteria, as currently used in regsem, even small differences can result in unstable model selection. Only if both tuning parameters of the general purpose optimizer (the smoothing parameter ε and the threshold parameter s) are well chosen can we expect similar results.
Our results are consistent with Orzek and Voelkle (2023), who found that the threshold parameter s is critical in regularized continuous-time SEM. Moreover, we revealed an interaction of ε and s: Specific combinations of ε and s may exist for which the differences between approximate and exact solutions are small. In practice, however, these settings are unknown and, as our simulation demonstrated, may fail for any given data set. Importantly, the combination that worked best for the general purpose optimizer used in our simulations may not generalize to, for instance, regsem, where SEMs are implemented differently and other optimizers are used. Furthermore, our simulations showed that ignoring the non-differentiability can result in worse parameter estimates than a smoothly approximated penalty. Overall, this suggests that researchers using approximate solutions should be cautious when selecting models by means of information criteria.
On the other hand, our results showed that cross-validation is a viable alternative to using information criteria with approximate solutions. Importantly, no threshold parameter s is required when using cross-validation for model selection. However, cross-validation is time consuming and may therefore be impractical for some models.
A promising alternative is to adapt the computation of the number of parameters t in the BIC (e.g., Geminiani et al., 2021; Muggeo, 2010; Ulbricht, 2010). The approximation strategy proposed by Geminiani et al. (2021) and based on Ulbricht (2010), for instance, uses the generalized information criterion (Konishi & Kitagawa, 1996). This approach does not require the threshold parameter s and is implemented in the R package penfa (Geminiani et al., 2021) for confirmatory factor analyses. However, to the best of our knowledge, this approximation strategy has not yet been extended to SEM. Additionally, the specific approximations proposed by Geminiani et al. (2021) are fairly involved, reducing the simplicity advantage of implementing approximate solutions. On the other hand, Geminiani et al. (2021) also proposed confidence intervals and an automatic tuning parameter selection for their approximation. Extending this approach to SEM therefore appears promising. Alternatively, procedures may be developed that allow for a more sophisticated threshold parameter s (see e.g., Cao et al., 2017, for such a procedure in group bridge penalized regression). Finally, for a different smooth approximation, Muggeo (2010) found that their smoothing parameter c can have a considerable effect on the model selection even if an adapted computation for the number of parameters t is used. This should be considered when implementing such approximations.
On a final note, our findings are most likely not limited to regularized SEM: Other regularized models may also be fitted with either general purpose or specialized optimizers. However, the situation in regularized SEM is special in that one of the most popular implementations of regularized SEM to date relies by default on the combination of a general purpose optimizer for estimation and the BIC for model selection. Other packages for regularized estimation use specialized optimizers instead (e.g., lslx, Huang, 2020a; glmnet, Friedman et al., 2010; ncvreg, Breheny & Huang, 2011; or LIBLINEAR, Fan et al., 2008). Thus, while our findings may generalize to other models, they are of central importance to the regularization of SEM.
Our study comes with several limitations. First, we only looked at the lasso penalty. One may therefore wonder if the results generalize to the wealth of other penalty functions proposed in the literature, such as the adaptive lasso (Zou, 2006), elastic net (Zou & Hastie, 2005), scad (Fan & Li, 2001), lsp (Candès et al., 2008), cappedL1 (Zhang, 2010), and mcp (Zhang, 2010). Importantly, all of the penalties mentioned above use the absolute value function to obtain sparse solutions. Replacing this absolute value function with a smooth approximation, as done for the lasso penalty in this study, results in non-sparse solutions and requires the threshold parameter s. Still, more in-depth analyses for these penalty functions may be worthwhile given that they also come with additional tuning parameters besides λ. Second, we did not compare the final parameter estimates of the general purpose optimizer and the specialized optimizer to the true parameters used in the simulation. The reason for this is that the objective of the approximation used in the general purpose optimizer is to return parameter estimates which are as close as possible to the exact solution of the specialized optimizer. Outperforming exact solutions is not the reason for the existence of the approximation. And even if approximate solutions were to outperform exact solutions for some specific settings of ε and s, this may just reflect that approximate solutions have more tuning parameters. Making use of these tuning parameters would require optimizing models for an entire tuning grid (combinations of all λ, ε, and s values), resulting in a considerable increase in run time. This is especially true because s cannot be selected with information criteria. Third, we restricted our analyses to models fitted with lessSEM using two optimizers only (BFGS and GLMNET). These optimizers were selected because of their popularity and their similarity, with both employing quasi-Newton procedures. However, it remains unclear if the results generalize to other optimizers as well.
All in all, we conclude that specialized optimizers should be preferred over general purpose optimizers in regularized SEM when using the current implementations of information criteria to select the final parameter estimates. These specialized optimizers are the default in lslx and lessSEM, and optionally available in regsem. When developing new regularized SEMs (e.g., models estimated with the Kalman filter), however, our results leave researchers between a rock and a hard place. General purpose optimizers are widely available and therefore easier to implement, but they may result in questionable model selection if standard information criteria are used. Implementing specialized optimizers, on the other hand, is time consuming and error prone. Here, lessSEM can provide a starting point because all optimizers implemented therein can be accessed from both R and C++ (see Appendix C for an example). In addition to the lasso penalty discussed above, researchers using lessSEM also have access to optimizers for all other penalty functions mentioned in the introduction (namely ridge, adaptive lasso, elastic net, cappedL1, lsp, scad, and mcp). Combined with the additional features of lessSEM (full information maximum likelihood estimation, multi-group models, definition variables, parameter transformations, and mixtures of penalties), this allows for the rapid development of new regularized models.

A.1. A Short Introduction to Quasi-Newton Optimization
Assume that F_reg,λ(θ) is twice continuously differentiable and that in iteration k we are at the point θ_k. Points of F_reg,λ(θ) close to θ_k are approximated using a truncated Taylor series (see e.g., Nocedal & Wright, 2006):

F_reg,λ(θ) ≈ F_reg,λ(θ_k) + ∇F_reg,λ(θ_k)ᵀ(θ − θ_k) + ½ (θ − θ_k)ᵀ H (θ − θ_k). (A.1)

Minimizing this approximation yields the Newton direction d_k = −H⁻¹ ∇F_reg,λ(θ_k); in this case, the step of length s_k is typically set to 1. Figure 2 shows such Newton steps, with the blue lines depicting the approximations using Equation (A.1) at the points θ_1, θ_2, θ_3, and θ_4. However, computing the true Hessian is often prohibitively expensive and one may resort to an approximation thereof. In lessSEM, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approximation (see e.g., Nocedal & Wright, 2006), resulting in a quasi-Newton procedure. When approximating the Hessian, s_k = 1 may be a poor choice and the best step length is therefore found in a second optimization (e.g., Nocedal & Wright, 2006). To increase the similarity between the quasi-Newton optimizer outlined here and the GLMNET optimizer introduced in the next section, we used the same procedure to find the step size s_k: We decreased s until the criterion shown in Equation (A.2) was met, where r = 0.1:

F_reg,λ(θ_k + s d_k) ≤ F_reg,λ(θ_k) + r · s · ∇F_reg,λ(θ_k)ᵀ d_k. (A.2)

This criterion is a simplification of the criterion proposed by Yuan et al. (2012) for GLMNET, where we take the differentiability of the approximated penalty into account. In fact, Equation (A.2) is identical to the Armijo condition used by many optimizers (e.g., Nocedal & Wright, 2006, p. 33).
Note that Equation (A.2) enforces a decrease in F_reg,λ(θ_k) from one iteration to the next. This follows from

∇F_reg,λ(θ_k)ᵀ d_k = −∇F_reg,λ(θ_k)ᵀ H⁻¹ ∇F_reg,λ(θ_k) < 0.

The matrix H is a positive definite (approximation of the) Hessian matrix. The inequality follows from (1) H being positive definite and therefore H⁻¹ being positive definite and (2) the definition of a positive definite matrix as a matrix M where xᵀMx > 0 for any non-zero vector x (e.g., Petersen & Pedersen, 2012, pp. 50-51).
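The quasi-Newton step with the backtracking line search described above can be sketched on a simple quadratic test function. This is an illustrative Python sketch, not lessSEM's implementation; the identity matrix serves as a deliberately crude Hessian approximation, which is exactly the situation in which s_k = 1 is a poor choice and backtracking is needed.

```python
import numpy as np

def armijo_step(f, grad, theta, direction, rho=0.1, shrink=0.5, max_iter=50):
    """Shrink s until f(theta + s*d) <= f(theta) + rho * s * grad(theta)' d."""
    s = 1.0
    f0 = f(theta)
    slope = grad(theta) @ direction  # negative for a descent direction
    for _ in range(max_iter):
        if f(theta + s * direction) <= f0 + rho * s * slope:
            return s
        s *= shrink
    return s

# One quasi-Newton step on f(x) = x1^2 + 10*x2^2.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
theta = np.array([1.0, 1.0])
H_inv = np.eye(2)                 # crude Hessian approximation (identity)
d = -H_inv @ grad(theta)          # quasi-Newton direction
s = armijo_step(f, grad, theta, d)
theta_new = theta + s * d         # guaranteed to decrease f
```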

B.1. A Short Introduction to the GLMNET Optimizer
We will restrict the presentation of the GLMNET optimizer to a short primer. A thorough derivation is provided in the original publications (see Friedman et al., 2010; Yuan et al., 2012). We also recommend reading Appendix A first, as the quasi-Newton optimizer presented therein provides more background on the general ideas used in the following. The GLMNET optimizer is also implemented in lslx (Huang, 2020a) and regCtsem (Orzek & Voelkle, 2023).
GLMNET is based on the assumption that the maximum likelihood fitting function F_ML(θ) in Equation (1) is at least twice differentiable. Then, F_reg,λ(θ) can be approximated with

F_reg,λ(θ) ≈ F_ML(θ_k) + ∇F_ML(θ_k)ᵀ(θ − θ_k) + ½ (θ − θ_k)ᵀ H (θ − θ_k) + p(θ). (B.1)

Note that the penalty function p(θ) is not approximated in Equation (B.1). From iteration k to iteration k + 1, our objective is to make Equation (B.1) smaller; or formally, we search for a direction vector d = θ − θ_k which minimizes

q(d) = ∇F_ML(θ_k)ᵀ d + ½ dᵀ H d + p(θ_k + d). (B.2)

The new problem, minimizing q(d), is then solved with a coordinate descent procedure (Yuan et al., 2012). To this end, a second optimization is used, where single elements of θ are updated iteratively. Broadly speaking, the procedure is similar to the simplified example developed in the main text. However, the details are beyond the scope of the current article and the interested reader is referred to Friedman et al. (2010) and Yuan et al. (2012). In lessSEM, we additionally replace the Hessian matrix with the BFGS approximation.
After finding the step direction d_k with the coordinate descent procedure, the step length s_k is found by decreasing s until the criterion in Equation (B.3) is met, where r = 0.1 (this criterion was proposed by Yuan et al., 2012 and is also used by Huang, 2020a):

F_reg,λ(θ_k + s d_k) ≤ F_reg,λ(θ_k) + r · s · (∇F_ML(θ_k)ᵀ d_k + p(θ_k + d_k) − p(θ_k)). (B.3)

For our purpose, the most important message is that the quasi-Newton optimizer presented in Appendix A and the GLMNET optimizer discussed here are similar in that both rely on truncated second-order Taylor series approximations to F_reg,λ(θ_k). Both use the BFGS approximation for the Hessian matrix and rely on a similar procedure to find the step size s_k.
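The coordinate descent idea at the heart of GLMNET can be illustrated on a simpler problem, lasso-penalized regression, where each single-coordinate update has a closed form: soft-thresholding. This sketch is ours and omits the Taylor approximation and BFGS details of the actual optimizer; it demonstrates the key property that exact zeros are produced, which smoothed general purpose optimization cannot do.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Proximal operator of |.|: shrinks z toward zero; exactly zero if |z| <= gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5/N * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]     # partial residual excluding b_j
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam) / col_sq[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.5, 0.0, 0.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)
b_hat = lasso_coordinate_descent(X, y, lam=0.2)
# The truly-zero coefficients are set to exactly 0.0; the non-zero ones are shrunken.
```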
(the results from regsem differ between s values because mult.start = TRUE uses random starting values). Both packages resulted in nearly identical fit values. Figure D3 shows the BIC and the number of non-zero parameters for regsem and lessSEM. Despite the similarity in fit, both packages select different final parameter estimates.

Figure 1. Parameter estimates for four different penalty functions (ridge, lasso, elastic net, and scad) for increasing values of λ. The dots represent the selected points of λ at which parameters were estimated.

Figure 2. Steps taken by an optimizer. The red line is the function we want to minimize. The dots indicate the steps taken during the optimization and the blue lines are the approximations generated at these points (see Appendix A). Each point θ_k is found by minimizing the approximation at the previous point θ_{k−1}.
Figure 3.

Figure 4. Penalty function values for a single parameter θ. Both ridge and smoothed lasso are differentiable.

Figure 5. BIC for increasing λ values based on a single data set from the simulation study (see below). Dots indicate the discrete λ values for which models were fitted.

Figure 6. CFA for the bfi data set. Dashed blue paths are regularized cross-loadings.

Figure 7. BIC and number of parameters for the N = 500 subset of the bfi data set when using approximate solutions (with the general purpose optimizer BFGS). The blue line is the exact solution (with the specialized optimizer GLMNET). Diamonds represent the respective selected models.

Figure 9. Quantile plots showing the sum of squared differences between exact solutions from the specialized optimizer and approximate solutions from the general purpose optimizer. Quantiles were computed for each λ separately. The black line shows the median. Values outside of the 95% quantile are not shown. A figure with all data points is included in the osf repository at https://osf.io/kh9tr/.

Figure 10. Regularized parameter estimates of the first repetition for exact solutions (using the specialized optimizer GLMNET) and approximate solutions (using the general purpose optimizer BFGS) for different levels of the smoothing parameter ε.

Figure 11. Sum of squared differences between exact solutions (using the specialized optimizer GLMNET) and approximate solutions (using the general purpose optimizer BFGS) in the final parameter estimates when using the BIC for model selection. Each dot represents one of the 500 simulation runs. Red dots indicate that the approximate solution selected the maximum likelihood estimates (i.e., λ = 0).
Figure 12. Sum of squared differences between exact solutions (using GLMNET) and approximate solutions (using BFGS) in the final parameter estimates when using 5-fold cross-validation for model selection.
Figure 13. Difference in the number of non-zero parameter estimates between approximate and exact solutions when using the BIC for model selection. A value of 3, for instance, indicates that the approximate solution (using BFGS) has three more parameters than the exact solution (using GLMNET). Each dot represents one of the 500 simulation runs. Red dots indicate that the approximate solution selected the maximum likelihood estimates (i.e., λ = 0).

Figure D2. Relative fit of regsem as compared to lessSEM with the GLMNET optimizer. A value of 1.01, for example, indicates a 1% larger fitting function value for regsem. The gaps represent non-convergent results based on (1) non-zero convergence codes, (2) negative variances, or (3) invalid correlations between latent variables in regsem.

Figure D3. BIC and number of parameters for the bfi data set when using the regsem package. The blue line is the exact solution returned by lessSEM using the GLMNET optimizer. All other lines are the solutions returned by regsem using different levels of s. Diamonds represent the respective selected models. The gaps in the BIC and the number of parameters for regsem represent non-convergent results based on (1) non-zero convergence codes, (2) negative variances, or (3) invalid correlations between latent variables.
Figure 8. Quantile plots showing the relative fit. The quantiles were computed for each λ value separately. Values above 1 indicate better fit of the exact solution (using the specialized optimizer GLMNET); values below 1 indicate better fit of the approximate solutions (using the general purpose optimizer BFGS). The black line shows the median. Values outside of the 95% quantile are not shown. A figure with all data points is included in the osf repository at https://osf.io/kh9tr/.