Computational issues in parameter estimation for hidden Markov models with template model builder

A popular way to estimate the parameters of a hidden Markov model (HMM) is direct numerical maximization (DNM) of the (log-)likelihood function. The advantages of employing the TMB [Kristensen K, Nielsen A, Berg C, et al. TMB: automatic differentiation and Laplace approximation. J Stat Softw Articles. 2016;70(5):1–21] framework in R for this purpose were illustrated recently [Bacri T, Berentsen GD, Bulla J, et al. A gentle tutorial on accelerated parameter and confidence interval estimation for hidden Markov models using template model builder. Biom J. 2022 Oct;64(7):1260–1288]. In this paper, we present extensions of these results in two directions. First, we present a practical way to obtain uncertainty estimates in the form of confidence intervals (CIs) for the so-called smoothing probabilities at moderate computational and programming effort via TMB. Our approach thus permits avoiding computer-intensive bootstrap methods. By means of several examples, we illustrate patterns present in the derived CIs. Secondly, we investigate the performance of popular optimizers available in R when estimating HMMs via DNM. Our focus lies on the potential benefits of employing TMB. The criteria investigated in a number of simulation studies are convergence speed, accuracy, and the impact of (poor) initial values. Our findings suggest that all optimizers considered benefit in terms of speed from using the gradient supplied by TMB. When supplying both gradient and Hessian from TMB, the number of iterations reduces, suggesting a more efficient convergence to the maximum of the log-likelihood. Lastly, we briefly point out potential advantages of a hybrid approach.


Introduction
Hidden Markov models (HMMs) are a well-studied and popular class of models in many areas. While they have historically been used for speech recognition (see, e.g., Juang and Rabiner, 1991; Baum and Petrie, 1966; Rabiner, 1989; Rabiner and Juang, 1986; Fredkin and Rice, 1992), these models also became important in many other fields due to their flexibility. These include, to name only a few, biology and bioinformatics (Schadt et al., 1998; Durbin, 1998; Eddy, 1998), finance (Hamilton, 1989), ecology (McClintock et al., 2020), stochastic weather modeling (Lystig and Hughes, 2002; Ailliot et al., 2015), and engineering (Mor et al., 2021). In short, HMMs are a class of models where the given data are assumed to follow varying distributions according to an underlying unknown Markov chain. Parameter estimation for HMMs is typically achieved either by the Baum-Welch (BW) algorithm (Baum et al., 1970; Dempster et al., 1977; Rabiner, 1989; Liporace, 1982; Wu, 1983), an Expectation-Maximization (EM) type algorithm, or in a quite straightforward fashion by direct numerical maximization (DNM) of the likelihood. We consider two classes of models. First, the m-state Poisson HMM, i.e. with the conditional distribution

p_i(x) = P(X_t = x | C_t = i) = e^{−λ_i} λ_i^x / x!

with parameters λ_i, i = 1, ..., m. Secondly, we consider Gaussian HMMs with conditional distribution specified by

p_i(x) = (1 / (σ_i √(2π))) exp(−(x − µ_i)² / (2σ_i²))

with parameters (µ_i, σ_i), i = 1, ..., m. In addition, Γ = {γ_ij}, i, j = 1, ..., m denotes the transition probability matrix (TPM) of the HMM's underlying Markov chain, and δ is a vector of length m collecting the corresponding stationary distribution. We assume that the Markov chains underlying our HMMs are irreducible and aperiodic. This ensures the existence and uniqueness of a stationary distribution as the limiting distribution (Feller, 1968). However, these results are of limited relevance for most estimation algorithms, because the elements of Γ are generally strictly positive. Nevertheless, one should be careful when manually fixing selected elements of Γ to zero. The basis for our estimation procedure is the (log-)likelihood function. We denote the "history" of observations x_t and the observed process X_t up to time t by x^(t) = {x_1, ..., x_t} and X^(t) = {X_1, ..., X_t}, respectively. In addition, let θ be the vector of model parameters. As explained, e.g., by Zucchini et al. (2016, p. 37), the likelihood function can then be represented as a product of matrices:

L(θ) = P(X^(T) = x^(T)) = δ P(x_1) Γ P(x_2) Γ P(x_3) ... Γ P(x_T) 1′,   (1)

where the m conditional probability density functions evaluated at x (we use this term for discrete support as well) are collected in the diagonal matrix P(x) = diag(p_1(x), ..., p_m(x)), and 1′ and δ denote a column vector of ones and the stationary distribution, respectively. That is, we assume that the initial distribution corresponds to δ. The likelihood function given in Equation 1 can be efficiently evaluated by a so-called forward pass through the observations, as illustrated in Appendix A. Consequently, it is possible to obtain θ̂, the ML estimate of θ, by (in our case) unconstrained DNM. Since several of our parameters are constrained, we rely on known re-parametrization procedures (see Appendix B for details).
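As a concrete illustration of Equation 1 and the forward pass mentioned above, the following R sketch evaluates the negative log-likelihood of a Poisson HMM with per-step scaling to avoid numerical underflow. Function and variable names are ours for illustration and do not correspond to the paper's supporting code.

```r
# Negative log-likelihood of an m-state Poisson HMM via a scaled forward pass.
# delta: initial (stationary) distribution, Gamma: TPM, lambda: Poisson means.
pois_hmm_nll <- function(x, delta, Gamma, lambda) {
  phi <- delta * dpois(x[1], lambda)    # corresponds to delta %*% P(x_1)
  ll <- 0
  for (t in seq_along(x)) {
    if (t > 1) phi <- as.vector(phi %*% Gamma) * dpois(x[t], lambda)
    s <- sum(phi)                       # scaling constant
    ll <- ll + log(s)
    phi <- phi / s                      # rescale to prevent underflow
  }
  -ll                                   # negative log-likelihood for DNM
}
```

With Γ equal to the identity matrix, the likelihood reduces to a mixture of the state-wise product likelihoods, which provides a simple sanity check of such an implementation.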
Confidence intervals (CIs) for the estimated parameters of HMMs can be derived via various approaches. The most common ones are Wald-type, profile likelihood, and bootstrap-based CIs. Bacri et al. (2022) show that TMB can yield valid Wald-type CIs in a fraction of the time required by classical bootstrap-based methods as investigated, e.g., by Bulla and Berzel (2008) and Zucchini et al. (2016). Furthermore, the likelihood profile method may fail to provide CIs (Bacri et al., 2022). Therefore, we rely on Wald-type confidence intervals derived via TMB. For details, see Appendix C.

Uncertainty of smoothing probabilities
When interpreting the estimation results of an HMM, the so-called smoothing probabilities often play an important role. These quantities, denoted by p_it(θ) in the following, correspond to the probability of being in state i at time t given the observations, i.e.,

p_it(θ) = P_θ(C_t = i | X^(T) = x^(T)) = α_t(i) β_t(i) / L(θ),

and can be calculated for i = 1, ..., m and t = 1, ..., T (see, e.g., Zucchini et al., 2016, p. 87).
The involved quantities α_t(i) and β_t(i), which also depend on θ, result directly from the forward and backward algorithm, respectively, as illustrated in Appendix A. Estimates of p_it(θ) are obtained by p_it(θ̂), where θ̂ is the ML estimate of θ.
An important feature of TMB is that it not only permits obtaining standard errors for θ̂ (and CIs for θ), but in principle also for any other quantity depending on the parameters. This is achieved by combining the delta method with automatic differentiation (see Kristensen et al., 2016, for details). A minor requirement for deriving standard deviations for p_it(θ̂) via the ADREPORT function provided by TMB consists in implementing the function p_it(θ) in C++. Then, this part has to be integrated into the scripts necessary for the parameter estimation procedure. It is noteworthy that, once implemented, the procedures related to inference of the smoothing probabilities do not need to change when the HMM considered changes, because the input vector for p_it(θ) remains the same. The C++ code in the supporting information illustrates how to complete these tasks. Note that the population value we are interested in constructing a CI for is p_it(θ), which should not be confused with a probability statement about a random sequence X^(T). That is, we treat p_it(θ) as the probability of the event C_t = i conditional on X^(T) being equal to the particular sequence of observations x^(T). After all, it is this sequence and the corresponding smoothing probabilities which are of interest to the researcher.
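To make these quantities concrete, the following R sketch computes the matrix of smoothing probabilities p_it(θ̂) from scaled forward and backward passes. This is an illustrative re-implementation under our own naming; the paper's actual implementation is the C++/TMB code in the supporting information.

```r
# Smoothing probabilities p_it = alpha_t(i) * beta_t(i) / L(theta)
# for an m-state Poisson HMM (assumes T >= 2 observations).
smoothing_probs <- function(x, delta, Gamma, lambda) {
  T_len <- length(x); m <- length(lambda)
  P <- sapply(lambda, function(l) dpois(x, l))   # T x m matrix of p_i(x_t)
  la <- matrix(0, T_len, m)                      # log forward probabilities
  lb <- matrix(0, T_len, m)                      # log backward probabilities
  # scaled forward pass
  phi <- delta * P[1, ]; s <- sum(phi); phi <- phi / s; lscale <- log(s)
  la[1, ] <- log(phi) + lscale
  for (t in 2:T_len) {
    phi <- as.vector(phi %*% Gamma) * P[t, ]
    s <- sum(phi); phi <- phi / s; lscale <- lscale + log(s)
    la[t, ] <- log(phi) + lscale
  }
  llk <- lscale                                  # log-likelihood log L(theta)
  # scaled backward pass, starting from beta_T(i) = 1
  psi <- rep(1, m); lsc <- 0
  for (t in (T_len - 1):1) {
    psi <- as.vector(Gamma %*% (P[t + 1, ] * psi))
    s <- sum(psi); psi <- psi / s; lsc <- lsc + log(s)
    lb[t, ] <- log(psi) + lsc
  }
  exp(la + lb - llk)                             # T x m matrix of p_it
}
```

Each row of the returned matrix sums to one, since the α_t(i)β_t(i) sum to L(θ) for every t.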
Quantifying the uncertainty in the estimated smoothing probabilities can be of interest in multiple fields because the underlying states are often linked to certain events of interest (e.g., the state of an economy or the behavior of a customer). In the following, we illustrate the results of our approach through a couple of examples.

Track Your Tinnitus
The "Track Your Tinnitus" (TYT) mobile application gathered a large data set, described in detail in Pryss et al. (2015b) and Pryss et al. (2015a). The data plotted in Figure 1 show the so-called "arousal" variable reported by an individual over 87 consecutive days. This variable takes high values when a high level of excitement is reached and low values when the individual is in a calm emotional state. The values are measured on a discrete scale; we refer to Probst et al. (2016, 2017) for details.
We estimated Poisson HMMs with a varying number of states by nlminb, with TMB's gradient and Hessian functions passed as arguments (this approach was chosen for all estimated models in our examples). The preferred HMM in terms of AIC and BIC is a two-state model; see Table 4 in Appendix D.
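The estimation call used for all our examples can be sketched as follows. Here, obj stands for the object returned by TMB's MakeADFun for the negative log-likelihood of the model; the DLL name and data layout are placeholders, not the paper's actual file names.

```r
library(TMB)
# compile("poi_hmm.cpp"); dyn.load(dynlib("poi_hmm"))             # model template
# obj <- MakeADFun(data = list(x = x), parameters = init, DLL = "poi_hmm")
fit <- nlminb(start = obj$par, objective = obj$fn,
              gradient = obj$gr, hessian = obj$he,    # TMB's exact derivatives
              control = list(iter.max = 1e4))
rep <- sdreport(obj)   # standard errors, incl. ADREPORTed smoothing probabilities
```

The standard errors returned by sdreport for the ADREPORTed quantities are what we use to construct the Wald-type CIs below.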
Figure 2 displays the corresponding smoothing probabilities with 95% Wald-type CIs, constructed using the standard errors provided by TMB. Intuitively, one might expect the uncertainty to be low when a smoothing probability takes values close to zero or one, whereas higher uncertainty should be inherent to smoothing probabilities further away from these bounds. The CIs illustrated in Figure 2 follow this pattern. However, as pointed out by Bacri et al. (2022), this data set is atypically short for fitting HMMs. As we will see in the following, different patterns emerge for longer sequences of observations.

Soap
In the following, we consider a data set of weekly sales of a soap in a supermarket.

Weekly returns
The data set considered in this section consists of 2230 weekly log-returns based on the adjusted closing price of the S&P 500 stock market index retrieved from Yahoo Finance, between January 1st, 1980 and September 30th, 2022. The returns are expressed in percent to facilitate reading and interpreting the estimates. As shown, e.g., by Rydén et al. (1998), Gaussian HMMs reproduce certain stylized facts of financial returns well. We thus estimated such models with a varying number of states. A three-state model is preferred by the BIC, whereas the AIC is almost identical for three and four states (see Table 4).
The estimates show a decrease in conditional standard deviation with increasing conditional mean. This is a well-known property of many financial returns (see, e.g., Schwert, 1989; Maheu and McCurdy, 2000; Guidolin and Timmermann, 2005), related to the behavior of market participants in crisis and calm periods. Figure 5 shows the first 200 data points plotted along with the conditional means from the preferred model. The subsequent Figure 6 shows the inferred smoothing probabilities together with their CIs. While in the previous applications smoothing probabilities close to the boundary seemed to be linked to lower uncertainty, this data set also shows some clear exceptions to this pattern. For example, the smoothing probabilities of State 3 inferred at t = 30 and t = 83 take the relatively close values 0.51 and 0.53, respectively. However, the associated uncertainty at t = 30 is visibly higher than the corresponding uncertainty at t = 83 (the estimated standard errors are 0.205 and 0.045). This may be explained by the fact that the inferred states closely around t = 30 oscillate between States 2 and 3, indicating a high uncertainty of the state classification during this period. On the contrary, around t = 83 a quick and persistent change from State 2 to 3 takes place. Further details on the model estimates are available in Table 4.

Hospital
The basis for our last example is a data set issued by the hospital "Hôpital Lariboisière" of Assistance Publique - Hôpitaux de Paris (a French hospital trust). This data set consists of the hourly number of patient arrivals at the emergency ward during a period of roughly 10 years. A subset of the data (over one week) is displayed in Figure 7. We examine this data set for several reasons. First, with 87648 observations, it is much larger than the ones examined previously. Secondly, the medical staff noticed differences between the rates of patient arrivals at day and night, respectively, which motivates the use of, e.g., an HMM. Last, a Poisson HMM is a natural candidate for the observed count-type data.
The preferred model by both AIC and BIC has five states (see Table 5 in Appendix D) and confirms the impression of the hospital employees: State 5 is mainly visited during core operating hours in the daytime. The fourth and third states mainly occur in the late afternoon, early evening, and early morning. Last, States 1 and 2 correspond to nighttime observations. Even for a data set of this size, the smoothing probabilities and corresponding CIs can be derived with moderate computational effort and time: Figure 8 displays the results (for readability, only the first 200 data points are plotted). Overall, the inferred CIs are relatively small, in particular in comparison with those obtained for the stock return data discussed in the previous example. The low uncertainty in the state classification most likely results from the clear, close to periodic transition patterns.

Performance and accuracy of different optimizers
This section compares the speed and accuracy of different optimizers in R when estimating HMMs by DNM with TMB, by means of several simulation studies with different settings.
In the first setting, we simulate time series consisting of 87 observations from a two-state Poisson HMM fitted to the TYT data, as we want to examine the performance of the different optimizers on data sets with relatively few observations. In the second setting, we simulate time series of length 200 from a two-state Poisson HMM, and in the third setting, time series of length 200 from a two-state Gaussian HMM. The specific choice of parameters is described in more detail in the forthcoming sections.
The comparisons focusing on the computational speed and iterations of the different optimizers are based on 200 replications from the different models, while the studies focusing on the accuracy of the optimizers are based on 1000 replications. Hence, the Monte Carlo simulation setup in this paper closely resembles that of Bacri et al. (2022).

Computational setup
The results presented in this paper required installing and using the R package TMB and the software Rtools, where the latter is needed to compile C++ code. Scripts were coded in the interpreted programming language R (R Core Team, 2021) using its eponymous software environment RStudio. To make our results reproducible, the generation of random numbers is handled with the function set.seed under R version 3.6.0. A seed is set multiple times in our scripts, ensuring that smaller parts can be easily duplicated without executing lengthy prior code. A workstation with 4 Intel(R) Xeon(R) Gold 6134 processors (3.7 GHz) running the Linux distribution Ubuntu 18.04.6 LTS (Bionic Beaver) with 384 GB RAM was used to execute our scripts.

Selected optimizers and additional study design
There exists a large number of optimizers for use within the R ecosystem; see, e.g., Schwendinger and Borchers (2022). In a preliminary phase, multiple optimization routines were tested, and the ones failing to converge on a few selected test data sets from the designs mentioned above were excluded (details available from the authors upon request). Hence, the following optimizers were selected for the simulation studies:
• optim from the stats package to test the BFGS (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), L-BFGS-B (Byrd et al., 1995), CG (Fletcher, 2013), and Nelder-Mead (Nelder and Mead, 1965) algorithms.
• nlm (Dennis and Schnabel, 1996) and nlminb (Gay, 1990) from the same package to test popular non-linear unconstrained optimization algorithms.
• hjn from the optimr package (Nash and Varadhan, 2019) to test the so-called Hooke and Jeeves pattern search optimization (Hooke and Jeeves, 1961).
• ucminf from the ucminf package (Nielsen and Mortensen, 2022) to test a general-purpose optimization based on the Marquardt-Levenberg algorithm (Nielsen, 2000).
• BBoptim from the BB package (Varadhan and Gilbert, 2019) to test a spectral projected gradient method for large scale optimization with simple constraints (Varadhan and Gilbert, 2010).
By also providing the Hessian and/or gradient to the optimizers that support such inputs, 22 optimization procedures are examined for each study design. For the study of computational speed and the number of iterations, we applied the microbenchmark package (Mersmann, 2021), which allows us to obtain precise durations for each of the simulated data sets. Furthermore, HMM ML estimates were reordered by increasing Poisson rates or Gaussian means to avoid a random order of the states.
It is important to note that we discarded all samples for which the simulated state sequence did not sojourn at least once in each state. This avoids identifiability and convergence issues when trying to estimate an m-state HMM on a data set where only m − 1 states are visited. Further, samples where at least one optimizer failed to converge were also discarded to ensure comparability. For the same reason, we set the maximum number of iterations to 10^4 whenever possible. Note that some optimizers (BFGS, L-BFGS-B, CG, Nelder-Mead, hjn, newuoa) do not report a number of iterations and are therefore missing from our comparisons of iterations. Finally, we use the true parameter values as initial values. For the TYT data, initial values are calculated as medians from the fitted models resulting from all optimizers listed above.
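The first filter described above amounts to a one-line check on the simulated state path; a minimal sketch:

```r
# A simulated sample is kept only if its hidden state path
# visits each of the m states at least once.
visits_all_states <- function(states, m) length(unique(states)) == m

visits_all_states(c(1, 1, 2, 1), m = 2)   # TRUE
visits_all_states(c(1, 1, 1, 1), m = 2)   # FALSE
```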

Results
In this section, we report the results from the simulation studies focusing on the performance of the different optimizers in terms of computational speed and the lack or presence of unreasonable estimates. This section uses R scripts that may be of interest to users investigating their own HMM settings; they are available in the supporting information. We begin by reporting the speed of all optimization routines and thereafter the accuracy. The routine names contain the optimizer names and may be followed by "_gr" to indicate that the routine uses TMB's gradient function, "_he" to indicate the use of TMB's Hessian function, or "_grhe" to indicate the use of both functions provided by TMB. Note that some optimizers cannot make use of these functions and hence bear no extension at the end of their names. Figure 9 shows the time required by each optimization routine, measured over the 200 replications. The number of iterations required by each optimization routine is also reported where available. Times range from 1 to 1340 milliseconds. As shown by the large medians and wide CIs, CG and CG_gr require substantially more time to estimate an HMM compared to the other optimizers. Furthermore, optimizers that take a long time to estimate HMMs require a high number of iterations as well, as expected. Notably, optimization routines almost always take longer when the Hessian is passed. This likely results from a higher computational burden due to handling and evaluating Hessian matrices via TMB.
Looking at the details, BFGS_gr is among the fastest optimizers, which is partly explained by a relatively low number of iterations and by the fact that it does not make use of the Hessian. However, we note that the nlm and nlminb families of optimizers are slower when the gradient and Hessian are not passed as arguments. The optimizer BBoptim is comparatively slow, and nlminb_grhe fails to converge comparably often.
In addition to time durations, we investigate the accuracy of the optimizers, based on another 1000 replications from the two-state Poisson HMM in the first setting. The main conclusion from this simulation is that the medians and empirical quantiles calculated are in fact almost identical across all optimizers, and very close to the true parameter values used in the simulations. Results are reported in Figure 13 in Appendix E. These results imply that the different optimizers are equally accurate. Note, however, that the variation is quite high for some of the estimated parameters and for the negative log-likelihood (nll), which most likely results from the limited number of observations.

Simulation study: two-state Poisson HMM
In this second simulation setting, the parameters of our two-state Poisson HMM are given by λ = (1, 7) and Γ = (0.95, 0.05; 0.15, 0.85), written row by row, and the sample size is fixed at 200. Figure 10 illustrates our results, which are in line with those obtained in the first setting. More precisely, BBoptim still requires the highest number of iterations, and CG and CG_gr require more computational time than the others. Overall, most optimizers need more time compared to the previous setting as there is more data to process, but the time increase for nlm_he and hjn is much more substantial than for the other optimizers. Moreover, although the data set is larger, some optimizers (e.g. BBoptim and BFGS_gr) perform as fast as or faster than in the previous setting. Furthermore, passing TMB's Hessian to optimizers does not always slow down the optimization. This can be seen, e.g., from nlminb_he, which exhibits a lower median time than nlminb. A last notable pattern concerns the popular nlm optimizer. When only the Hessian but not the gradient is supplied by TMB, the computational time increases substantially compared to nlm_gr and nlm_grhe. Interestingly, the equally popular optimizer nlminb is not affected in the same way. Thus, one should be careful when only supplying the Hessian from TMB while an optimizer carries out the gradient approximation internally. In any case, the low number of iterations required by both nlminb_grhe and nlm_grhe indicates a preferable convergence behavior when supplying both quantities.
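Under the parameters of this setting, a data set can be simulated with a short R sketch like the following; the stationary distribution is obtained by solving δ(I − Γ + U) = 1′, with U a matrix of ones (names are illustrative).

```r
# Simulate T_len observations from a two-state Poisson HMM.
simulate_pois_hmm <- function(T_len, delta, Gamma, lambda) {
  m <- length(lambda)
  s <- integer(T_len)
  s[1] <- sample.int(m, 1, prob = delta)                      # initial state
  for (t in 2:T_len) s[t] <- sample.int(m, 1, prob = Gamma[s[t - 1], ])
  list(states = s, obs = rpois(T_len, lambda[s]))             # Poisson emissions
}

Gamma  <- matrix(c(0.95, 0.05, 0.15, 0.85), 2, 2, byrow = TRUE)
lambda <- c(1, 7)
delta  <- solve(t(diag(2) - Gamma + 1), rep(1, 2))  # stationary distribution
sim    <- simulate_pois_hmm(200, delta, Gamma, lambda)
```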
In terms of accuracy of the parameter estimates and obtained nlls, our results are highly similar to those described in the first setting (for details, see Figure 14 in Appendix E).

Simulation study: two-state Gaussian HMM
In our third and last simulation setting, the parameters of a two-state Gaussian HMM are given by Γ = (0.95, 0.05; 0.15, 0.85), µ = (−5, 5), σ = (1, 5), and the sample size is fixed at 200. In terms of computational time and the number of iterations, the results observed for this setup are similar to those for the previously studied Poisson HMMs (see Figure 11). A minor aspect is an increase in both iterations and computational time, which is not surprising given the higher number of parameters.
Regarding estimation accuracy of the parameter estimates and obtained nlls, our results are again highly similar to those obtained in the previous two settings (illustrated graphically by Figure 15 in Appendix E).

Robustness to initial value selection
In this section, we examine the impact of the initial parameter values, including poorly selected ones, on the convergence behavior and accuracy of different optimizers. We investigate one real and several simulated data sets. In order to "challenge" the optimizers, all data sets are of comparably small size.

Study design
We again consider three settings in the following. In the first, we investigate the TYT data set. In the second and third settings, we focus on simulated data from Poisson and Gaussian HMMs, respectively. More specifically, for the Poisson HMMs we generate one data set of sample size 200 for each of the HMMs considered. These HMMs are defined by all possible combinations of the following Poisson rates and TPMs (each TPM written row by row, with rows separated by semicolons):

λ ∈ {(1, 4), (5, 7)} and Γ ∈ {(0.9, 0.1; 0.2, 0.8), (0.7, 0.3; 0.8, 0.2), (0.1, 0.9; 0.8, 0.2), (0.55, 0.45; 0.45, 0.55)}.

This setting thus generates 2 × 2 × 4 = 16 data sets.
For each data set, we pass a large range of initial values to the same optimizers considered in the previous section. In this way, we investigate the resistance of each optimizer to potentially poor initial values when fitting HMMs. To generate sets of initial values, we consider grids of candidate values for the Poisson rates λ, Gaussian means µ, and Gaussian standard deviations σ (constructed from the quantities x_min, x_max, x̄, and M defined below), together with the TPMs

Γ ∈ {(0.1, 0.9; 0.9, 0.1), (0.1, 0.9; 0.8, 0.2), ..., (0.1, 0.9; 0.1, 0.9), (0.2, 0.8; 0.9, 0.1), (0.2, 0.8; 0.8, 0.2), ..., (0.2, 0.8; 0.1, 0.9), ..., (0.9, 0.1; 0.1, 0.9)},

i.e., all TPMs whose diagonal elements γ_11 and γ_22 each range over {0.1, 0.2, ..., 0.9}. Here, x_min = min(x_1, x_2, ..., x_T), x_max = max(x_1, x_2, ..., x_T), x̄ = (1/T) Σ_{i=1}^T x_i, and M = max(0.5, x_min). The motivation for this selection is as follows. First, Poisson means have to be greater than zero, so we set their lower boundary to M. Secondly, the λ_i's have to belong to the interval (x_min, x_max), cf. Böhning (1999). This applies to the Gaussian means as well. Thirdly, the upper and lower limits of σ_1, σ_2 are motivated by the Bhatia-Davis (Bhatia and Davis, 2000) and von Szőkefalvi Nagy (Nagy, 1918) inequalities.
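The grid of candidate TPMs can be generated programmatically; a minimal sketch of the 81-element grid described above:

```r
# All 2x2 TPMs whose diagonal elements gamma_11 and gamma_22
# each range over 0.1, 0.2, ..., 0.9 (81 candidates in total).
grid <- expand.grid(g11 = seq(0.1, 0.9, by = 0.1),
                    g22 = seq(0.1, 0.9, by = 0.1))
tpm_candidates <- lapply(seq_len(nrow(grid)), function(k)
  matrix(c(grid$g11[k], 1 - grid$g11[k],
           1 - grid$g22[k], grid$g22[k]), 2, 2, byrow = TRUE))
```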
For the Poisson HMMs fitted, we consider all possible combinations of the parameters described above. For the Gaussian HMMs, however, we sample 12500 sets of initial parameters for each of the 16 data sets to reduce computation time (the total number of fits reduces from 39066300 to 200000). As in the previous section, we limit the maximal number of iterations to 10^4 whenever possible. Other settings were left at their defaults.
To evaluate the performance of the optimizers considered, we rely on two criteria. First, we register whether the optimizer converges (all errors are registered regardless of their type). Secondly, when an optimizer converges, we also register whether the attained nll is equal to the "true" nll (with a gentle margin of ±5%). We calculate this "true" nll as the median of all nlls obtained when all optimizers are initiated with the true parameter values.
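The second criterion can be coded as a small helper; a sketch with hypothetical names:

```r
# A converged run counts as having found the global maximum if its nll lies
# within +/- 5% of the "true" nll (the median nll over all optimizers
# started at the true parameter values).
found_global_max <- function(nll, true_nll, tol = 0.05) {
  is.finite(nll) && abs(nll - true_nll) <= tol * abs(true_nll)
}
```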

Results
Table 1 shows the results for the two-state Poisson HMMs fitted to the TYT data. We note that hjn has the best performance: it basically does not fail and converges almost always; the four marqLevAlg routines follow closely. Moreover, nlminb_grhe fails to converge much more often than the other optimizers, but is the most likely to find the global maximum when it converges. For the other optimizers, the failure rate is generally low (less than 5%) and the global maximum is found in more than 65% of the cases. Table 2 reports the results from the second setting, also with two-state Poisson HMMs. The results show that all optimizers find the global maximum in the majority of cases (> 96%) when converging. Moreover, the failure rates are relatively low for all optimizers with the exception of CG, which stands out with comparably high failure rates.
Finally, Table 3 presents the results from the third setting with Gaussian HMMs. The results regarding the failure rates point in the same direction as those from the Poisson HMM setting: CG attains the highest failure rate, followed by newuoa and L-BFGS-B. The remaining optimizers perform satisfactorily. Note, however, that nlminb does not benefit from being supplied with the gradient and Hessian from TMB. Regarding the global maximum found, many algorithms reach success rates of about 95% or more. Notable exceptions are Nelder-Mead and nlm when not both gradient and Hessian are provided by TMB. As observed previously for the TYT data, nlm benefits strongly from the full use of TMB.

Hybrid algorithm
Last, we ran a small simulation study investigating a hybrid algorithm within the TMB framework in the spirit of Bulla and Berzel (2008). The hybrid algorithm starts with the Nelder-Mead optimizer and switches to nlminb after a certain number of iterations. The idea is to benefit both from Nelder-Mead's rapid departure from poor initial values and from the high convergence speed of nlminb. Our study is performed on 1000 random sets of initial values in a similar fashion to Section 5, using the weekly returns data set. For every set of initial values, we sequentially increase the number of iterations carried out by Nelder-Mead by ten (starting at one) until convergence or until 10000 iterations have been reached, in which case we classify the attempt as a failure. For comparison, we also run an optimization with nlminb alone, starting directly from the same sets of initial values. In both cases, nlminb benefits from the gradient and Hessian provided by TMB.
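One pass of the hybrid scheme for a given Nelder-Mead iteration budget can be sketched as follows; fn, gr, and he would be obj$fn, obj$gr, and obj$he from a TMB MakeADFun object, and the function name is ours.

```r
# Hybrid scheme: a limited Nelder-Mead run, then nlminb (with exact
# gradient/Hessian when available) from the resulting parameter values.
hybrid_fit <- function(fn, init, nm_iter, gr = NULL, he = NULL) {
  pre <- optim(init, fn, method = "Nelder-Mead",
               control = list(maxit = nm_iter))       # rough initial descent
  nlminb(start = pre$par, objective = fn, gradient = gr, hessian = he,
         control = list(iter.max = 1e4))              # fast local convergence
}
```

In the study, nm_iter is increased in steps of ten until nlminb reports convergence or the overall iteration budget is exhausted.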
Figure 12 reports the results of our study. The leftmost columns show that only one Nelder-Mead iteration was required for the hybrid algorithm to converge for 827 of the 1000 sets of initial values, while nlminb converges for only 698 of these 827 sets. Furthermore, for 39 additional sets of initial values, the hybrid algorithm converges with 10 iterations carried out by the Nelder-Mead optimizer, whereas nlminb fails 26 times on these sets. Increasing the number of initial iterations carried out by Nelder-Mead leads to similar patterns. Finally, both optimization routines fail to converge for 18 sets of initial values. Thus, in total, the hybrid algorithm with one or more Nelder-Mead iterations failed in only 1.8% of the cases, whereas direct use of nlminb failed in around 12% of the cases. This indicates that rather few Nelder-Mead iterations suffice to improve the convergence rate substantially.

Concluding remarks
This paper addresses a couple of aspects concerning parameter estimation for HMMs via TMB and R using direct numerical maximization (DNM) of the likelihood. The advantage of TMB is that it permits quantifying the uncertainty of any quantity depending on the estimated parameters. This is achieved by combining the delta method with automatic differentiation. Moreover, TMB provides exact calculations of first- and second-order derivatives of the (log-)likelihood of a model by automatic differentiation. This allows for efficient gradient- and/or Hessian-based optimization of the likelihood.
In the first part of the paper, we propose a straightforward technique to quantify the uncertainty of smoothing probabilities via the calculation of Wald-type CIs. The advantage of this approach is that one avoids computationally intensive bootstrap methods. By means of several examples, we illustrate how such CIs provide additional insight into the commonly used state classification via smoothing probabilities. For a practitioner working with HMMs, the presented uncertainty quantification constitutes a new tool for obtaining a better understanding of the dynamics of the hidden process.
Subsequently, we examine the speed and accuracy of the different optimizers in three simulation settings. Our results show a slight variation in both the number of iterations and computational speed. In particular, the optimizers nlminb, nlm, marqLevAlg, and BFGS usually possess the highest computational speed. The computing time required by all optimizers reduces when using the gradient provided by TMB. The number of iterations reduces in particular for nlminb and nlm when employing both gradient and Hessian from TMB. This suggests a more direct path towards the (global) likelihood maximum compared to optimizer-internal approximation routines. Regarding the accuracy, measured in terms of the attained negative log-likelihood and parameter estimates (and their variability), the results show little variation across the different optimizers. This is indeed a positive finding, indicating that the optimizers perform relatively equally in terms of accuracy; however, this statement only holds true when using optimal initial values. Then, we examine robustness towards initial values across different optimizers. In three settings, we measure a) how often the optimizers fail to converge and b) how often they attain the global maximum of the likelihood. Among others, the optimizer hjn shows an overall good performance. Notably, nlm benefits strongly from employing both gradient and Hessian provided by TMB, which is not the case for nlminb. Altogether, we observe a trade-off between failure rates and convergence to the global maximum. Nevertheless, none of the optimizers shows an exceptionally bad performance. Finally, we illustrate that a hybrid algorithm starting with the Nelder-Mead optimizer and switching to nlminb after a certain number of iterations converges more often than nlminb alone. Notably, an effect is visible even with a single initial iteration carried out by Nelder-Mead.

B Reparameterization of the likelihood function
Along with the data, θ serves as input for calculating the likelihood by the previously illustrated forward algorithms. Some constraints need to be imposed on θ: (i) In the case of the Poisson HMM, the parameter vector λ = (λ_1, ..., λ_m) necessary for computing the conditional distribution has to consist of strictly positive elements.
(ii) The elements γ_ij of the TPM Γ must be non-negative, and each row of Γ needs to sum to one.
Tackling these constraints by means of constrained optimization routines when maximizing the likelihood can lead to several difficulties. For example, constrained optimization routines may require a significant amount of fine-tuning. Moreover, the number of available optimization routines is significantly reduced. A common alternative works by reparametrizing the log-likelihood as a function of unconstrained, so-called "working" parameters {T, η} = g^{-1}(Γ, λ), as follows. One possibility to reparametrize Γ is given by τ_ij = log(γ_ij / γ_ii) for i ≠ j, where the τ_ij ∈ R are the m(m − 1) unconstrained elements of an m × m matrix T with missing diagonal elements. The matching reverse transformation is γ_ij = exp(τ_ij) / (1 + Σ_{k≠i} exp(τ_ik)) for i ≠ j, and the diagonal elements of Γ follow from Σ_j γ_ij = 1, i = 1, ..., m (see Zucchini et al. (2016), p. 51). For a Poisson HMM, a simple method to reparametrize the conditional Poisson means λ into η is given by η_i = log(λ_i) for i = 1, ..., m. Consequently, the constrained "natural" parameters are obtained via λ_i = exp(η_i). Estimates of the natural parameters {Γ, λ} can therefore be obtained by maximizing the reparametrized likelihood with respect to {T, η}, and then transforming the estimated working parameters back to natural parameters via the above transformations, i.e. {Γ̂, λ̂} = g(T̂, η̂). In general, the above procedure requires that the function g is one-to-one.
With a Gaussian HMM, the means are already unconstrained and do not require any transformation. However, the standard deviations can be transformed similarly to the Poisson rates via η_i = log(σ_i) for i = 1, ..., m, and the "natural" parameters are then obtained via the corresponding reverse transformation σ_i = exp(η_i).
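To make the transformations above concrete, the following is a minimal NumPy sketch (we use Python for illustration here, although the paper's implementation is in R/TMB); the function names `working_to_natural` and `natural_to_working` are our own.

```python
import numpy as np

def working_to_natural(tau, eta):
    """Map unconstrained working parameters (tau, eta) to (Gamma, lambda).

    tau: (m, m) matrix whose off-diagonal entries are unconstrained reals;
         the diagonal is ignored (implicitly tau_ii = 0).
    eta: length-m vector with eta_i = log(lambda_i).
    """
    exp_tau = np.exp(tau)
    np.fill_diagonal(exp_tau, 1.0)  # exp(0) for the implicit diagonal
    # Row-normalizing yields gamma_ij = exp(tau_ij) / (1 + sum_{k!=i} exp(tau_ik)),
    # so each row of Gamma sums to one by construction.
    Gamma = exp_tau / exp_tau.sum(axis=1, keepdims=True)
    lam = np.exp(eta)  # strictly positive Poisson means
    return Gamma, lam

def natural_to_working(Gamma, lam):
    """Inverse map: tau_ij = log(gamma_ij / gamma_ii), eta_i = log(lambda_i)."""
    tau = np.log(Gamma / np.diag(Gamma)[:, None])
    np.fill_diagonal(tau, 0.0)
    return tau, np.log(lam)
```

A round trip `working_to_natural(*natural_to_working(Gamma, lam))` recovers the natural parameters, illustrating that g is one-to-one on row-stochastic matrices with positive entries.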

C Confidence intervals
When handling statistical models, confidence intervals (CIs) for the estimated parameters can be derived via various approaches. One common approach is based on finite-difference approximations of the Hessian. However, as Visser et al. (2000) point out, there are better alternatives when dealing with most HMMs. Preferred are profile likelihood-based CIs or bootstrap-based CIs, where the latter are now widely used despite the potentially large computational load (Bulla and Berzel, 2008; Zucchini et al., 2016). For common Poisson HMMs, Bacri et al. (2022) show that TMB can yield valid Wald-type CIs in a fraction of the time required by classical bootstrap-based methods.
In this section, we describe the Wald-type and the bootstrap-based approaches. The likelihood profile method frequently fails to yield CIs (see Bacri et al., 2022) and is therefore not detailed.

C.1 Wald-type confidence intervals
Wald-type CIs are based on evaluating the Hessian at the optimum found by the numerical optimization procedure. As illustrated above, the optimized negative log-likelihood function of an HMM typically depends on the working parameters. However, inference on the natural parameters is by far more relevant than inference on the working parameters. Since the Hessian ∇² log L(T̂, η̂) relies on the working parameters {T, η}, an estimate of the covariance matrix of {Γ̂, λ̂} can be obtained via the delta method through Cov({Γ̂, λ̂}) ≈ G (−∇² log L(T̂, η̂))^{-1} Gᵀ, where G denotes the Jacobian of g evaluated at {T̂, η̂} and {Γ̂, λ̂} = g(T̂, η̂), as defined in (Zucchini et al., 2016, Ch. 3.3.1) and Bacri et al. (2022). From this, TMB deduces the standard errors for the working and natural parameters, from which 95% CIs can be formed via λ̂ ± q_0.975 · SE(λ̂), where q_0.975 ≈ 1.96 is the 97.5th percentile of the standard normal distribution and SE(λ̂) denotes the standard error of λ̂.
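As a toy illustration of the delta-method mechanics, the following Python sketch deliberately reduces the problem to a single i.i.d. Poisson rate without hidden states (so it is not the HMM case, where TMB handles the Hessian and Jacobian automatically): the rate is estimated on the working scale η = log λ, and the Wald CI is mapped back to the natural scale.

```python
import numpy as np

def wald_ci_poisson(x, level_q=1.959963984540054):
    """Delta-method Wald CI for an i.i.d. Poisson rate, estimated on the
    unconstrained working scale eta = log(lambda).

    The negative log-likelihood is nll(eta) = n*exp(eta) - sum(x)*eta + const,
    so its Hessian at the optimum is H = n*exp(eta_hat), and
    Var(eta_hat) ~ H^{-1}.  The delta method then maps this back to the
    natural scale: SE(lambda_hat) = |d lambda / d eta| * SE(eta_hat).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    lam_hat = x.mean()                 # MLE on the natural scale
    eta_hat = np.log(lam_hat)          # working-scale estimate
    H = n * np.exp(eta_hat)            # Hessian of the nll in eta
    se_eta = np.sqrt(1.0 / H)
    se_lam = np.exp(eta_hat) * se_eta  # delta method: G * H^{-1} * G^T, G = exp(eta)
    return lam_hat, (lam_hat - level_q * se_lam, lam_hat + level_q * se_lam)
```

For this reduced example the result collapses to the familiar SE √(λ̂/n); in the HMM setting, G and the Hessian are matrices over all working parameters, but the structure of the computation is the same.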

C.2 Bootstrap-based confidence intervals
The bootstrap method was described by Efron and Tibshirani (1993) in their seminal work and is widely used by practitioners. Many bootstrap techniques have been developed since then and have been applied extensively in the scientific literature. This paper will not review these techniques, but instead uses one of them: the so-called parametric bootstrap. As Härdle et al. (2003) point out, the parametric bootstrap is suitable for time series, which motivates our choice. For details on the parametric bootstrap implementation in R, we refer to (Zucchini et al., 2016, Ch. 3.6.2, pp. 58-60). At its heart, bootstrapping requires some form of re-sampling to be carried out. Then, the chosen model is re-estimated on each new sample, and eventually some form of aggregation of the results takes place. We make use of TMB to accelerate the re-estimation procedure, then use the resulting median along with the 2.5th and 97.5th empirical percentiles to infer a 95% CI.
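The simulate-refit-aggregate cycle can be sketched as follows; again a deliberately reduced Python toy with a plain Poisson rate, where the "refit" is a one-line mean rather than a full DNM fit of an HMM (for an HMM, one would simulate Markov-chain paths and re-run the TMB-accelerated optimization on each replicate).

```python
import numpy as np

def parametric_bootstrap_ci(x, n_boot=2000, level=0.95, seed=1):
    """Percentile-based parametric bootstrap CI for a Poisson rate.

    1. Fit the model to the original data.
    2. Simulate n_boot data sets from the *fitted* model.
    3. Re-estimate on each simulated data set.
    4. Aggregate: report the median and empirical percentiles.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    lam_hat = x.mean()                                    # step 1: fit
    boot = rng.poisson(lam_hat, size=(n_boot, x.size))    # step 2: simulate
    boot_est = boot.mean(axis=1)                          # step 3: refit
    alpha = 1.0 - level
    lo, hi = np.quantile(boot_est, [alpha / 2, 1 - alpha / 2])  # step 4
    return lam_hat, (lo, hi), np.median(boot_est)
```

The computational load is dominated by step 3, which is exactly where the fast TMB-based re-estimation pays off in the HMM setting.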

D Estimated models for real data sets
Table 4: Estimates and Wald-type 95% CIs for a two-state Poisson HMM fitted to the TYT data, a two-state Poisson HMM fitted to the soap data, and a three-state Gaussian HMM fitted to the weekly returns data. From left to right, the columns contain: the parameter name, the parameter estimate, and the lower (L.) and upper (U.) bound of the corresponding 95% CI derived via the Hessian provided by TMB, repeated for the three HMMs. As standard deviations and Poisson rates must be non-negative, the CIs are adjusted when necessary. Similarly, the TPM elements and the stationary distribution must take values between 0 and 1, and the CIs are adjusted accordingly when necessary. AIC and BIC scores are displayed underneath for corresponding models with two, three, four, and five states (when the model estimation converges). Model estimates are derived by nlminb with TMB's gradient and Hessian functions passed as arguments.

Figure 1 :
Figure 1: Plot of a two-state Poisson HMM fitted to the TYT data (of size 87). The colored dots correspond to the conditional mean of the inferred state at each time. Model estimates are displayed in Table 4.

Figure 2 :
Figure 2: Smoothing probabilities and confidence intervals of a two-state Poisson HMM fitted to the TYT data set. The solid line shows the smoothing probability estimates, and the 95% CIs are represented by vertical lines. The bottom graph displays smoothing probabilities for the most likely state estimated for each data point. The vertical confidence interval lines are colored differently per hidden state. Further details on the model estimates are available in Table 4.

Figure 4 :
Figure 4: Smoothing probabilities and confidence intervals of a two-state Poisson HMM fitted to the soap data set. The solid line shows the smoothing probability estimates, and the 95% CIs are represented by vertical lines. The bottom left graph displays smoothing probabilities for the most likely state estimated for each data point. The vertical confidence interval lines are colored differently per hidden state. Further details on the model estimates are available in Table 4.

Figure 5 :
Figure 5: Plot of the weekly returns (of size 2230). The colored dots correspond to the conditional mean of the inferred state from a fitted three-state Gaussian HMM (see Table 4). For readability, only the first 200 data points are plotted.

Figure 6 :
Figure 6: Smoothing probabilities and confidence intervals of a three-state Gaussian HMM fitted to the weekly returns data. The solid line shows the smoothing probability estimates, and the 95% CIs are represented by vertical lines. The bottom right graph displays smoothing probabilities for the most likely state estimated for each data point. The vertical confidence interval lines are colored differently per hidden state. Only the first 200 out of the 2230 values are shown for readability purposes. Further details on the model estimates are available in Table 4.

Figure 7 :
Figure 7: Plot of the hospital data (of size 87648). The colored dots correspond to the conditional mean of the inferred state at each time. For readability, only the first week starting on Monday (from Monday January 4th 2010 at 00:00 to Sunday January 10th 2010 at 23:59) is plotted instead of the entire 10 years. Model estimates are displayed in Table 5.

Figure 8 :
Figure 8: Smoothing probabilities and confidence intervals of a five-state Gaussian HMM fitted to the hospital data set. The solid line shows the smoothing probability estimates, and the 95% CIs are represented by vertical lines. The bottom right graph displays smoothing probabilities for the most likely state estimated for each data point. The vertical confidence interval lines are colored differently per hidden state. For readability, only the first week starting on Monday (from Monday January 4th 2010 at 00:00 to Sunday January 10th 2010 at 23:59) is plotted instead of the entire 10 years. Further details on the model estimates are available in Table 5.

Figure 9 :
Figure 9: Median duration (in milliseconds) and median number of iterations, together with the 2.5%- and 97.5%-quantiles, when fitting two-state Poisson HMMs to 200 replications in the first setting.

Figure 10 :
Figure 10: Median duration (in milliseconds) and median number of iterations, together with the 2.5%- and 97.5%-quantiles, when fitting two-state Poisson HMMs to 200 replications in the second setting.

Figure 11 :
Figure 11: Median duration (in milliseconds) and median number of iterations, together with the 2.5%- and 97.5%-quantiles, when fitting two-state Gaussian HMMs to 200 replications in the third setting.

Figure 12 :
Figure 12: Convergence counts of nlminb and a hybrid algorithm (first Nelder-Mead, then nlminb). The hybrid algorithm uses 1, 10, 20, ... iterations of Nelder-Mead, then passes the estimates to nlminb as initial values. We randomly picked 1000 sets of initial values out of 1458000 potential candidates.

Figures 13 to 15 :
Figures 13 to 15 show the median of the parameter estimates and the nll, along with 95% CIs, from the three different simulation designs described in Section 4 across all optimizers studied, over 1000 realizations from the different models. All figures show that the different optimizers behave almost identically in terms of accuracy. However, we restate that the initial values used here are equal to the true parameter values.
The data were provided by the Kilts Center for Marketing, Graduate School of Business of the University of Chicago, and are available in the database at https://research.chicagobooth.edu/kilts/marketing-databases/dominicks. The data are displayed in Figure 3. Similarly to the previous section, we fitted multiple Poisson HMMs, and the preferred HMM in terms of AIC and BIC is again a two-state model (see Table 4 in Appendix D). Figure 4 displays the smoothing probabilities resulting from this model. The patterns of the CIs described in the previous section are, in principle, also present here. Nevertheless, an exception is highlighted in the bottom panel of Figure 4: at t = 152, the smoothing probability of State 2 is closer to the upper boundary of one than the corresponding probability at t = 147. Yet, the uncertainty is higher at t = 152.

Table 1 :
Performance of multiple optimizers estimating Poisson HMMs from the TYT data (first setting) over 7371 different sets of initial parameters. The first column lists all optimizers. The second column shows how often the optimizers fail to converge successfully. The third column displays how often the optimizers successfully found the global maximum of the nll when converging.

Table 2 :
Performance of multiple optimizers estimating Poisson HMMs from the second setting over 194481 different sets of initial parameters. The first column lists all optimizers. The second column shows how often the optimizers fail to converge successfully. The third column displays how often the optimizers successfully found the global maximum of the nll when converging.

Table 3 :
Performance of multiple optimizers estimating Gaussian HMMs from the third setting over 200000 different sets of initial parameters randomly drawn from 39066300 different candidate sets of initial values. The first column lists all optimizers. The second column shows how often the optimizers fail to converge successfully. The third column displays how often the optimizers successfully found the global maximum of the nll when converging.

Table 5 :
Estimates and Wald-type 95% CIs for a five-state Gaussian HMM fitted to the hospital data. From left to right, the columns contain: the parameter name, the parameter estimate, and the lower (L.) and upper (U.) bound of the corresponding 95% CI derived via the Hessian provided by TMB. As the standard deviations must be non-negative, the CIs are adjusted when necessary. Similarly, the TPM elements and the stationary distribution must take values between 0 and 1, and the CIs are adjusted accordingly when necessary. AIC and BIC scores are displayed underneath for corresponding models with two, three, four, and five states (when the model estimation converges). Model estimates are derived by nlminb with TMB's gradient and Hessian functions passed as arguments.