Robust and Consistent Estimation of Generators in Credit Risk

Bond rating Transition Probability Matrices (TPMs) are built over a one-year time-frame, but for many practical purposes, such as the assessment of portfolio risk or the computation of banking capital requirements (e.g. the new IFRS 9 regulation), one needs the TPM and probabilities of default over a shorter time interval. In the context of continuous time Markov chains (CTMC), several deterministic and statistical algorithms have been proposed to estimate the generator matrix. We focus on the Expectation-Maximization (EM) algorithm of Bladt and Sorensen (2005) for a CTMC with an absorbing state. This work's contribution is threefold. Firstly, we provide directly computable closed-form expressions for the quantities appearing in the EM algorithm and the associated information matrix, allowing one to approximate confidence intervals easily. Previously, these quantities had to be estimated numerically, and our closed-form expressions yield considerable computational speedups. Secondly, we prove convergence to a single set of parameters under very weak conditions (for the TPM problem). Finally, we benchmark our results numerically against other known algorithms, in particular on several problems related to credit risk. The EM algorithm we propose, equipped with the new formulas (and error criteria), outperforms the other known algorithms on several metrics; in particular, it overestimates the probabilities of default of higher ratings much less than other statistical algorithms.


Introduction
Credit ratings play a key role not just in the calculation of a bank's capital charge (the amount of capital a bank must hold) but are also typically a requirement for corporations wishing to issue bonds. Different agencies provide firms with a rating, and the credit rating an agency gives a company reflects, in some respect, the financial health of that company. Typically ratings are of the form AAA, AA, A, BBB, BB, B, C, D (although this varies between agencies), with AAA the best (safest), C the worst (riskiest) and D implying the firm has defaulted. It is also standard for banks to use their own internal ratings system (see Yavin et al. 2014). For an overview of the 'science' involved in the rating procedure see Cantor (2004).
The main object of interest in this work is the so-called annual Transition Probability Matrix (TPM): a stochastic matrix which shows the migration probabilities of different rated companies within a year. Rating agencies produce these annually. Such matrices may fail to be exactly stochastic initially, for example due to company mergers or rounding.
This is compounded by the known fact, corroborated by our numerical experiments, that certain algorithms overestimate the probability of default (PD). Our methodology yields a way to obtain point-in-time (PIT) estimates of the PD from the through-the-cycle (TTC) estimates.
Credit rating models within the Markovian framework are handy from both a theoretical and a numerical perspective. Evidence is given in Lando and Skodeberg (2002) that the Markovian structure does not hold in practice; nonetheless, within the Markovian structure, efficient implementation of apt Markovian credit risk models and the related risk-measure estimations able to deal with massive portfolios is a challenging problem (see Brigo et al. 2014, Rutkowski and Tarca 2015). Several models produce non-Markovian effects, such as mixing two generators (Frydman and Schuermann 2008) or considering hidden Markov models (Korolkiewicz 2012); see also Long et al. (2011). All non-Markovian models require, in one way or another, access to additional data for accurate calibration. These data are costly, need to be updated over time, and many companies opt to deal only with the TPMs. This work focuses on practitioners who do not have access to such data. The issue of rating momentum will be dealt with in forthcoming research.

The problem at hand
We take the view of a financial agent who wishes to estimate probabilities of default or assess the risk in their portfolio due to credit transitions but does not have access to (the expensive) individual credit rating transitions. The agent only has the annual TPM, say P(1), and uses a continuous time Markov chain (CTMC), say (P(t))_{t≥0}, with a finite state space to model the changes in rating over time. Under standard conditions the evolution of the CTMC can be written as P(t) = e^{Qt}, where Q is the generator matrix. The problem is then to estimate Q given P(1). This estimation is non-trivial due to the so-called embeddability problem (not reviewed here). It is discussed in great detail by Israel et al. (2001) and, for more of the mathematics and many of the existing results on the embeddability problem, we point the reader to Lin (2001).
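To make the embeddability problem concrete, the following sketch takes a hypothetical 3-rating TPM (the numbers are purely illustrative) whose principal matrix logarithm fails the generator conditions; here a zero annual transition probability that is nonetheless reachable in two steps rules out embeddability.

```python
import numpy as np
from scipy.linalg import expm, logm

# Hypothetical 3-rating annual TPM (A, B, D) with default absorbing.
# P[0, 2] = 0, yet default is reachable from A via B, so P cannot be
# of the form e^Q for any generator Q.
P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])

L = logm(P).real  # candidate generator: the principal matrix logarithm

rows_sum_to_zero = np.allclose(L.sum(axis=1), 0.0, atol=1e-10)
off_diag_nonneg = all(L[i, j] >= 0 for i in range(3)
                      for j in range(3) if i != j)
print("rows sum to zero:", rows_sum_to_zero)           # holds for any TPM log
print("off-diagonals non-negative:", off_diag_nonneg)  # fails here
print("expm(L) recovers P:", np.allclose(expm(L), P, atol=1e-8))
```

The deterministic algorithms of Section 3 repair such an invalid logarithm; the EM algorithm instead maximizes a likelihood over valid generators.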

Contribution of this work
(i) We provide sufficient conditions to extend the convergence result of Bladt and Sorensen (2005) to individual parameters rather than just convergence of likelihoods. The conditions presented are trivially satisfied in the context of the TPM problem.
(ii) We derive closed-form expressions for the entries of the Hessian of the likelihood function used in the EM algorithm. This eliminates several instability issues appearing in other numerical implementations found in the literature and allows for comparative computational speedups. Moreover, the result provides a way to estimate the error of the estimation and to assess the nature of the stationary point the algorithm has converged to.
(iii) We give a short overview of known methods and implement them with some modifications so as to improve their performance. See Sections 3 & 4 for precise meanings: we apply the algorithms to certain credit risk problems and carry out a simulation study to check the impact on the computation of risk charges, namely IRC (Incremental Risk Charge) with VaR (Value at Risk), IDR (Incremental Default Risk) with VaR and IRC with ES (Expected Shortfall). We distinguish portfolio types (mixed, investment or speculative); the impact of different types of generators (stable vs unstable); dependence on the sample size; and general convergence. We compare probabilities of default as maps of time across different algorithms and find interesting results.
For the study carried out, the implemented EM algorithm is very competitive. It is slightly slower than the deterministic algorithms but much faster than the MCMC algorithms. It embeds statistical properties like robustness that deterministic algorithms cannot capture.
Remark 1.1 We focus purely on continuous over discrete time models. Continuous time algorithms yield robust estimators while the discrete ones do not, with robustness understood in the following sense: from P(1) estimate P(0.5) and P(0.25). From P(0.5) estimate P(0.25) again. Continuous algorithms yield the same P(0.25); discrete algorithms (in general) will not.
This work is organized as follows. In Section 2, we present the EM algorithm and state our main theoretical findings. In Section 3 we briefly present other known algorithms and in Section 4 we present the benchmarking results.
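The robustness notion of Remark 1.1 can be sketched as follows, using a hypothetical embeddable toy generator (the numbers are illustrative only): the continuous-time route recovers the same quarter-year TPM whether one starts from the annual or from the half-year matrix.

```python
import numpy as np
from scipy.linalg import expm, logm

# Illustrative 3-state generator (2 ratings + absorbing default).
Q = np.array([[-0.12, 0.10, 0.02],
              [ 0.15, -0.30, 0.15],
              [ 0.00,  0.00, 0.00]])

P1 = expm(Q)                          # the "observed" annual TPM P(1)
P_half = expm(0.5 * logm(P1).real)    # half-year TPM from P(1)

# Quarter-year TPM estimated from P(1) ...
P_quarter_a = expm(0.25 * logm(P1).real)
# ... and re-estimated from P(0.5): the generator is the same up to
# rescaling by the observation window, so the two answers coincide.
P_quarter_b = expm(0.25 * (logm(P_half).real / 0.5))

print(np.allclose(P_quarter_a, P_quarter_b, atol=1e-10))
```

A discrete-time scheme (e.g. taking matrix roots of P(1) and of P(0.5) separately) carries no such guarantee.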

The EM algorithm
There exists extensive literature on the majority of the algorithms we present in this paper, therefore, we only provide brief discussions and include references for additional information. Further, we will use the theory of Markov chains extensively. We do not provide details of the theory, however, interested readers can consult texts such as Norris (1998).

Preliminaries and standing convention
Throughout this manuscript, we consider companies defined on a finite state space {1, . . . , h}, where each state corresponds to a rating. We denote AAA as rating 1 and D (default) as rating h. We adopt the standard notation that P is an h-by-h stochastic matrix, which will be the observed TPM (at, say, time t = 1), and Q is an h-by-h generator matrix. We further denote P_{ij} := (P)_{ij}, q_{ij} := (Q)_{ij} and the intensity of state i by q_i = Σ_{j≠i} q_{ij}, where i, j ∈ {1, . . . , h}. A standard assumption used in credit risk modelling is that default is an absorbing state, hence P_{hh} = 1. We work with infinitesimal generators of the following class.
Definition 2.1 [Stable-Conservative infinitesimal Generator matrix of a CTMC] We say a matrix Q is a generator matrix if the following properties are satisfied for all i, j ∈ {1, . . . , h}: 0 ≤ q_{ij} < ∞ for i ≠ j, and q_{ii} = −q_i (each row sums to zero). If a matrix Q satisfies the above properties, then for all t ≥ 0 the matrix P(t) := e^{Qt} is a stochastic matrix, where e^A is the matrix exponential of the matrix A (Norris 1998, p. 63). The goal of the algorithms presented is to calculate a generator matrix Q such that e^{Qt} is the best fit to the observed TPM, where t denotes the length of time between the rating updates (typically one year).
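A minimal check of Definition 2.1 can be sketched as follows (the generator below is hypothetical); note how e^{Qt} is row-stochastic for every horizon t, as the definition guarantees.

```python
import numpy as np
from scipy.linalg import expm

def is_generator(Q, tol=1e-10):
    """Check the properties of Definition 2.1: finite, non-negative
    off-diagonal entries and rows summing to zero."""
    Q = np.asarray(Q, dtype=float)
    off = Q - np.diag(np.diag(Q))
    return (np.isfinite(Q).all()
            and (off >= -tol).all()
            and np.allclose(Q.sum(axis=1), 0.0, atol=tol))

# Hypothetical 3-state generator with default (state 3) absorbing.
Q = np.array([[-0.11, 0.10, 0.01],
              [ 0.05, -0.25, 0.20],
              [ 0.00,  0.00, 0.00]])

print(is_generator(Q))
for t in (0.25, 1.0, 5.0):
    P_t = expm(Q * t)
    print(np.allclose(P_t.sum(axis=1), 1.0), (P_t >= 0).all())
```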
Throughout, let (X(t))_{t≥0} denote a CTMC over the finite state space {1, . . . , h} with a generator Q of the above class. Associated with X(t), for i, j in the state space, we denote by K_{ij}(t) the number of jumps from i to j in the interval [0, t] and by S_i(t) the holding time of state i in the interval [0, t].
Remark 2.2 If a matrix P is embeddable,† the algorithms below are pointless and one can easily tackle the problem through eigenvalue decomposition, etc. Or, in the case where the exact timing of rating transitions is known, one can use the standard maximum likelihood estimator as in Jarrow et al. (1997). In our examples the only data given are a set of yearly TPMs, which in general are not embeddable, and the methods just mentioned do not yield useful results.

The algorithm
Many methods have been developed in statistics to calculate the maximum likelihood estimate, but many break down in the presence of missing data. Mathematically, we are interested in some set X, but we are only able to observe Y, with the assumption that there is a many-to-one mapping from X to Y. That is, X is a much richer set than Y. When dealing with such a case, the Expectation Maximization (EM) algorithm often offers a robust solution to the problem. McLachlan and Krishnan (2007) provide a complete overview of the algorithm.
The basis of the algorithm is as follows: we observe data y, which is a realization (element) of Y. We know y has density function g (sometimes referred to as a sampling density) depending on parameters θ in some space Θ, but we want the density (likelihood) of X(y). Hence, postulate some family of densities f(·; θ), dependent on θ, where f corresponds to the density of the complete data-set X(y) (the set of points x ∈ X which are in the pre-image of y ∈ Y). The relation between f and g is that g is obtained by integrating f over the pre-image X(y).
† In this setting a stochastic matrix P is embeddable if there exists a generator Q such that P = e^Q.
The idea is that the EM algorithm maximizes g w.r.t. θ, but we force it to do so using the density f. Further, define R(θ'; θ) := E_θ[log f(X; θ') | y], where E_θ[· | y] is the conditional expectation, conditional on y, under parameters θ. We assume R to exist for all pairs (θ', θ); in particular we assume f(x; θ) > 0 almost everywhere in X for all θ (otherwise the logarithm is infinite). Let us clarify: f is calculated using θ', but the expectation is calculated using θ. The EM algorithm is then the following iterative procedure.
(i) Choose an initial parameter θ^(0) and set p = 0. (ii) E-step: compute R(θ; θ^(p)). (iii) M-step: choose θ^(p+1) to be the value of θ ∈ Θ that maximizes R(θ; θ^(p)). (iv) Check if the predefined convergence criteria are met; if not, set p = p + 1 and return to (ii).

The particular problem of generator estimation.
For our problem the observed process is a discrete time Markov chain (DTMC); the unobserved process to estimate is a continuous time Markov chain (CTMC). Therefore, the observed data are the discrete transitions and the parameters we wish to estimate are the entries of the generator. The likelihood of a continuous time, fully observed Markov chain with generator Q is given by the following expression (see Küchler and Sorensen 1997, Chapter 3.4), with K and S the same as before (immediately before Remark 2.2). Hence, given two generators Q', Q, the function R in (2.1) is as below, where y denotes the discrete time observations. Maximizing R(Q'; Q) for q'_{ij} yields the update q'_{ij} = E_Q[K_{ij} | y] / E_Q[S_i | y]. We follow an approach similar to Bladt and Sorensen (2005) (see also Bladt et al. 2002), but express the result in a framework more suited to the problem of generator estimation from TPMs, rather than estimation from individual movements. Furthermore, the result derived in Bladt et al. (2002) is for irreducible Markov chains, making it not applicable to our case (a CTMC with an absorbing state); this is accounted for in Proposition 2.4.
Consider the following functions (see Bladt et al. 2002), where we denote c = (c_1, · · · , c_h) ∈ R^h and Z ∈ R^{h×h} with Z_{ii} = 1 for i ∈ {1, · · · , h}. Observe that V*_{ij} is the Laplace-Stieltjes transform of the holding times S and the probability generating function of the jumps K, with initial and final states X(0) = i and X(t) = j respectively. Denote by V*(c, Z; t) the h-by-h matrix with elements V*_{ij}(c, Z; t). This allows us to give (a similar version of) the main theorem in Bladt et al. (2002). Consider a CTMC X that we observe at n time points 0 ≤ t_1 < t_2 < · · · < t_n and denote by y_s the state of the Markov chain at time t_s, i.e. y_s := X(t_s). Then the expected jumps and holding times over the observations follow. Proof. Observe that V* in (2.4) satisfies the differential equation in the statement of Theorem 2.3 (see Bladt and Sorensen 2005). The proof is omitted as one can solve the equation by employing the methods in Van Loan (1978).
Thus we obtain closed-form expressions for the two key quantities appearing in (2.3). This approach differs from Bladt and Sorensen (2005), where numerical schemes are described to solve the differential equations, namely Runge-Kutta and uniformization. These techniques can yield good results at this level, but our closed-form expression will pay dividends when it comes to error estimation.
† The Schur product of two h × h matrices A and B is the h × h matrix with elements A_{ij}B_{ij}.
This yields the relation we desire. However, in our example we have an observed TPM (or sequence of TPMs), P. In the case of equal observation windows of length t in the interval [0, T] (although it is trivial to generalize), the expectation can be expressed as below, where N = T/t (the number of observations) and P^u is the TPM of the u-th observation.
Remark 2.5 [The reducible case] Previously, we only had observed transitions, hence they must have a non-zero probability of occurring. Here we can sum s and r over the full range because P sr (t) acts as an indicator of possible transitions, that is, if P sr (t) = 0 we set the s, r component as 0. Clearly, if P sr (t) > 0, but (exp{Qt}) sr = 0, Q is misspecified.
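The E-step and M-step above can be sketched numerically. The sketch below (illustrative; a single annual TPM, transitions weighted by its entries) computes the conditional expectations via the Van Loan (1978) block-matrix identity: the top-right h-by-h block of exp(t·[[Q, C], [0, Q]]) equals the integral of e^{Qs} C e^{Q(t−s)} over [0, t].

```python
import numpy as np
from scipy.linalg import expm

def em_step(Q, P_obs, t=1.0):
    """One EM iteration for generator estimation from an observed TPM.
    E-step integrals use Van Loan's identity with C = e_i e_j^T."""
    h = Q.shape[0]
    Pt = expm(Q * t)

    def vanloan(C):
        # Top-right block of expm(t*[[Q, C], [0, Q]]) equals
        # the integral of e^{Qs} C e^{Q(t-s)} ds over [0, t].
        A = np.zeros((2 * h, 2 * h))
        A[:h, :h] = Q
        A[:h, h:] = C
        A[h:, h:] = Q
        return expm(A * t)[:h, h:]

    EK = np.zeros((h, h))   # expected jump counts E[K_ij | data]
    ES = np.zeros(h)        # expected holding times E[S_i | data]
    for i in range(h):
        C = np.zeros((h, h)); C[i, i] = 1.0
        Mi = vanloan(C)
        ES[i] = sum(P_obs[s, r] * Mi[s, r] / Pt[s, r]
                    for s in range(h) for r in range(h) if Pt[s, r] > 0)
        for j in range(h):
            if j == i or Q[i, j] == 0.0:
                continue
            C = np.zeros((h, h)); C[i, j] = 1.0
            Mij = vanloan(C)
            EK[i, j] = Q[i, j] * sum(
                P_obs[s, r] * Mij[s, r] / Pt[s, r]
                for s in range(h) for r in range(h) if Pt[s, r] > 0)

    # M-step: q_ij = E[K_ij] / E[S_i]; rebalance the diagonal.
    Q_new = np.zeros_like(Q)
    for i in range(h):
        if ES[i] > 0:
            Q_new[i] = EK[i] / ES[i]
        Q_new[i, i] = -Q_new[i].sum()
    return Q_new

# Sanity check on a hypothetical generator: if P_obs = expm(Q), i.e. Q
# already reproduces the observed TPM, one EM step is a fixed point.
Q = np.array([[-0.12, 0.10, 0.02],
              [ 0.15, -0.30, 0.15],
              [ 0.00,  0.00, 0.00]])
print(np.allclose(em_step(Q, expm(Q)), Q, atol=1e-6))
```

In practice one iterates `em_step` from a starting generator (cf. the starting-value discussion below) until the change between iterates falls below a tolerance.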

Likelihood Convergence of the EM algorithm.
For this problem Bladt and Sorensen (2005) provide a proof that the likelihood function converges, with one small caveat in order to keep the parameter space compact. Namely, they use a constrained parameter space Q_ε, which can be achieved by setting Q_ε = {Q ∈ Q : det[exp(Q)] ≥ ε} (Q is the parameter space from Definition 2.1) for some ε > 0. Theorem 4 in Bladt and Sorensen (2005) states that the algorithm will converge to a stationary point of the likelihood or hit the boundary of the parameter space they have induced. It is accepted that this is a crude approach to solving the problem and further analysis is needed when det[exp(Q)] = ε. An alternative approach would be to use a penalized likelihood, as discussed in McLachlan and Krishnan (2007, p. 214).

Parameter convergence criteria.
The above convergence is sufficient for one to conclude convergence of the likelihood. However, it is not sufficient for convergence of the parameters, as one cannot state that the sequence of iterates converges to a single point. From a theoretical standpoint this may not be as important as convergence of the likelihood itself; nonetheless, it is of key importance for applications. For instance, without convergence of the parameters, the risk charge different financial agents obtain from the same data may vary wildly, even under very strict convergence conditions. Before proving convergence we require two important points.
Remark 2.6 With (2.6) in mind, we assume that for any s ≠ r such that P^u_{sr}(t) = 0 for all u, we take the starting point q_{sr} := (Q^{(0)})_{sr} = 0. As discussed in Bladt and Sorensen (2005), any point set to zero will stay at zero for all iterations. Note that we are not changing the problem, since these terms would converge to zero under the EM algorithm anyway.
We denote the space of generator matrices which satisfy this condition as the restricted parameter space.
The above assumption ensures non-zero entries in the tridiagonal band and also only finite entries, as one can take ε as small as we wish. In the case of TPMs associated with credit ratings, such an assumption is trivially satisfied, as one generally has diagonally dominant matrices and companies can always be upgraded or downgraded by one notch; thus the P^u_{i,i±1} are typically non-zero. Diagonal dominance is sufficient for the generator to be identifiable and therefore the entries do not blow up; we discuss the notion of identifiability in Section 2.3.
Proving that the parameters converge is more challenging than the likelihoods; however, Wu (1983) provides a sufficient condition for this to occur, namely that R(Q^{(k+1)}; Q^{(k)}) − R(Q^{(k)}; Q^{(k)}) ≥ σ(||Q^{(k+1)} − Q^{(k)}||) for some forcing function σ. An example of a forcing function is σ(t) = λt², where λ > 0. We require the following bounds on the expected values to show convergence.
Lemma 2.8 Let N and P^u be as defined in (2.6) and assume that for i ≠ j there exists a u ∈ {1, . . . , N} such that P^u_{ij} > 0 (we observe a movement from i to j in observation window u). Then we obtain the following bounds on the expected number of jumps. Moreover, assuming there exists a u ∈ {1, . . . , N} such that P^u_{ii} > 0, we obtain the following bound on the expected holding time. To maintain the flow of the text we state our main convergence result immediately and defer the proof of the Lemma to Appendix A.1.
Theorem 2.9 Under Assumption 2.7, there exists a λ > 0 such that for all EM iterations k ∈ N the inequality above holds, where || · || is the Euclidean norm.
† A forcing function is defined as any function σ : [0, ∞) → [0, ∞) such that, for any sequence (t_k), σ(t_k) → 0 implies t_k → 0.
Proof. Writing out the difference in the R terms, and due to the form of the squared Euclidean norm and the function R, it is sufficient to show the inequality holds for all i ≠ j. Namely, it is sufficient to show the existence of a λ > 0 such that the componentwise inequality (2.9) holds for all i ≠ j. We tackle the log terms first. It is well known that we can express any C^∞-function using a Taylor expansion to a finite number of terms with some error (remainder) term. Moreover, the error term has a known form and hence, using an exact Taylor expansion to second order, there exists a Z between q^{(k)}_{ij} and q^{(k+1)}_{ij}; substituting into the LHS of (2.9), the condition simplifies. In order to show this bound we need to get a handle on Z. Clearly, there are two options between iterations: either q^{(k)}_{ij} ≤ q^{(k+1)}_{ij} or the reverse. In the latter case, since we can bound Q^{(k)} elementwise, using Lemma 2.8 and Assumption 2.7 we can bound the term from below by constants (depending on ε). Hence, we can choose a λ independent of k such that the condition is satisfied.
The remaining case follows a similar argument. Again, we can set Z as the larger of the two values and obtain an analogous inequality, which, using Lemma 2.8, can be reduced further. Since P^u_{ij} > 0 and we can bound each q_{ij} from above, again we can choose a λ independent of k.

Starting value for the EM algorithm.
The final point to discuss is the choice of the initial matrix Q. It is useful from a computational point of view to start in a good place. Here we choose Q based on a generalization of the QOG algorithm (described in Section 3.1) that allows for complex inputs. For each entry q_{ij} we define the input in terms of the magnitude |q_{ij}| and the real component Re(q_{ij}). With the newly defined Q we apply the QOG algorithm. We take any zero entries not in the final row to be a small number (10^{−5}, say) unless there are zero observed transitions. This defines our initial choice of Q. We define the EM algorithm steps as: (i) Take an initial intensity matrix Q and a positive value ε. (ii) While the convergence criteria are not met and all entries of Q are within the boundaries, return to the E-step.
(iii) End while and return Q.
This leads to the following theorem for convergence of the EM.
Theorem 2.10 [Convergence of the EM] Assume that our initial point is in the parameter space, is a true generator and satisfies Assumption 2.7. Then either the sequence of points {Q^{(k)}}_k converges to a single point which is also a stationary point of the likelihood, or the entries go to the boundary (blow up, or some tridiagonal elements in a non-absorbing row go to zero).
A proof of Theorem 2.10 follows directly from Theorem 4 in Bladt and Sorensen (2005) and our Theorem 2.9.
Remark 2.11 [The unique maximizer of the likelihood] A natural question one may ask is: does this stronger form of convergence imply convergence to the global maximum? The existence and uniqueness of maximum likelihood estimates in this setting is a very challenging problem with a long history. Bladt and Sorensen (2005) give a wonderful overview of the subject; Theorem 1 in Bladt and Sorensen (2005) also provides results on existence and uniqueness of the maximum. Unfortunately, one cannot say more than this: if one could derive conditions under which a unique maximum exists (for non-embeddable TPMs), then the above convergence result would be sufficient to conclude that the EM converges to the MLE.
During the final revision of this manuscript, Pfeuffer et al. (2017) came to our attention. There, the authors propose two algorithms to mitigate the effect of the EM's possible convergence towards a local but not necessarily the global maximum of the likelihood function (no proofs are given). Our Theorem 2.9 is handy in this context as it shows that once the EM lands 'near' the global maximum the iteration will converge to it.

Variance estimation
In this section, we derive an expression for the Hessian of the likelihood. We use a result in Oakes (1999) and follow Bladt and Sorensen (2009), however, unlike Bladt and Sorensen (2009), we provide a closed form expression for the Hessian. This result eliminates the stability problems observed in the numerical simulation case when the entries in Q are small. The Hessian provides a way to estimate the standard error of the maximum likelihood estimates and further allows us to assess the nature of the converged stationary point (this is further discussed in Section 4.4.1).
We point the reader to Bladt and Sorensen (2005, Theorem 1) for results on the existence and uniqueness of maximum likelihood estimators with respect to this problem. Further, for discussions of consistency and asymptotic normality related to this problem one should consult Kremer and Weißbach (2013, 2014). Kremer and Weißbach (2013) provide sufficient conditions for consistency; the key assumption relies on so-called model identifiability.† Kremer and Weißbach (2013) prove identifiability under conditions which are too restrictive for our purpose; Bladt and Sorensen (2005) and Dehay and Yao (2007) discuss the problem of identifiability in detail. From Cuthbert (1973) and Bladt and Sorensen (2005), for the model to be identifiable it is sufficient (though very crude) to have min_i (exp{Qt})_{ii} > 1/2; Culver (1966) gives a requirement for general matrices based on the eigenvalues, which one can always verify a posteriori once a Q is deduced. The crucial assumption in Kremer and Weißbach (2014) to obtain asymptotic normality is that the Hessian must be invertible at the true value; we can of course verify invertibility a posteriori.
We recall a result from Oakes (1999) for calculating the Hessian of the likelihood.
Lemma 2.12 The second derivative of the likelihood with parameter θ and observed information y is related to the EM function R by the identity above. Injecting (2.2) into the above we obtain (2.10) and (2.11), where δ_{ab} is the Kronecker delta. From our previous work, (2.10) is easy to obtain; however, (2.11) involves derivatives of the expected jumps and holding times and is thus challenging. Bladt and Sorensen (2009) opt for a simple numerical scheme to compute these derivatives and found unstable results, although the authors do remark that more sophisticated numerical schemes could yield improved results at greater computational expense.
† In our setting a model is identifiable if there are no two generators Q ≠ Q' such that exp{Qt} = exp{Q't}.
It is worth noting we have made no comment on the allowed values of α, β, μ and ν, other than the clear restriction that they belong to {1, . . . , h}. Let us now state the following definition.
Definition 2.13 [Allowed pairs] We say that the pair α, β is allowed if α ≠ β (not on the diagonal) and q_{αβ} is not converging to zero under the EM algorithm.
For practical applications, one can imagine the set of allowed values as the set of α, β such that q_{αβ} > ε, where ε is some cut-off point (10^{−8}, say). The reason we must exclude small parameters is that this analysis only holds in the large-data limit: since we do not have an infinite amount of data we cannot rule out some jump with certainty; however, if q_{αβ} is converging to zero, this parameter is either zero or extremely close to zero, and therefore we can bound it above by a small number. Moreover, from a mathematical point of view, a parameter which does tend to zero (or even becomes zero) lies on the boundary, where the notion of differentiability is not clear. Therefore, we can think of the 'allowed pairs' as the variables when solving the problem in a restricted parameter space. We now present the following theorem.
Remark 2.15 In the above representation for the derivative of R, we use subscripts of the form h + y_{s+1} and 3h + y_{s+1}; this is simply a consequence of the result in Van Loan (1978). Namely, we are not interested in all the entries of the matrix, only an h-by-h segment, and we therefore need to adjust the indexing to take elements only from that specific segment.
Using Theorem 2.14 and Lemma 2.12, we can write the elements of the Hessian corresponding to the q αβ q μν derivative as, A similar transform to (2.6) can be applied here to obtain the Hessian from TPMs. When using the result to estimate the error, some knowledge of the number of companies per rating is required.

Computation of the error.
Due to the Hessian only being defined for parameters q_{αβ} > 0, some parameters are not included in the Hessian; thus the matrix is smaller than (h−1)²-by-(h−1)². We compute the Hessian as follows. • Let N_a be the number of allowed points in the estimated Q returned from the EM algorithm. • Define an N_a-by-2 matrix V_Q as the matrix which records the allowed (in the sense of Definition 2.13) components of Q. The ij-th component of the Hessian is then the corresponding mixed partial derivative.
• If we denote the Hessian by the N_a-by-N_a matrix H(·), then the information matrix is given by −H(·). The estimated variance of the allowed parameter q_{ab} is then the i-th diagonal element of −H(·)^{−1}, where V_Q(i, 1) = a and V_Q(i, 2) = b. • Following standard statistics, the normal-based 95% confidence interval for q_{ab} is q_{ab} ± 1.96·√Var(q_{ab}).
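The steps above can be sketched as follows; the parameter estimates and the Hessian below are purely illustrative stand-ins for the output of the EM algorithm and of the closed-form Hessian expressions.

```python
import numpy as np

# Hypothetical converged estimates for three allowed parameters q_ab
# and a stand-in Hessian of the log-likelihood at the maximum.
q_allowed = np.array([0.10, 0.02, 0.15])
H = np.array([[-400.0,   10.0,    0.0],
              [  10.0, -900.0,    5.0],
              [   0.0,    5.0, -250.0]])

info = -H                          # observed information matrix
cov = np.linalg.inv(info)          # asymptotic covariance of the estimates
se = np.sqrt(np.diag(cov))         # standard errors

ci_lower = q_allowed - 1.96 * se
ci_upper = q_allowed + 1.96 * se
for q, lo_, hi_ in zip(q_allowed, ci_lower, ci_upper):
    print(f"{q:.3f}  95% CI [{lo_:.4f}, {hi_:.4f}]")
```

A negative definite Hessian (positive definite information matrix) also confirms that the stationary point reached is a local maximum.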

Deterministic algorithms
3.1.1. Diagonal adjustment (DA). The first method to discuss is diagonal adjustment; see Israel et al. (2001). Given a TPM, P, one calculates the matrix logarithm directly. However, due to the embeddability problem, the logarithm may not be a valid generator. To solve this problem Israel et al. (2001) suggest setting the negative off-diagonal entries to zero, i.e. q^{DA}_{ij} = max(q_{ij}, 0) for i ≠ j, and adjusting (re-balancing) the diagonal elements correspondingly, q^{DA}_{ii} = −Σ_{j≠i} q^{DA}_{ij} for i ∈ {1, · · · , h}, hence forcing the corresponding matrix Q^{DA} to satisfy the properties of a generator.

Weighted Adjustment (WA).
Weighted adjustment is also suggested in Israel et al. (2001). It follows diagonal adjustment except that one re-balances across the entire row: again, calculate the logarithm of the TPM to find the q_{ij}, set the negative off-diagonal entries to zero, and redistribute the removed negative mass across the remaining entries of the row in proportion to their absolute values.
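Under those descriptions, DA and WA can be sketched as follows (the redistribution rule in `weighted_adjustment` follows Israel et al. (2001); the example TPM is hypothetical and non-embeddable):

```python
import numpy as np
from scipy.linalg import logm

def diagonal_adjustment(P):
    """DA: zero out negative off-diagonal entries of log(P) and
    re-balance on the diagonal so each row sums to zero."""
    L = logm(P).real
    Q = np.maximum(L, 0.0)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def weighted_adjustment(P):
    """WA: zero out negative off-diagonal entries of log(P) and spread
    the removed (negative) mass across the row, proportionally to the
    absolute values of the remaining entries (diagonal included)."""
    L = logm(P).real
    h = L.shape[0]
    Q = L.copy()
    for i in range(h):
        neg = sum(-L[i, j] for j in range(h) if j != i and L[i, j] < 0)
        good = abs(L[i, i]) + sum(L[i, j] for j in range(h)
                                  if j != i and L[i, j] > 0)
        for j in range(h):
            if j != i and L[i, j] < 0:
                Q[i, j] = 0.0
            elif good > 0:
                Q[i, j] -= neg * abs(L[i, j]) / good
    return Q

# Hypothetical non-embeddable TPM (default absorbing in the last row).
P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])
Q_da = diagonal_adjustment(P)
Q_wa = weighted_adjustment(P)
print(np.allclose(Q_da.sum(axis=1), 0.0), np.allclose(Q_wa.sum(axis=1), 0.0))
```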

Quasi-Optimization of the Generator (QOG).
The above two methods are unfortunately not optimal in any sense. The QOG (Quasi-Optimization of the Generator) method suggested in Kreinin and Sidelnikova (2001) relies on optimization and is therefore an improvement on the diagonal and weighted adjustment methods. QOG solves the minimization problem min_{Q∈Q} ||Q − log(P)||, where Q is the space of stable generator matrices and || · || is the Euclidean norm. Further, Kreinin and Sidelnikova (2001) provide an efficient algorithm to obtain Q.
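The minimization decouples row by row: each row of log(P) is projected in Euclidean norm onto the set {x : Σ_j x_j = 0, x_j ≥ 0 for j ≠ i}. The sketch below solves each row projection from its KKT conditions by bisection; this is a simple stand-in for the dedicated algorithm of Kreinin and Sidelnikova (2001), and the example TPM is hypothetical.

```python
import numpy as np
from scipy.linalg import logm

def project_row(a, i, tol=1e-12):
    """Euclidean projection of row a onto {x : sum(x) = 0, x_j >= 0, j != i}.
    KKT conditions give x_j = max(a_j - lam, 0) for j != i and
    x_i = a_i - lam; lam solves the (monotone) row-sum equation."""
    h = len(a)

    def rowsum(lam):
        return (a[i] - lam) + sum(max(a[j] - lam, 0.0)
                                  for j in range(h) if j != i)

    lo, hi = a.min() - 1.0, a.max() + 1.0   # rowsum(lo) > 0 > rowsum(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rowsum(mid) > 0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    x = np.array([max(a[j] - lam, 0.0) for j in range(h)])
    x[i] = a[i] - lam
    return x

def qog(P):
    L = logm(P).real
    return np.array([project_row(L[i], i) for i in range(L.shape[0])])

# Hypothetical non-embeddable TPM; QOG returns the closest generator.
P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])
Q_qog = qog(P)
print(Q_qog.round(4))
```

Note that the constraint set forces the diagonal to be non-positive automatically, since each row sums to zero with non-negative off-diagonal entries.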

Statistical algorithm: Markov Chain Monte Carlo
An alternative statistical algorithm one can adopt is MCMC (Markov Chain Monte Carlo). For the reader's convenience we have included a summary of MCMC in Appendix 2. It should be noted that all MCMC algorithms presented here use a so-called auxiliary-variable technique, introducing the fully observed Markov chain X as a random variable. Moreover, the prior for the entries of Q is Gamma(α, 1/β) (shape and scale), which is conjugate for the likelihood of a CTMC.

Gibbs sampling -Bladt & Sorensen 2005.
To simulate the Markov process X, Bladt and Sorensen (2005) suggest a rejection sampling method. As stated in Bladt and Sorensen (2005), such a sampling method runs into difficulties when considering low-probability events, since the rejection rate will be high (e.g. default for highly rated bonds). The MCMC algorithm is summarized as follows (see Inamura 2006): (i) Construct an initial generator Q using the prior distribution (Gamma(α_{ij}, 1/β_i)).
(ii) For some specified number of runs (1) Simulate X for each observation from Y , with law according to Q.
(2) Calculate the quantities of interest, K and S, from X. (3) Construct a new Q by drawing samples from Gamma(K_{ij}(t) + α_{ij}, 1/(S_i(t) + β_i)). (4) Save this Q and use it in the next simulation.
(iii) From the list of Qs, drop some proportion (burn in), then take the mean of the remainder.
The issues with this method are the choice of α and β and the number of runs required before we know that the sample has converged (burn-in). Both are critical in obtaining accurate answers from MCMC, and although Bladt and Sorensen (2005) suggested taking α_{ij} and β_i to be 1, they observe MCMC overestimating entries in the generator when the true entries were small. Furthermore, here we are required to use the TPM indirectly, through inferring company transitions. That is, we consider M companies in each rating and define the number of companies making the transition i to j as M × P_{ij}; this of course need not be an integer, but we can always normalize the entries. The reason we cannot use the TPM directly, as we did in the EM algorithm, is that MCMC is very sensitive to the values in the prior. The burn-in for MCMC will be of little concern to us here, as will become apparent when carrying out the analysis of the algorithms.
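A compact sketch of this Gibbs sampler follows. The parameter values, the rejection cap and the toy data are illustrative only, and paths that exhaust the rejection budget are simply skipped, which a production implementation would need to handle more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_path(Q, start, t_end):
    """Forward-simulate a CTMC path on [0, t_end]; return the jump-count
    matrix K, the holding times S and the end state."""
    h = Q.shape[0]
    K, S = np.zeros((h, h)), np.zeros(h)
    state, t = start, 0.0
    while True:
        rate = -Q[state, state]
        if rate <= 0.0:                     # absorbing state
            S[state] += t_end - t
            return K, S, state
        dt = rng.exponential(1.0 / rate)
        if t + dt >= t_end:
            S[state] += t_end - t
            return K, S, state
        S[state] += dt
        t += dt
        p = np.maximum(Q[state], 0.0)
        p[state] = 0.0
        nxt = rng.choice(h, p=p / p.sum())
        K[state, nxt] += 1
        state = nxt

def gibbs_generator(transitions, h, n_iter=200, burn_in=100,
                    alpha=1.0, beta=1.0, max_reject=500):
    """Gibbs sampler in the spirit of Bladt & Sorensen (2005): alternate
    between (a) rejection sampling of hidden paths matching the observed
    yearly (start, end) transitions and (b) conjugate Gamma draws for
    the off-diagonal intensities."""
    Q = rng.gamma(alpha, 1.0 / beta, size=(h, h))
    Q[h - 1] = 0.0                          # default is absorbing
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    draws = []
    for it in range(n_iter):
        K, S = np.zeros((h, h)), np.zeros(h)
        for (s, r) in transitions:
            for _ in range(max_reject):
                Kp, Sp, end = simulate_path(Q, s, 1.0)
                if end == r:                # accept matching endpoints only
                    K += Kp
                    S += Sp
                    break
        Q_new = np.zeros((h, h))
        for i in range(h - 1):
            for j in range(h):
                if j != i:
                    Q_new[i, j] = rng.gamma(K[i, j] + alpha,
                                            1.0 / (S[i] + beta))
            Q_new[i, i] = -Q_new[i].sum()
        Q = Q_new
        if it >= burn_in:
            draws.append(Q)
    return np.mean(draws, axis=0)

# Toy data: 20 observed annual transitions on 3 states (illustrative).
obs = [(0, 0)] * 8 + [(0, 1)] * 2 + [(1, 1)] * 7 + [(1, 2)] * 3
Q_mcmc = gibbs_generator(obs, 3, n_iter=40, burn_in=20)
print(Q_mcmc.round(3))
```

The high rejection rate for rare transitions mentioned above is visible here: the inner loop over `max_reject` attempts is exactly where low-probability endpoint constraints become expensive.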

Importance sampling -Bladt & Sorensen 2009.
Bladt and Sorensen (2009) address some of the issues in Bladt and Sorensen (2005) by running the same algorithm as previously, combined with an importance sampling scheme based on the Metropolis-Hastings algorithm (in essence a single-component Metropolis-Hastings algorithm). The proposal distribution suggested is a Markov chain with generator given by the 'neutral matrix' Q*, defined in terms of 1_h and I_h, the h-by-h matrix of ones and the identity matrix respectively, and a scaling factor W set to match the intensities in the true generator matrix Q. Bladt and Sorensen (2009) note that if entries in Q are known to be zero, then the corresponding elements in Q* should also be set to zero and the diagonal modified accordingly. Transitions rarely produced by the generated Markov chain will thus occur much more frequently under Q*, so one of the problems faced in MCMC is solved (at least partially). The importance sampling weights for a chain X are expressed in terms of the CTMC likelihood L. For the priors, Bladt and Sorensen (2009) do not suggest any significant improvement on their earlier work. The authors use α = 1 and β = 5, which they claim gives better results than the suggestion in Bladt and Sorensen (2005). However, this still poses a problem when dealing with entries in Q which are close to zero. The problem stems from the fact that very little information is available (rarely observed) for certain transitions; therefore, the output for these entries is mostly based on our prior beliefs. Inamura (2006) presents an alternative to the original MCMC algorithm of Bladt and Sorensen (2005), whereby one calculates the mode rather than the mean. The author claims that this gives extremely accurate results and outperforms other algorithms. The reasoning presented is that the standard MCMC overestimates in the small-probability cases due to the gamma distribution being 'skewed'; therefore the mode is a better estimate.
Inamura (2006) approximates the mode of {q_ij^(k)} by kernel smoothing over the estimates (after taking the log transform to ensure all results are positive).
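A minimal sketch of this mode approximation, assuming a Gaussian kernel density estimate on the log scale with a grid search for the maximiser (the smoothing details in Inamura (2006) may differ); `kde_mode` and the gamma draws are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_mode(samples):
    """Mode approximation in the spirit described above: kernel-smooth
    the log-transformed draws (keeping the estimate positive) and take
    the maximiser over a grid."""
    logs = np.log(samples)
    kde = gaussian_kde(logs)
    grid = np.linspace(logs.min(), logs.max(), 512)
    return float(np.exp(grid[np.argmax(kde(grid))]))

rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=0.05, size=5000)  # skewed 'posterior'
mode = kde_mode(draws)
```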

Remark 3.1 [Other MCMC-based estimators] Many extensions and different MCMC methods to solve this problem are possible (e.g. priors as hyperparameters or sequential Monte Carlo techniques). Here, we consider less complex MCMC algorithms, which already set the tone for a comparative study.

Benchmarking the algorithms
Due to the diversity of investments banks make, one cannot assess an algorithm's performance with a single test. With this in mind we consider a host of tests on different portfolios and matrices. The computations were carried out on a Dell PowerEdge R430 with four Intel Xeon E5-2680 processors. During the review process of our work we found Pfeuffer (2017) with an R implementation of some of the algorithms covered in the previous section. The performance tests of Pfeuffer (2017) are just a subset of those we present next and independently confirm (where there is overlap) our findings, in particular the timing of the MCMC algorithms versus the EM. A version of our algorithms will appear in the mentioned R package (see Remark 1.2).
The first observation we make is that transition matrices can vary substantially depending on the financial climate (see Christensen et al. 2004 and Cantor 2004). Therefore, we consider two different generator matrices, which can be thought of as the generator in financial stress and the generator in financial calm. In order to keep these matrices 'reasonable' we start from the generator given in Christensen et al. (2004), built using a large amount of data (see also Inamura (2006)), and consider one generator with generally higher transition rates and one with lower transition rates. Considering more than one generator provides a more detailed assessment of the performance of the various algorithms than other comparative reviews, such as Inamura (2006). The generators we consider are shown in tables 1 and 2. We observe that table 1 has more non-zero entries and larger entries than table 2.
Throughout the analysis we refer to the multiple MCMC algorithms introduced in Section 3 which we label in the following way: MCMC BS05 is Bladt and Sorensen (2005)

Sample size inference
The first test we consider is an extension of a test in Inamura (2006), where the author considers a true underlying generator and masks it by using it to simulate TPMs, which we view as observations, then applies the algorithms to each observation.
The key point here is that Inamura (2006) only simulates 100 companies per rating, and hence the outputted TPM is non-embeddable (has 0 entries for accessible jumps). This is an extremely useful test because it provides a fair and intuitive way to assess the performance of each algorithm; however, Inamura (2006) only considers one true generator and only one level of information, i.e. 100 companies per rating. Alongside the two different generators we also consider a range of companies per rating to determine its effect on convergence for each algorithm. Furthermore, Inamura (2006) uses seven years' worth of data; although one would likely have access to multiple years' worth of TPM data, it is unlikely that we would have seven years of transitions from the same generator. Hence we consider four years, which is more consistent with time homogeneity estimates for generators (see Christensen et al. 2004). We calculate our estimates for the generator as follows.
(i) Take a range of obligors per rating, [100, 200, 300, 500, 750, 1000], and 10 random seeds.
(ii) For each true generator simulate four one-year TPMs for each seed and for each obligors-per-rating category. Hence we have (#Years × #Obligor categories × #Random seeds × #True generators) simulated TPMs.
(iii) For each set of four simulated TPMs we estimate the generator with each algorithm. MCMC may take a long time to run; therefore we consider the time taken to carry out the first 10 runs and the total time taken, and if these exceed 180 or 18 000 seconds, respectively, the algorithm is deemed too slow and no result is returned. Note, MCMC algorithms use 3000 runs with a burn-in of 300. This is smaller than in Inamura (2006), for example; however, Inamura (2006) shows apparent convergence to the stationary distribution in a small number of iterations, and we observe a similar result.
(iv) Therefore, for each algorithm we have (#Obligor categories × #Random seeds × #True generators) estimated generators to analyse.
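The simulation design in steps (i)-(iii) can be sketched as follows, on a hypothetical toy generator with a much smaller state space than the rating matrices in tables 1 and 2; `simulate_tpm` is an illustrative helper, not the code used in the study.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(42)

# hypothetical toy generator: three ratings plus an absorbing default state
Q_true = np.array([[-0.11, 0.10, 0.01, 0.00],
                   [0.05, -0.20, 0.10, 0.05],
                   [0.00, 0.10, -0.30, 0.20],
                   [0.00, 0.00, 0.00, 0.00]])

def simulate_tpm(Q, obligors_per_rating):
    """Mask the true generator: draw one year of migrations for a cohort
    of obligors in each rating and return the empirical one-year TPM."""
    P = expm(Q)                                   # true one-year TPM
    h = Q.shape[0]
    counts = np.array([rng.multinomial(obligors_per_rating, P[i])
                       for i in range(h - 1)])    # observed migrations
    tpm = counts / obligors_per_rating
    return np.vstack([tpm, np.eye(h)[-1]])        # default row known exactly

tpm = simulate_tpm(Q_true, obligors_per_rating=100)
```

With only 100 obligors per rating, rare transitions typically receive zero counts, which is exactly how the non-embeddable observed TPMs described above arise.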
We analyse the estimated generators by considering the distance between the estimated and true generator in the Euclidean norm, and the difference in the one-year probability of default. All results presented have been obtained by analysing the estimated generator for each seed and then averaging. This gives a better picture of the average performance.

Convergence in Euclidean norm.
Our goal in this analysis is to consider the empirical rate of improvement of each algorithm as our 'information' about the true generator increases. For each obligor category we calculate the natural log of the distance (measured in the Euclidean norm) between the estimated and the true generator. The results are shown in figures 1 and 2.
Note the x-axis is on a logarithmic scale. We observe similarities between the two figures, most notably that in the case of low information all algorithms have very similar convergence results; however, as we increase the information there is substantial variation in improvement, and the MCMC BS09 algorithm does not improve as well as the other algorithms. Missing points stem from an algorithm failing the acceptance times. The MCMC algorithms have a potentially increased error due to the Monte Carlo simulation, and lowering it requires a larger computational expense for what is already the most expensive algorithm tested here. For the Bladt and Sorensen (2009) algorithm, the neutral-matrix approximation may give poor mixing, hence the additional error.

Error in probability of default.
Although the overall error is important, it does not provide details on the small-probability scale. This is extremely important in banking, since estimation of the probability of default is crucial. Using the same estimated generators as before, we calculate the corresponding one-year TPM, that is, we calculate exp{Q_estimate} (using the expm function in MATLAB) for each seed and then take the average. The averaged TPM default probabilities are compared to the true ones. To keep the numbers in the comparisons meaningful we plot the log of the relative error, defined as |PD_estimate − PD_true| / PD_true. The results are given in Figures 3 and 4. Unlike the overall error, there appears to be far greater volatility in the error estimation w.r.t. the probability of default.
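A minimal sketch of this error metric, assuming the relative error compares the default column of the matrix exponential of the estimated and true generators; the generators are toy values and SciPy's `expm` stands in for MATLAB's.

```python
import numpy as np
from scipy.linalg import expm

# hypothetical true and estimated generators (default = last state)
Q_true = np.array([[-0.11, 0.10, 0.01],
                   [0.05, -0.25, 0.20],
                   [0.00, 0.00, 0.00]])
Q_est = np.array([[-0.12, 0.11, 0.01],
                  [0.06, -0.28, 0.22],
                  [0.00, 0.00, 0.00]])

pd_true = expm(Q_true)[:-1, -1]      # one-year PD for each rating
pd_est = expm(Q_est)[:-1, -1]
log_rel_err = np.log(np.abs(pd_est - pd_true) / pd_true)
```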
Moreover, there appears to be no general downward trend in error for the investment-grade ratings. A likely cause is that, even with 1000 companies, there are still no or few investment-grade defaults. Of the algorithms, MCMC BS09 performs the worst. The EM algorithm, though, consistently has one of the smallest errors and is clearly the best in the investment grades. We have only shown the results for the unstable generator; the results for the stable generator were similar.

Time dependent probability of default
A key question that has not been addressed in the literature is how probabilities of default evolve in time across the several algorithms. For this, we only consider EM, QOG, WA and the MCMC Mode algorithm from Inamura (2006), since these algorithms gave the best probability of default estimates. We consider a non-embeddable TPM, then estimate the generator matrix Q; from Q we can easily calculate the probability of a company with some initial rating defaulting by time t > 0. The goal here is to assess how that probability changes with time. The TPM is given in table 4; for the MCMC algorithm we took this table to be generated with 250 obligors per rating.
The probability of default across ratings over the one-year time horizon is found in Figure 5. The plots give a deeper understanding of the algorithms themselves. As the probability of default increases the algorithms converge; however, in the case of fewer defaults we observe a much larger discrepancy. This can be thought of as the algorithm's ability to deal with missing data: in the lower grades we observe defaults and thus have a handle on the probability, whereas for AAA ratings we observe no defaults, and the output is therefore an approximation by the algorithm. This shows the difference between the methods and the potential prior dependence in the MCMC algorithm. What is also extremely interesting is that QOG sets the jump intensity in the generator from AA to C to zero (even though the TPM has a non-zero entry there); this implies QOG may in some places underestimate the risk for the investment ratings, as can be seen from the fact that QOG puts a smaller probability of default on AAA.
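The term structure of default probabilities used here can be sketched as follows, assuming the probability of defaulting by time t is read off the default column of exp(Qt); the generator values are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

# toy generator with an absorbing default state (hypothetical values)
Q = np.array([[-0.11, 0.10, 0.01],
              [0.05, -0.25, 0.20],
              [0.00, 0.00, 0.00]])

def pd_curve(Q, rating, times):
    """Probability that an obligor starting in `rating` has defaulted
    by each time t: the default column of exp(Qt)."""
    return np.array([expm(Q * t)[rating, -1] for t in times])

times = np.linspace(0.1, 1.0, 10)
curve = pd_curve(Q, rating=0, times=times)
```

Since default is absorbing, the curve is non-decreasing in t; comparing curves across estimated generators gives plots of the kind shown in Figure 5.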
To assess the performance of each algorithm we measure the error as follows, where Risk Charge Estimate(i) is the ith realization of the risk charge and N is the number of TPM sets (10 here). The results obtained by the algorithms are shown in table 9. There is a clear overestimation of the probability of default at higher grades by the WA and MCMC algorithms.

Risk charge
The previous tests have been rather theoretical; we now consider a practical test to assess the performance of these algorithms in calculating risk charges. We do not give much discussion to the calculation of these risk charges; for more technical details readers should consult texts such as Skoglund and Chen (2011). Here we consider multiple stylized portfolios to represent the risk appetites of different banks. To the best of our knowledge, analysis of how different risk measures react to different portfolio types has not been considered in the literature. The risk charges we consider are IRC (VaR at 99.9% with a three-month liquidity horizon including mark-to-market loss), IDR (VaR at 99.9% over one year, only considering default) and a theoretical risk charge which is IRC but measured using Expected Shortfall (ES) at 97.5%. The final risk charge is included because the Basel committee has shown an increasing interest in ES. We consider four years' worth of simulated data, and to keep the analysis realistic we consider 200 companies per rating. We consider three different portfolios corresponding to a risk-averse portfolio (all investment grade), a speculative portfolio (all speculative grades) and finally a mixed portfolio. The portfolios considered are given in tables 5-7. The tables show the values and ratings of the various bonds in each portfolio.
Alongside these portfolios we calculate the risk charges using the following information. • The interest rates we receive for a bond in each rating are: AAA: 2.65%, AA: 2.69%, A: 2.78%, BBB: 2.93%, BB: 3.18%, B: 5.45%, C: 12.39%. These figures are based on interest rates from Moody's and can be found in Section 4.1 of Skoglund and Chen (2011). Although these interest rates do not technically match the generators we are using for the TPMs, they provide reasonable interest rates for our toy example.
• We assume that all money is lost in the case of default (zero recovery rate). • We calculate credit migration using the one-factor† CreditMetrics model (Gupton et al. 1997), i.e. normalized asset returns follow r_i = β_i X + sqrt(1 − β_i²) ε_i, where X is the systematic risk, ε_i is the idiosyncratic risk, both standard normally distributed, and β_i is the correlation to the systematic risk, defined in Supervision (2003, p. 50) as a function of PD_i, the probability of default of asset i. Consequently, the higher PD_i, the lower the value of β_i.
• Although more sophisticated methods are available for the calculation of VaR and ES (see Fermanian 2014), we calculate the risk charges using Monte Carlo. This is sufficient here since the portfolios are small relative to a typical bank portfolio; therefore we can obtain accurate estimates using a reasonable number of simulations. • Again, we calculate 10 realizations of the TPMs and estimate a generator for each.
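The steps above can be sketched as follows. The two-bond portfolio and its PDs are hypothetical, the Basel-style correlation formula is an assumed form of the Supervision (2003, p. 50) expression, and taking β as the square root of that correlation is a modelling convention fixed here for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def basel_corr(pd):
    """Assumed form of the Supervision (2003, p. 50) asset-correlation
    formula: interpolates between 0.12 and 0.24, decreasing in PD."""
    w = (1 - np.exp(-50 * pd)) / (1 - np.exp(-50))
    return 0.12 * w + 0.24 * (1 - w)

# hypothetical two-bond portfolio: exposures and one-year PDs
values = np.array([100.0, 100.0])
pds = np.array([0.02, 0.10])
beta = np.sqrt(basel_corr(pds))      # loading on the systematic factor

n_sims = 200_000
X = rng.standard_normal(n_sims)                   # systematic factor
eps = rng.standard_normal((n_sims, 2))            # idiosyncratic factors
returns = beta * X[:, None] + np.sqrt(1 - beta ** 2) * eps
losses = ((returns < norm.ppf(pds)) * values).sum(axis=1)  # zero recovery

var_999 = np.quantile(losses, 0.999)              # VaR at 99.9%
threshold = np.quantile(losses, 0.975)
es_975 = losses[losses >= threshold].mean()       # ES at 97.5%
```

With only two bonds the loss distribution is 'step-like', which illustrates why small probability changes can move VaR by a whole exposure in small portfolios.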
We consider 15×10^5 simulations for each portfolio; to assess whether this was sufficient we calculated VaR and ES using 7.5×10^5, 10×10^5, 12.5×10^5 and 15×10^5 simulations and found the difference between 7.5×10^5 and 15×10^5 to be negligible.

† This is technically not the true regulation for the calculation of IDR, which requires a two-factor model; however, our goal here is only to use these calculations as a method for comparing algorithms.

It should be noted that in the stable IDR some algorithms produce a non-zero value for the investment portfolio; therefore, we have inserted the money value. The first observation we make is that all algorithms overestimate the risk for the investment portfolio. This is down to two key features: one is the 'step-like' nature of VaR, where, in a small portfolio, small probability changes can make a large difference; the other is that we are averaging over multiple Monte Carlo simulations, so one default in one of those realizations will change the overall average dramatically. For a typical bank portfolio this type of error should not be a problem, since we would be dealing with a far larger number of assets and hence would observe multiple defaults. However, the results still give a useful comparison between the algorithms. Although the MCMC algorithms can outperform the deterministic algorithms for the speculative grades, remarkably, the EM produces the best results in all categories. From the tests we have considered we conclude the EM to be the superior algorithm for this problem.

Error estimation of the EM algorithm
A major advantage of the statistical algorithms over their deterministic counterparts is that one can derive error estimates (confidence intervals) without the brute-force (slightly ad hoc) method of bootstrapping. For MCMC this comes from the posterior distribution, which we get for free. However, as we have seen, MCMC is computationally expensive, and we have derived a relatively cheap formula to calculate the confidence intervals from the EM algorithm.
In a similar fashion to the previous analysis, we now test the error estimate given by the EM. Again, we mask the true generator using simulated TPMs; however, here we only consider the scenario of 300 obligors per rating, while the number of years' worth of data is varied. That is, we simulate 50 years' worth of TPMs and then apply the EM algorithm using one year's worth, then two years', and so on up to 50 years. This analysis shows both the estimated error for the parameter and how the error changes as more information is added. It should also be noted that we return companies that have defaulted to the rating they held pre-default. This keeps the number of companies in the system constant and can be thought of as the flow of new companies being rated. Moreover, this is only one realization of the data, hence the parameter estimate and confidence intervals are not particularly smooth.
The transitions shown in Figures 6 and 7 were chosen to show a spectrum of the magnitudes in the generators; the other entries (not shown) are similar. The first point to make is that the true value of the parameter is almost always within the confidence interval, and the confidence interval shrinks as the number of years increases. One of the most important features, though, is that the confidence interval is only small when the EM is stable and close to the true parameter; hence the EM is not 'over-confident' but provides reasonable estimates of its own error. The final point is that, although some confidence intervals go below zero by a small amount, this only happens when the parameter is extremely close to zero initially, and furthermore, once more data are considered, all parameters have confidence intervals which are strictly positive. Figures 6 and 7 show a very important feature, namely, how much the estimate can vary with the data, especially when the parameter is reasonably small. Such analysis exposes the variance in the estimate and how much data are actually required before the estimate levels out. In this example the confidence intervals calculated from the information matrix appear able to capture this error. In view of future regulation it may be prudent to take such confidence intervals into account when considering risk charges.
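A minimal sketch of Wald-type intervals from the observed information, assuming the Hessian of the log-likelihood at the EM estimate is available; the parameter values and Hessian below are hypothetical.

```python
import numpy as np

def wald_ci(theta_hat, hessian, z=1.96):
    """Approximate confidence intervals from the observed information:
    invert the negative Hessian of the log-likelihood at the estimate
    and take +/- z standard errors on each parameter."""
    cov = np.linalg.inv(-hessian)
    se = np.sqrt(np.diag(cov))
    return theta_hat - z * se, theta_hat + z * se

# hypothetical estimate and (negative-definite) Hessian
theta_hat = np.array([0.10, 0.02])
H = np.array([[-400.0, 10.0],
              [10.0, -2500.0]])
lo, hi = wald_ci(theta_hat, H)
```

Note how the interval for the small parameter dips below zero, mirroring the behaviour discussed above for intensities that are close to zero.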

Connection to the global maximum.
A previous problem with the EM was that one could not be sure of the nature of the stationary point. However, we know the form of the Hessian, and therefore we can easily check whether this point is a maximum by assessing the eigenvalues of this matrix. Clearly, if we were not at a maximum, then it would be worth perturbing the outputted generator and rerunning the algorithm. As discussed in Remark 2.11, the question of a global maximum is very difficult in this setting.
Remark 4.2 One way that has been suggested to improve the chances of the EM converging to the global maximum is to start from multiple points. Here we can create starting points by setting, for each i ≠ j, q_ij ∼ Exp(λ) for an appropriate λ, then setting q_ii appropriately.
We tested the EM according to the above remark and found that, in every case considered, it always returned the same generator.
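The restart strategy of Remark 4.2 together with the Hessian check can be sketched as follows; `random_start` and `is_local_max` are illustrative helper names, and λ = 5 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_start(h, lam=5.0):
    """Random starting generator for EM restarts: off-diagonal rates
    drawn Exp(lam), diagonal set to minus the row sum, and the absorbing
    last row kept at zero."""
    Q0 = rng.exponential(scale=1 / lam, size=(h, h))
    np.fill_diagonal(Q0, 0.0)
    Q0[-1] = 0.0                     # absorbing default state
    np.fill_diagonal(Q0, -Q0.sum(axis=1))
    return Q0

def is_local_max(hessian):
    """A stationary point is a local maximum when the (symmetric)
    Hessian of the log-likelihood is negative definite."""
    return bool(np.all(np.linalg.eigvalsh(hessian) < 0))

Q0 = random_start(4)
```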

Conclusions and future research
In this manuscript, we built upon the closed-form expressions for the expected number of jumps and holding times of a CTMC with an absorbing state, over given observations, and used the results to derive a closed-form expression for the Hessian of the likelihood. This, coupled with the stronger convergence result, has elevated the EM algorithm to be the optimal algorithm for this problem.
Across the battery of tests carried out, the EM algorithm outperforms the competing algorithms. The EM is a tractable algorithm, slower than the deterministic algorithms but still several orders of magnitude faster than the Markov chain Monte Carlo alternatives (table 2). The statistical algorithms (EM and MCMC) embed a strong robustness property for the estimator, contrary to the deterministic algorithms: the likelihood is far less sensitive to small changes in the underlying TPM. In terms of estimating risk charges, the EM algorithm gives superior results in all scenarios.
On the more practical side, Figure 5 highlights that for lower ratings the algorithms produce essentially the same estimates for the probabilities of default, while a palpable difference emerges at higher ratings. Moreover, the error estimates in the EM may provide a sensible way to test, in effect, model risk.
Lastly, non-Markovian phenomena like rating momentum (see Lando and Skodeberg 2002) and appropriate models to tackle it will be addressed in forthcoming research.

A.1. Proof of Lemma 2.8
We now provide the proof of Lemma 2.8; all terms used have the same definition as when the Lemma was stated. Throughout we assume i ≠ h; thus from Assumption 2.7, P_Q(X(t) = j | X(0) = i) > 0 for all j ∈ {1, . . . , h} and t > 0. The first inequality we prove is the lower bound on the expected number of jumps. Following the assumptions in Lemma 2.8 and time homogeneity we make the observation. The above inequality holds because we are only considering X(0) = i, X(t) = j and not all possible combinations of start and end states; moreover, P_Q(K_ij ≥ 1 | X(0) = i, X(t) = j) ≤ Σ_{n=1}^∞ n P_Q(K_ij = n | X(0) = i, X(t) = j). We further observe. Thus the lower bound in inequality (2.7) can be easily obtained. We now prove the upper bound on the expected number of jumps. The first observation we make is that, for all ν ∈ {1, . . . , h},. To see this, let μ ≠ i, and denote by τ_i the first time the process enters state i (if P_Q(X(t) = i | X(0) = μ) = 0 for t > 0, then the result is trivial); by the law of total probability we find. The second term is zero. Then, using the Markov property, we obtain. Consequently, from this observation and (2.6), we obtain. Observe that. The numerator is easy to bound by considering the expected number of jumps out of i. The denominator requires further analysis. First, let n = |i − ν|; by Assumption 2.7 we can then go from state i to ν in n jumps, and w.l.o.g. let i ≥ ν (it will become clear that the ordering does not matter). If i = ν then P_Q(X(t) = ν | X(0) = i) ≥ e^{q_ii t}.
For i > ν, we use the Markov property to obtain P_Q(X(t) = ν | X(0) = i) ≥ ∏_{a=1}^{n} P_Q( X(at/n) = i − a | X((a − 1)t/n) = i − a + 1 ).
Conditioning on X making only one jump in each increment, we obtain. As n ≤ h and the terms are strictly smaller than 1, the sought result follows (independently of whether ν = i).
The last inequality to prove concerns the holding times. For P^u_{ii} > 0 we have E_Q[S_i(T) | P] ≥ P^u_{ii} E_Q[S_i(t) | X(0) = i, X(t) = i] ≥ P^u_{ii} t exp{q_ii t}, where the final inequality follows by simply considering the case of no jumps. We can then apply the bounds from Assumption 2.7 to complete the inequality.

A.2. Proof of Theorem 2.14
We recall from Wilcox (1967) and Tsai and Chan (2003) that for a square matrix M whose elements depend on a vector of parameters {λ_1, . . . , λ_r} (for r ∈ N), the following identity holds: ∂ exp(M)/∂λ_u = ∫_0^1 exp(sM) (∂M/∂λ_u) exp((1 − s)M) ds.
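A sketch verifying a Wilcox-type identity for the derivative of the matrix exponential, d/dλ e^{M(λ)} = ∫_0^1 e^{sM} (∂M/∂λ) e^{(1−s)M} ds (assumed to be the identity referenced), numerically against a finite difference on a toy generator:

```python
import numpy as np
from scipy.linalg import expm

# M(λ) = M0 + λE at λ = 0: toy generator M0, perturbation direction E
M0 = np.array([[-0.3, 0.2, 0.1],
               [0.1, -0.4, 0.3],
               [0.0, 0.0, 0.0]])
E = np.zeros((3, 3))
E[0, 0], E[0, 1] = -1.0, 1.0          # perturb one off-diagonal rate

# right-hand side: midpoint quadrature of the integral representation
n = 200
s = (np.arange(n) + 0.5) / n
rhs = sum(expm(si * M0) @ E @ expm((1 - si) * M0) for si in s) / n

# left-hand side: central finite difference of the matrix exponential
h = 1e-6
lhs = (expm(M0 + h * E) - expm(M0 - h * E)) / (2 * h)
```

The two sides agree to within the quadrature and finite-difference error, illustrating the identity that underlies the Hessian computation.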