The maximum entropy mortality model: forecasting mortality using statistical moments

ABSTRACT The age-at-death distribution is a representation of the mortality experience in a population. Although it proves to be highly informative, it is often neglected when it comes to the practice of past or future mortality assessment. We propose an innovative method to mortality modeling and forecasting by making use of the location and shape measures of a density function, i.e. statistical moments. Time series methods for extrapolating a limited number of moments are used and then the reconstruction of the future age-at-death distribution is performed. The predictive power of the method seems to be net superior when compared to the results obtained using classical approaches to extrapolating age-specific-death rates, and the accuracy of the point forecast (MASE) is improved on average by 33% respective to the state-of-the-art, the Lee–Carter model. The method is tested using data from the Human Mortality Database and implemented in a publicly available R package.


Introduction
The mortality experience of a population is well described by its hazard rates, and the age-pattern that has been well exploited in numerous methods of mortality modeling (Gompertz 1825;Makeham 1867;Siler 1983;Heligman and Pollard 1980) and forecasting (Lee and Carter 1992;Li and Lee 2005;Haberman et al. 2014). The main reasons why hazard rates have been used predominantly in modeling and forecasting is that they readily represent the change in the risk of death over age and time. In addition, the five components of the pattern of human mortality (infant, child, youth, adult and old age mortality) can be identified clearly in a graphical representation of the hazard curve; and a variety of time series models can be employed to extrapolate the identified trends over time.
Surprisingly, few mortality methods acknowledge that the probability density function of the distribution of deaths can be equally informative when compared to the hazard indices: and more than that it can give an immediate indication on key measure of longevity like how long a population lives on average and the degree of variability of ages at death. By employing statistical knowledge about the shape of the distribution, and how the level of mortality at a certain age is fully dependent on the levels of mortality at all the other ages, brings enormous advantages. When hazard rates are investigated, no conclusion can be made about mean age-at-death, or the inequality experienced by the population when it comes to death, or what is the prevalence of extraordinary long life spans (the outliers in old age mortality) or even summary statistics related to the mortality experience. For example, Figure 1 illustrates the transition in the age-at-death distribution for men living in England and Wales. We learn that in 1960 the population experienced a pronounced infant mortality level up to the age of 5, and the fact that it takes about 53 years in order for the first 10% of men to die. In 1960, only 10% of the male population had the chance of living beyond the age of 85. Analyzing the same distribution in 2016, we can see that infant mortality dropped to almost insignificant levels (an inspection on the logarithmic scale would still show differences over ages), and that it takes about 63 years to eliminate the first 10% of the male population through death. We also learn that more that 50% of men are surviving to age 85, and more than 10% of the population will surpass the age of 93.
The first attempt to describe the mortality pattern by analyzing the age-at-distribution was made by Pearson (1897), building on Lexis' work (1879), where different functions were being used to capture the components of the pattern of human mortality. Pearson used a skewed function, arguing that the skewness of old age mortality depends on the incidence of premature mortality. More recently, (Dellaportas et al. 2001) uses death counts to fit the Heligman-Pollard model with Bayesian methods, and Mazzuco et al. (2018) models mortality by fitting a half-normal and a skew-bimodal-normal distribution to the observed empirical age-at-death density function.
Despite being well suited to portray the dynamics of mortality patterns and to study longevity and lifespan variability, age-at-death distributions have generally been neglected in forecasting practice. The life table distribution of deaths, or f (x) 1 , is constrained ensuring that its elements add to the radix of the population, usually l 0 = 1, thus having x f (x) = 1. Time series extrapolations of its trends are likely to violate this assumption making burdensome the use of the same extrapolative methods as in the case of hazard rates. Notable attempts at forecasting the age-at-death distribution were made by Oeppen (2008) and Bergeron-Boucher et al. (2017) by adapting the Lee-Carter model to the Compositional Data Analysis framework (Aitchison 1982). While this approach solves the problem of respecting the unit sum constraint, it also requires changing the coordinate system from a Euclidean space to an Aitchison simplex (1986) which might hinder the interpretation of the results. Another novel idea is introduced by Basellini and Camarda (2019) by modeling the shifting and compression dynamics of the adult mortality distribution around the modal age at death with a 3-parameter function, parameters that can be extrapolated using standard time series models.
Inspired by the idea of forecasting mortality, given the information offered by the age-at-death distribution, we propose a novel approach to forecasting age-specific mortality levels by making use of statistical moments, i.e. the shape measures of a density. By employing the knowledge about the shape of the distribution mentioned earlier, and how the level of mortality at a certain age is fully dependent on the levels of mortality at all the other ages, brings an enormous advantage to the extrapolation methods based on death frequencies over the methods based on hazard or mortality rates extrapolation. One statistical moment describes one characteristic of the distribution it belongs to. For example, in the case of the distribution of deaths of a population the first moments correspond to: (i) the mean age at death or life expectancy; (ii) the variance which offers information about the inequality of the age at death; (iii) the skewness which states how concentrated the mortality is around young or old ages, and (iv) the kurtosis which indicates the weight of the tails or the presence of outliers (extreme old age) in the distribution. Beyond moments higher than the fourth the interpretation is limited, however these are relevant in the case of more complex distribution (multi-modal densities) which helps in fine-tuning the observed irregularities. Thus, for a distribution, the collection of all the moments uniquely determines its density function. In order to gain a perfect understanding of the underlying density function, one needs to have information about all the moments up to infinity. However a limited number of moments like mean, variance, skewness and kurtosis can already offer a good approximation of the shape of the probability density function of the underlying distribution of deaths, and therefore a good understanding of the levels of mortality experienced by a population at different ages.
The method proposed here considers the finite moment problem where a positive density, f (x), is sought from knowledge of a limited number of its power moments. We assess the evolution of several observed moments of the age-at-death distribution in order to forecast them by employing multivariate time series models; and we reconstruct the forecast distribution using the maximum entropy approach (Mead and Papanicolaou 1984) that relies on the average rate at which information is produced by a stochastic source of data or density function, i.e. information entropy. Reconstructing the density function from a set of predicted moments has the advantage of allowing accelerating/decelerating rates of mortality improvement over age and time. It also eliminates the necessity of altering the coordinate system as in Oeppen (2008). We will refer to this method as the maximum entropy mortality model (MEM).
The purpose of this study is to: • demonstrate that extrapolation methods based on death frequencies can be advantageous to methods based on mortality rates. We are also interested in showing that accurate forecasts of age-specific mortality levels can be obtained using statistical moments and the information provided by the age-at-death distribution; • compare the MEM against other well established mortality models and determine its newly added value; • validate the MEM predictions against a benchmark, that is, a simplistic trend extrapolation of the age-specific death rates (naïve model) in order to justify the increase in complexity of the proposed method. This objective is justified and inspired by Bohk-Ewald et al. (2018), a study where 20 major fertility forecasting methods are evaluated. The main findings show that across multiple measures of fertility forecast accuracy only four methods consistently outperform the naïve model.

Methods
In this section, we introduce our method of modeling and forecasting mortality and the main concepts required to understand its estimation procedure.

Statistical moments
The statistical moments are defined as the expected value of the n-th power of a random variable x.
The n-th moment, μ n , for a continuous density function, f (x), about a value c can be defined as: where a and ω refer to the range of the distribution. If variable c denotes the mean, it is said that μ n is the n-th moment about the mean or the n-th central moment. The moments can be computed about zero as well, in which case the moment is called a raw or crude moment. The normalized moment of a probability distribution is a central moment that is standardized. The normalization is typically a division by an expression of the standard deviation, σ , which renders the moment scale invariant. This has the advantage that such normalized moments differ only in other properties than variability, facilitating, e.g. comparison of the shape of different probability distributions. Then the normalized moment of degree n of a central moment isμ Both the raw and the normalized moments are used in this study. While the normalized moments are a good choice to serve as indices in time series extrapolation (as we will see in Section 2.4), the raw moments are more efficient in the estimation of the underlying density functions (Section 2.3). If one type of moments is known, all the others can be derived from these without losing their properties.

Information entropy
The concept of information entropy was introduced by Shannon (1948) in an effort to mathematically formalize the process of communication in computer science (between two devices). The word information, here, is used in a special sense and must not be confused with meaning. The information is seen as a measure of the numbers of the possible outcomes generated by a probabilistic process and it is defined as the logarithm of this value. Warren (1953) reveals that the quantity which uniquely meets the natural requirements that one sets up for information turns out to be exactly that which is known in thermodynamics as entropy and which is a measure of the degree of randomness.
To explore the numerical values of the entropy measure H, let us assume that the individuals in a population can die in one of the following two states: childhood (C) and adulthood (A). The associate probabilities would be f (C) for the first state and f (A) = 1 − f (C) for the second. It follows that: Since the logarithm of a number smaller that 1 is a negative value, the minus sign in the equation is added for convenience only, so that H is always positive. It turns out that the information entropy has its largest value, when the events of dying in childhood and adulthood are equally probable; that is when f (C) = f (A) = 0.5. Just as soon as one outcome becomes more probable than the other (e.g. adult mortality greater than childhood mortality) the value of H decreases. When one outcome is very probable (f (A) almost one and f (C) almost zero, say) the value of H is approaching zero, creating a situation in which the stochastic process of dying in different states becomes less random and the outcome almost certain. Therefore, information entropy is also a measure that can show the level of inequality experienced by individuals represented in a distribution such as age-at-death. A small value of the entropy indicates that everyone dies around the same ages, i.e. the population is characterized by a small degree of inequality in terms of age-at-death. If H is large one can say that the inequality is pronounced.
To generalize, the entropy of a random variable x with probability distribution function f (x) is the negative logarithm of the density function for the value, and can be written as: and where E is the expected value operator, I is the information content of x and b is the base of the logarithms used.

The finite moment problem
The problem of reconstructing a distribution from a given number of moments is not straightforward. It is known in the mathematical literature as the finite moment problem. The method has been extensively studied from a theoretical perspective and has practical applications in thermodynamics and quantum-physics. It can be regarded as a finite dimensional version of the Hausdorff moment problem (Hausdorff 1921;Shohat and Tamarkin 1943). Various methods for solving this problem have been proposed in the last decades, by making use of orthogonal polynomials (Chihara 1978), splines (John et al. 2007), or other numerical strategies (Frontini et al. 1990). All the procedures aim at constructing specific sequences of functions f N (x) which eventually converge to the true distribution f (x) as the number of moments N, used in estimation, approaches infinitŷ Equation (6) should be seen as a system of N+1 equations, where the momentsμ 0 , . . . ,μ N come from the f N (x) density. Taking advantage of the regularity of human mortality the reconstruction of a density function can be realized using a small number of moments, usually 3-6. And, a good fit of the true density is achieved by imposing a prior restriction of the class on functions where the solution is sought.
Here, we follow the maximum entropy reconstruction (MaxEnt) and the algorithm developed by Mead and Papanicolaou (1984) as a definite procedure for the construction of a sequence of approximations to the true density. This method is based on the information entropy given by the density function. As a strategy for finding the local maxima of the entropy functional L = L(f ), we employ the method of Lagrange multipliers, λ n for the n-th moment: The entropy, as defined in Equation 4, is maximized under the condition that the first N+1 moments, μ n , are equal to the true moments μ n , where n takes values between 0 and N. Functional variation of L with respect to the unknown density f (x) yields and the n−th raw moment Considering the availability of the first N+1 moments, the Equations (8) and (9) (that are closely related to Equation 6) should be viewed as a non-linear system of N+1 equations for the unknown Lagrange multipliers λ 0 , λ 1 , . . . , λ N . If we assume that the density f (x) is normalized such that the first moment is always equal to 1 (μ 0 = 1, i.e. respecting the unit sum constraint), the first equation in (9) then reads and results in the first Lagrange multiplier, λ 0 , being expressed in terms of the remaining Lagrange multipliers: The system of equations then reduces to For a numerical solution, one introduces = (λ 1 , λ 2 , . . . , λ N ) through the Legendre transformation where the μ n 's are the true statistical moments of the underlying density. The stationary points of the potential are solutions to the equations which is the solution to the finite moment problem. The convexity of guarantees that if a stationary point is found for some finite values of λ 1 , λ 2 , . . . , λ N it must be a unique absolute minimum. A more detailed description and analytic demonstration of the method and also alternative algorithms for solving the finite moment problem can be found in Mead and Papanicolaou (1984). Figure 2 shows the observed and reconstructed distribution of deaths using different numbers of observed statistical moments for the male population living in the USA in 1990. This distribution of deaths is a good study case because it is characterized by a pronounced level of mortality in the first year of life, an accident hump around age 20 and an exponential increase in adult and old age mortality. More recent distributions exhibit less pronounced local maxima or modes, therefore making them easier to estimate. Knowing only the first two moments, mean and variance, is not sufficient to obtain a good reconstruction of the underlying distribution. The obtained coverage, i.e. the common surface of the observed and estimated distributions, would be around 80%. However, the more moments we employ in the estimation procedure the bigger the coverage becomes. The results indicate that six moments are enough to obtain a coverage above 96% where the infant and adult mortality is captured adequately. We note here that above a large enough coverage level, the measure does not necessarily indicate a more accurate approximation of the true distribution, but better identification of the main body of the distribution.

The maximum entropy mortality model
The idea behind our forecasting method is simple. The future age-specific levels of mortality for a population are determined by extrapolating a limited number of statistical moments given by the available life table age-at-death distributions. The extrapolation is realized with multivariate time series models. The age-at-death distribution is estimated at any point in time from the predicted moments using the MaxEnt algorithm (introduced in Section 2.3).
Prior to generating future realizations, the moments of ordinal 3 and higher are normalized, and the logarithmic transformation is applied to the absolute values of all observed moments. This ensures that the relevant shape measures remain positive on any forecasting horizon (e.g. the mean and the variance of the distribution). The period index of interest to be used in forecasting, with first order differences, can be defined from the empirical central moments,μ n,t in Equation (2), as follows: We are using a multivariate random-walk, with a vector of drift parameters θ n , to drive the dynamics of the multiple period indices, so that y n,t = θ n + y n,t−1 + ε n,t with t = 1, 2, . . . , τ and n = 1, 2, . . . , N, where ε n,t ∼ N(0, ), with = CC . C represents the Cholesky factorization matrix of the variancecovariance matrix . The parameters θ n and are estimated by ordinary least squares (OLS). A similar model is used by Haberman and Renshaw (2011) to generate trajectories of the multiple period indices in various mortality models. Once the y n,t forecasts are obtained, one can compute the statistical moments, estimate the distribution of deaths at time t using MaxEnt, and derive any other life table indicator by applying standard life table calculations (Preston et al. 2001).

Prediction intervals
We simulate prediction intervals for the indices of interest using an algorithm, which makes full allowance for the forecast error generated by the multivariate random-walk model.

The data
The data source used in this article is the Human Mortality Database (2018), which contains historical mortality data for 43 different countries and territories. HMD constitutes a reliable data source because it includes high-quality historical mortality data that was subject to a uniform set of procedures, guaranteeing the cross-national comparability of the information.
In order to test and illustrate the performance of the method, we fit the model using life table death counts for the male population in England and Wales between 1960 and 2016. Additional results for various countries are presented in Appendix 2.

Model comparison
In addition to the MEM model, the following mortality models are evaluated and used to forecast mortality: • The multivariate random-walk with drift model: This model represents a simple linear extrapolation of the logarithm of the age-specific death rates, m x , based on the first and the last observed values in the multivariate time series.
• The Lee and Carter (1992) mortality model: which is a numerical algorithm to estimate the age-specific effects α x and β x and employs the singular value decomposition (SVD) to derive a univariate time series vector k t , that becomes the main leading indicator of future mortality. • The Hyndman and Ullah (2007) functional mortality model: This model is an extension of the Lee-Carter model where the sum term allows for smooth functions of age and σ t,x allows the amount of noise to become age-specific. • The Renshaw and Haberman (2006) model -a cohort-based extension of the Lee-Carter model: The model incorporates a cohort effect γ t−x to the Lee-Carter predictor to better capture discontinuities in mortality trends. We are using the simplified and more stable version of the model where β x = 1 at all ages, as suggested in Haberman and Renshaw (2011). • The Oeppen (2008) model -the compositional-data mortality model: Again, a variant of the Lee-Carter model where the index of interest to be modeled is the life table age-at-death distribution f (x), subject to a clr transformation (instead of a logarithmic one). For all models, ε x,t are independent and identically distributed random variables (iid) normally distributed with mean zero.
Thus, six models are evaluated. Four of them are targeting the log-transformed death rates, log m x , and the other two (Oeppen and MEM) are focusing on modeling the age-specific frequencies of the age-at-death distribution or mortality data in compositional format. We mention that the randomwalk model is chosen because of its simplicity, the Lee-Carter and Hyndman-Ullah methods are included in comparison because of their popularity and acceptance in the demographic and actuarial literature. The Renshaw-Haberman is a good example of how the projections can be improved by controlling for cohort effects and finally the Oeppen model is selected because of its similarity with the MEM model (mainly, the f (x) focus). Discussing the advantages and the features of the models used in comparison is beyond the scope of this article. For more details about the models, the readers are referred to the original articles.
The analysis is performed using the R programming language (R Core Team 2018). The Lee-Carter and the Hyndman-Ullah models are fitted and forecasted using the demography R package (2017) and the Renshaw-Haberman implementation is adopted from the StMoMo software (2015). The source code of the other three models can be downloaded and installed in the form of an R software package from the authors' GitHub repository.

Evaluation and predictive power measurements
We assess the performance of the proposed MEM method and the other five mortality models based on forecasts of life expectancy at all ages 2 as compared to the observed life expectancies. All the models are fitted and evaluated over the 0-95 age range. Since a perfect fit of the data can always be obtained by using a model with enough parameters, and due to the fact that a good fit does not guarantee good forecasting performance (Hyndman and Athanasopoulos 2018), we will not evaluate the models based on their ability to fit the historical data. The models are evaluated based on the out-of-sample forecasting performance over the observed data, where we refer to sample as the data-set used in fitting. The predictive power is our ultimate goal, translated into a high degree of accuracy of forecast trajectories.
For each scenario and model, we are computing a matrix of accuracy errors. The overall accuracy of the forecast for a specific model-scenario is the average of all the values in the corresponding matrix. For a given population multiple scenarios of equal dimension (see Section 3.4) are explored in order to account for the robustness of the models. The aggregation is done by averaging the results over all scenarios for each model/accuracy-measure/country. The sMRAE and MASE are accuracy measures computed relative to a benchmark model. In the case of sMRAE, the reference model in our study is the multivariate random-walk with drift, because we consider this model to be the simplest reliable method to extrapolate age-specific death rates. The MASE measure assesses the accuracy of a forecast with reference to a simple 1-step random-walk model (without drift). For all the presented accuracy measures, except ME, a smaller value is preferred over a larger one. A model performs better in terms of ME, compared to another model, if the obtained value is closer to zero, i.e. the smallest value in absolute terms. Because the six measures are evaluating the accuracy by analyzing different aspects of the realized forecasts, it is possible but not mandatory to obtain a different classification of the model performance, depending on the considered measure. We computed a general classification (GC) of the resulted accuracy performance of the analyzed models by considering the median classification over the six measures for each model. The best performing model is marked with: rank (1). A detailed description of the accuracy assessment process is provided in Appendix 1.

Out-of-sample forecasting strategy
The period between 1960 and 2016 is long enough and relevant at the same time for assessing the predictive power of the estimated models. A typical testing scenario would use sub-periods of 20 years of data to fit/train the models, and subsequent periods of 20 years of data to forecast and validate the results. Multiple testing scenarios are defined by rolling forward the training/validation windows in steps of 1 year. This strategy will be called: 20-20-1. Therefore, considering our timeline, in the first scenario we will use data from 1960 to 1979 for fitting the models. The resulting models will be used to predict mortality for the period between 1980 and 1999, and validate the forecasts against the observed values in the same period. We will refer to this as the 1960-1979-1999 scenario. By moving the evaluation windows 1 year forward, the second scenario would be 1961-1980-2000 1977-1996-2016. In total, 18 scenarios have been defined, containing equal fitting and forecasting period lengths, making possible in this way the aggregation (by averaging) and comparison of the accuracy results over all scenarios in addition to the specific scenario results.

Results
Across multiple measures of forecast accuracy computed based on the mortality experience of the male population living in England and Wales, we find that the predictive power of the MEM method stands out when compared to the other models. The back-testing of the male mortality in England and Wales indicates that only the Renshaw-Haberman model returns slightly more accurate results. Table 1 displays a summary of the aggregated measures over 18 defined scenarios. We also learn that the differences between the multivariate random-walk with drift, Lee-Carter and Hyndman-Ullah models are insignificant in the case of this population; that the Oeppen method consistently offers better results among the Lee-Carter type models, obtaining the third position among the best performing models in this study.
When the models are tested over the mortality experience of multiple populations using the same strategy we discovered a similar patter, namely, in the majority of the cases the methods based of age-at-death distribution (MEM and Oeppen) would be ranked as the best performing models. The performance of the Renshaw-Haberman model is interesting to observe, because in some countries like the Netherlands, Sweden and England and Wales, it is able to capture the cohort effects and successfully extrapolate them into the future predicting adequately the true mortality experience. However, the same strategy applied in France, Sweden or the USA completely places the model on the last position in the general classification and the reliance on observed cohort effects can create more damage to an improvement of the forecasts. The results are shown in Table 2 and Appendix 2.
The projected trajectories given by these methods can be inspected across different life table indicators, which are equivalent in the sense that they represent the same level of mortality. In Figure 4, we show the resulting mean trends in life expectancy at age 0, 25, 45, 65, 75 and 85; and in Figure 5 the trends in central death rate at the same ages are represented. It is noticeable that the MEM can cope with different levels of mortality improvement over age and time. This is the main reason why the model is able to return significantly better forecasts.

How many moments to use in MEM forecasting?
In general, the number of statistical moments to be considered in the MEM model depends on the regularity of the age-at-death distribution in the population of interest. The more moments employed, the more accurate the estimation of the underlying distribution becomes. However, the cost of using a larger number of moments is paid in processing speed and the likelihood of convergence of the MaxEnt algorithm. For seven or more moments, a more complex time series model for moment extrapolation might be required. We tested the MEM models of order 2-6, that is models that are estimated based on the first 2, 3 until 6 moments (plus μ 0 ). The testing was carried out in the same manner and over the same scenarios as in Section 3.5. From the results in Table 3, we learn that only the MEM-2 is disqualified by the benchmark model, all the other variants of the MEM return significantly better results.

Conclusion and discussion
The maximum entropy mortality model represents a new approach to modeling age-specific mortality levels. Nevertheless, its novelty refers only to the authors' idea of defining an algorithm to forecast mortality using well established methods and concepts like statistical moments, information entropy, the finite moment problem and the multivariate random-walk with drift introduced decades or centuries ago. All these individually have an important application in multiple scientific fields.
The main advantage of the MEM is that the possible forecast age-specific trends are no longer based on the assumption of constant changes in mortality as in the Lee-Carter model. A different speed of improvement can be predicted across ages by taking into account the observed dynamics of the distribution of deaths and the changes in its shape and location. This makes possible the identification of the location of the longevity risk across the age-axis. The model has the required features to predict the different rates of change in life expectancy at different ages, e.g. by maintaining a linear increase in life expectancy at birth and at the same time inducing accelerating rates for ages above 65. This result is consistent with the observed trends in the past and across many developed countries. The method was not designed to capture cohort effects, instead our approach is purely statistical. However, the results presented here show that MEM can capture cohort effects more accurately than most of the other models. This is partially due to the approach of modeling mortality in a compositional framework and focusing on the death distributions (i.e. if a life is saved at age x it must be placed back in the death-distribution at later age, say x+t).
Even if it is not shown here, the model introduced in this article is flexible enough in the sense that different covariates, influencing the mortality dynamics, can be considered in order to further improve the predictions. Examples of such covariates are information on the prevalence of smoking or obesity but also the trends in life expectancy at birth and modal age at death. This can be done by extending the time series model used in extrapolation (the multivariate random-walk with drift) by attaching extra cause-specific parameters. Similarly, a wide range of multivariate autoregressive time series models can be used to capture the coherent trends given by the observed empirical statistical moments, however this is subject to the requirements imposed by the available data.
Including higher order moments in the prediction can sometimes increase the importance of relatively small effects seen in the age-at-death distribution such as the accident hump for males or infant mortality. In countries with a low level of mortality, four moments can return pertinent results but in countries that still exhibit a pronounced multi-modal distribution of deaths a larger number of moments might be required. The coverage proportion, introduced in the article, is a very good measure for studying the models ability to estimate the shape of the true distribution, but it has an important drawback: two errors of the same magnitude at different ends of the distribution will be assigned the same weight in the measure. This is a drawback because underestimating or overestimating the force of mortality at younger ages has a higher impact on the general level of mortality of a population than underestimating or overestimating the force of mortality at an older age. If we investigate the evolution of longevity by looking at life expectancy, we can see that relying solely on information given by the mean age at death and life span disparity in order to generate forecasts is an endeavor doomed to failure (as showed in Table 3). The model would mostly return pronounced pessimistic or overoptimistic results across ages without a correspondence to reality. A real improvement can be noticed if 4-6 moments are used.
Other possible extensions of the MEM method worth exploring in the future are the use of an optimal weighting scheme which decreases the importance of higher order moments or the use of various smoothing methods that can be applied to the data prior to computing the observed moments and fitting the model.
An important finding revealed in this study is the superiority of the age-at-death distribution based extrapolation methods, like the MEM and Oeppen (2008) over the death rate-based extrapolation methods.

Reproducible research
The presented model and algorithm is implemented using the R programming language (R Core Team 2018) and can be downloaded and installed in form of an R software package from authors GitHub repository (https://github.com/mpascariu/MortalityForecast). The results and figures for the countries presented in the article can be reproduced using the code and data saved in the MortalityForecast package -version 0.8.0.
The discrete approximations follow the method presented in Preston et al. (2001) and implemented in the MortalityLaws software package.
where δ(x, t) is the distance between the observed and predicted life expectancy calculated as, The aggregate accuracy measures are computed over a forecasting window of 20 years and over the 0-95 age range, i.e τ = 20 and ω = 95. The aggregation process is done by averaging the accuracy results over age and time. To further simplify the notations, we define algebraically the mean operation of a random variable z(x, t) as: Then the six accuracy measures can be summarized as: • ME -Mean Error For a specific country and model, multiple scenarios of equal dimension (fitted/forecasted years) are explored in order to account for the robustness of the models. Usually, 18 scenarios per country. The aggregation is done by averaging the results over all scenarios for each model/accuracy-measure/country.