Modeling the Extremes of Bivariate Mixture Distributions With Application to Oceanographic Data

Abstract There currently exist a variety of statistical methods for modeling bivariate extremes. However, when the dependence between variables is driven by more than one latent process, these methods are likely to fail to give reliable inferences. We consider situations in which the observed dependence at extreme levels is a mixture of a possibly unknown number of much simpler bivariate distributions. For such structures, we demonstrate the limitations of existing methods and propose two new methods: an extension of the Heffernan–Tawn conditional extreme value model to allow for mixtures and an extremal quantile-regression approach. The two methods are examined in a simulation study and then applied to oceanographic data. Finally, we discuss extensions including a subasymptotic version of the proposed model, which has the potential to give more efficient results by incorporating data that are less extreme. Both new methods outperform existing approaches when mixtures are present.


Introduction
The dependence between multivariate response variables is often driven by their co-dependence on one or more underlying driving processes. For example, the strength of the relationship between wind speed and wave height depends on a combination of wind direction, land shadows, atmospheric pressure systems, and their underlying driving processes. Often, these driving processes are unknown or unobserved, or both. Even if the driving processes are known, the interactions between them are likely to be highly nonlinear, and without specific knowledge of the physical processes that drive the response, the dependence structure can appear highly complex. Consequently, building parsimonious statistical models that capture well the complex, multilayered data-generating mechanisms is difficult. One possible form of a complex dependence structure occurs when the joint distribution is a mixture of two or more simpler jointly distributed random variables. Such situations are the focus of this article, since the limiting assumptions of standard multivariate extreme value theory do not hold at nonasymptotic levels for problems of this type.
We are motivated by an oceanographic application with complex extremal dependence structures. In the design of offshore facilities, it is crucial, for both safety and reliability reasons, to protect against the most extreme storms. Hence, it is necessary that the extremal dependence structures between the multiple physical hazards that may co-occur during a storm are well understood. We consider a synthetic response variable (Ross et al. 2020) that is a function of significant wave height H_S and wave period T_2, illustrative of the response of floating offshore structures to wave loading. In particular, these synthetic response variables increase with increasing H_S and for T_2 approaching a resonance frequency. The definitions of H_S and T_2 are given in Holthuijsen (2010).
We consider (T_2, H_S) data from a location in the northern North Sea, see Figure 1. There are two different types of waves present: swell waves (relatively large wave period compared to significant wave height) and wind-generated waves (relatively small wave period compared to significant wave height). These correspond to waves generated on a large spatio-temporal scale and to local wind-generated waves, respectively. Without expert knowledge, we are unable to identify which wave type each observation corresponds to. We propose and test two novel methods to make inference on data for which the dependence structure arises in this way, and for which it is either nontrivial or impossible to identify the process from which each observation has been generated.
It is standard practice in multivariate extreme value analysis to assume that observations are drawn from a common distribution and to use limit theory as a basis for extrapolation to give estimates of the probability of rare events. Traditionally, separate classes of extreme value models have been developed for bivariate data that exhibit either asymptotic dependence or asymptotic independence, see Coles, Heffernan, and Tawn (1999) for an overview. Two random variables X and Y are asymptotically dependent if the probability that they are both large is of the same magnitude as the probability that one is large, that is, if χ := lim_{u→1} P(F_Y(Y) > u | F_X(X) > u) > 0, where F_X and F_Y denote the distribution functions of X and Y. If this limiting quantity χ = 0, then we say that the variables are asymptotically independent. In a multivariate setting, it is possible that some subsets of variables are asymptotically dependent and others are asymptotically independent but not completely independent, see for example Simpson, Wadsworth, and Tawn (2020). Many extreme value methods, for instance, Coles and Tawn (1994), Joe (1994), Capéraà, Fougères, and Genest (1997), Naveau et al. (2009), and Genest and Segers (2009), are based on the assumption of multivariate regular variation. This means that they cannot model the distribution well in regions where some variables are large and others are not. Heffernan and Tawn (2004) introduced the more flexible conditional extremes model, which can handle this situation. In particular, in the bivariate case their model provides a description for Y|X (X|Y) in the region where X (Y) is large. More precisely, when X and Y are transformed marginally to follow standard Laplace distributions, this model takes the form of a heteroscedastic regression, the parameters of which are estimated using only observations for which X (Y) is large, see, for example, Keef, Papastathopoulos, and Tawn (2013).
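As an illustration of how the dependence measure χ can be examined empirically, the following sketch computes its finite-level analogue χ(u) = P(F_Y(Y) > u | F_X(X) > u) from ranks. The function name `chi_u` and the rank-based approximation of the margins are our own illustrative choices, not part of any cited method.

```python
import numpy as np

def chi_u(x, y, u):
    """Empirical estimate of chi(u) = P(F_Y(Y) > u | F_X(X) > u);
    the limit of chi(u) as u -> 1 is the measure chi in the text."""
    n = len(x)
    # ranks / (n + 1) approximate the margins F_X(X) and F_Y(Y)
    fx = np.argsort(np.argsort(x)) / (n + 1.0)
    fy = np.argsort(np.argsort(y)) / (n + 1.0)
    return np.mean((fx > u) & (fy > u)) / np.mean(fx > u)

rng = np.random.default_rng(1)
z = rng.normal(size=5000)
dep = chi_u(z, z, 0.95)                      # identical variables: chi(u) = 1
ind = chi_u(z, rng.normal(size=5000), 0.95)  # independent: chi(u) near 1 - u
```

For asymptotically dependent data, χ(u) stabilizes at a positive value as u increases; for asymptotically independent data it decays toward zero.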
However, despite the flexibility of the Heffernan-Tawn model, the "vanilla" version does not provide good estimates when the data contain nontrivial mixture structures such as that shown in Figure 1, in which we can see that significant wave height conditional on wave period grows in two different ways as wave period increases. Sometimes both variables are relatively large at the same time (wind-generated waves), but at other times only the wave period is high and significant wave height takes on more moderate values (swell waves). As will be seen in Section 2.2, this is an indication that the vanilla Heffernan-Tawn model is not likely to be suitable in practice.
Our proposed solution to this limitation is a mixture formulation of the Heffernan-Tawn model which assumes that, after a suitable transformation, Y|X can be captured by a mixture of K = 1, 2, 3, . . . regressions each associated with a different probability weighting. Whilst we take K to be unknown, it will not be treated as a parameter, rather we use model selection methods to select the most appropriate value for K, and do not account for its uncertainty subsequently. A second method comprising a novel quantile-regression-based mixture approach is also developed. This uses the same parametric form for the conditional quantiles as proposed in the mixture formulation of the Heffernan-Tawn model. It differs from the Heffernan-Tawn mixture formulation mainly in the sense that it is more flexible due to its semi-parametric nature. We find that for bivariate data both methods show similar performance in estimating probabilities of extreme sets, that is, P((X, Y) ∈ A) where A is extreme in X or Y or both, with the uncertainty assessed through the use of a semiparametric bootstrap. The main advantage of the quantile-regression approach is that it leads to more stable fits, that is, it gives more consistent results for small sample sizes. However, unlike the mixture formulation of the Heffernan-Tawn model, it does not extend naturally beyond bivariate data. We also discuss a subasymptotic version of the Heffernan-Tawn mixture model motivated by a theoretical example where the mixture probabilities vary with the level of extremity. However, we find that this model does not perform better in modeling the dependence structure of (T 2 , H S ) than the Heffernan-Tawn mixture model.
It is worth noting that mixture models have previously been used for multivariate extremes. Boldi and Davison (2007) focus on a mixture of asymptotically dependent variables. Simpson, Wadsworth, and Tawn (2020), Chiapino, Sabourin, and Segers (2019), and Engelke and Ivanovs (2020) investigated d-dimensional random variables and describe how to estimate on which of the 2^d − 1 subspaces the limiting spectral measure has mass. If we let d = 2 and condition on one variable being large, then the resulting two subspaces correspond to asymptotic dependence and asymptotic independence. So, when one mixture component is asymptotically dependent and one asymptotically independent, there are strong similarities between the problem here and that considered by these authors, although only Simpson, Wadsworth, and Tawn (2020) considered the form of the asymptotic independence term in the mixture. However, none of the existing work considers the situation with two or more different levels of asymptotic independence.
The article is organized as follows. In Section 2, the current conditional extremes modeling framework and its extension incorporating mixture structures are discussed. Section 3 introduces the two inference methods that exploit the framework from Section 2. A subasymptotic version of the conditional extremes mixture model is discussed in Section 4. Finally, the oceanographic application is presented in Section 5. To conserve space, a simulation study is found in the supplementary material. Code and data are published at https://github.com/stantendijck/HTMixtureModel.

Heffernan-Tawn Model
Let (X, Y) be a real-valued bivariate random vector on standard Laplace margins. If (X, Y) do not follow such marginal distributions, then transformation to standard Laplace margins is achieved using the probability integral transform (Keef, Papastathopoulos, and Tawn 2013). The Heffernan-Tawn (HT) model, from Heffernan and Tawn (2004), extrapolates the joint distribution of (X, Y) to the region of the sample space where either one, or both, of the variables is extreme. Without loss of generality, we present the model for the case in which X is extreme; the full bivariate model requires an equivalent definition for the case of Y being extreme.
The underlying principle for the model is that there exist parameters α and β, with |α| ≤ 1 and β < 1, such that the normalized conditional random variable

Z_x := (Y − αx)/x^β | {X = x},     (1)

converges in distribution to a nondegenerate random variable Z with distribution function G(z) as x → ∞. Under assumptions relating to convergence and existence of joint densities, Heffernan and Resnick (2007), Resnick and Zeber (2014), and Wadsworth et al. (2017) showed that this implies that, in the limit, the random variables Z and X − u are independent conditional on X > u, that is,

P(Z ≤ z, X − u > x | X > u) → G(z) exp(−x) as u → ∞.     (2)

To ensure identifiability of α and β, Keef, Papastathopoulos, and Tawn (2013) imposed the condition that lim_{z→∞} G(z) = 1, that is, no mass of Z is at {+∞}, although there can be mass at {−∞}. They also impose additional constraints on α and β such that the implied distribution of Y is not inconsistent with its marginal distribution. We call these the Keef constraints. Inference is performed by assuming that the limiting relation (2) holds exactly above a finite threshold level u. Specifically,

Y | {X = x} =D αx + x^β (μ + σ Z̃) for x > u,     (3)

where =D represents equality in distribution and Z̃ is the standardized residual random variable, that is, Z̃ = (Z − μ)/σ, with μ the mean of Z and σ > 0 its standard deviation. The model parameters α and β are typically inferred by assuming that the distribution of Z is Gaussian. The four HT parameters (α, β, μ, σ) can then be estimated using any method of statistical inference. Conditional on the estimated parameters, the distribution of the residual random variable is estimated nonparametrically using the empirical distribution of Z_x for x > u. Finally, u is chosen as low as possible such that estimates of the HT parameters are approximately unchanged at any higher threshold (an analogue of univariate threshold-stability plots), whilst ensuring that X and Z are independent given X > u.
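A minimal sketch of the inference step just described: fit (α, β, μ, σ) by maximizing the Gaussian pseudo-likelihood of Y | X = x above a threshold. The function `fit_ht`, the starting values, and the simulated data are our own illustrative assumptions; the Keef constraints are omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def ht_negloglik(theta, x, y):
    """Negative Gaussian pseudo-log-likelihood of the HT working model
    Y | X = x ~ N(alpha*x + mu*x^beta, (sigma*x^beta)^2) for x > u."""
    alpha, beta, mu, sigma = theta
    if not (-1.0 <= alpha <= 1.0 and 0.0 <= beta < 1.0 and sigma > 0.0):
        return np.inf                      # outside the parameter space
    mean = alpha * x + mu * x**beta
    sd = sigma * x**beta
    return 0.5 * np.sum(np.log(2 * np.pi * sd**2) + ((y - mean) / sd)**2)

def fit_ht(x, y, u):
    """Estimate (alpha, beta, mu, sigma) from the exceedances x > u."""
    keep = x > u
    res = minimize(ht_negloglik, x0=[0.5, 0.2, 0.0, 1.0],
                   args=(x[keep], y[keep]), method="Nelder-Mead")
    return res.x

rng = np.random.default_rng(2)
x = rng.exponential(size=4000) + 2.0             # stand-in for large Laplace X
y = 0.7 * x + x**0.3 * rng.normal(size=x.size)   # true (alpha, beta) = (0.7, 0.3)
alpha_hat, beta_hat, mu_hat, sigma_hat = fit_ht(x, y, u=2.0)
```

In practice, the Gaussian assumption is discarded after estimation and the residual distribution is taken to be the empirical distribution of the fitted residuals.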
Simple interpretations of the parameters α and β exist: (i) (α, β) = (1, 0) implies asymptotic dependence between X and Y with χ = ∫_0^∞ (1 − G(−t)) exp(−t) dt; (ii) α < 1 implies asymptotic independence between X and Y; (iii) β > 0 implies that the variability of Y given X = x grows as x grows; and (iv) 0 < α < 1 (or −1 < α < 0) implies positive (or negative) association between X and Y. In environmental applications, β < 0 is unrealistic, as this imposes that all the quantiles of Y conditional on X converge to the same value as X grows to infinity, so we restrict 0 ≤ β < 1.
We explore a variant of the Heffernan-Tawn model, which is more flexible than the HT model and in some theoretical examples improves the convergence rates to limit (2), see Lugrin, Davison, and Tawn (2021). Previously, we assumed that relation (3) holds for some finite level u. However, it is possible that for u < ∞, inference can be improved if we adjust the model form. To that end, we redefine Z_x from Equation (1) to

Z_x := (Y − αx − γ(x))/x^β | {X = x}.     (4)

By taking γ : R → R to be a function such that γ(x) = o(x^β) for β ≥ 0 as x → ∞, this is equivalent to the HT model in the limit as x → ∞. For reasons of parsimony, we take γ(x) = γ ∈ R, and so introduce an intercept in the mean component of the HT regression model, which is identifiable if β > 0.

Extreme Value Distribution With Asymmetric Logistic Dependence Structure
We now give an example of a simple bivariate distribution for which the HT model is inadequate, in that its limiting residual distribution places mass at {−∞}, that is, lim_{z→−∞} G(z) > 0. Here, different choices of α and β also lead to a nondegenerate G(z), but with lim_{z→∞} G(z) < 1. The findings from this example motivate our developments in Section 2.3. Let (X_A, Y_A) denote a random variable on Laplace margins following a bivariate extreme value copula with an asymmetric logistic dependence structure (Tawn 1988). This distribution has parameters θ_1, θ_2 ∈ (0, 1) and η ∈ (0, 1), and distribution function

F(x, y) = exp{ −(1 − θ_1) t_x − (1 − θ_2) t_y − [(θ_1 t_x)^{1/η} + (θ_2 t_y)^{1/η}]^η },     (5)

where t_x := −log(1 − e^{−x}/2) for x ≥ 0 and t_x := log 2 − x for x < 0, with t_y similarly defined. In Figure 2 (left), data simulated from this distribution, conditional on X_A > 2 with θ_1 = θ_2 = η = 0.5, show two arms centered on y = x and y = 0 for different conditional quantiles. The heterogeneity in the two arms results in two possibly nondegenerate HT limits, each of which fully captures the behavior around one arm whilst treating the behavior around the other as degenerate. The first limit models the upper arm and has (α, β) = (1, 0), with nondegenerate G(z) that puts weight θ_1 on {−∞}. The second limit corresponds to the lower arm, with (α, β) = (0, 0) and G(z) = θ_1 e^{−t_z} for z ∈ R, which puts weight 1 − θ_1 on {+∞}. Applying the HT model to this joint distribution directly results in poor extrapolations, because at any finite level the normalized random variable Z cannot capture the two different growth rates. Figure 2 (right) illustrates simulated data and implied conditional quantiles from the model we develop in Section 3 when fitted to the data shown in Figure 2 (left). The implied conditional quantiles capture the true behavior of the top and bottom quantiles well (matching the true gradients of 1 and 0, respectively) but with a less clear distinction between the two mixture components than for the true process. This analysis may be improved by picking a different threshold.
Figure 2. (Left) Data simulated from distribution (5), conditional on X_A > 2 with θ_1 = θ_2 = η = 0.5, including true conditional quantile functions; (Right) data simulated from the inferred model (7), fitted to the data in the left plot, with an identical sample size, including the implied conditional quantile functions.
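To make the two-arm structure of this example concrete, the following crude sketch simulates data with the same qualitative geometry conditional on the threshold exceedance: an upper arm with gradient 1 and a lower arm with gradient 0, mixed with probabilities 1 − θ_1 and θ_1. It is not an exact sampler for the asymmetric logistic distribution (5); all names and parameter choices are illustrative.

```python
import numpy as np

def simulate_two_arms(n, theta1=0.5, u=2.0, seed=0):
    """Crude sketch of the two-arm structure seen conditional on X_A > u:
    with probability 1 - theta1 the pair lies on the y ~ x arm
    (asymptotic dependence, alpha = 1); with probability theta1,
    Y stays O(1) while X grows (the y ~ 0 arm, alpha = 0)."""
    rng = np.random.default_rng(seed)
    x = u + rng.exponential(size=n)     # exceedances of the threshold
    upper = rng.random(n) > theta1      # latent arm membership
    z = rng.normal(scale=0.5, size=n)   # residual scatter around each arm
    y = np.where(upper, x + z, z)       # gradients 1 and 0, respectively
    return x, y, upper

x, y, upper = simulate_two_arms(5000)
```

Plotting y against x for such a sample reproduces the two arms of Figure 2 (left), and any single (α, β) fit must treat one of the two growth rates as degenerate.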

Heffernan-Tawn Mixture Model
The Heffernan-Tawn model is applicable to the example from Section 2.2, for which lim_{z→−∞} G(z) > 0. However, no model is imposed for the part of the distribution that corresponds to the mass of G at {−∞}. This may not be a problem if our interest lies in characterizing combinations of maximum X with associated maximum Y, but it might be that the other part of the distribution also corresponds to extreme scenarios. For example, in the case of H_S versus T_2, there might be wave periods corresponding to one or more marine structural resonance frequencies which are of greater interest than the maximum wave period. Motivated by these considerations, we introduce the Heffernan-Tawn mixture (HTM) model to better characterize distributions such as the one discussed in Section 2.2, conditional on one of the variables being large. We allow for multiple parameter combinations such that all possible nondegenerate residual distributions are captured simultaneously.
Let K ≥ 1 be an integer and let (X, Y) have standard Laplace margins such that the joint distribution is a mixture of K copulas. We assume that for x > u, where u is large,

F(x, y) = ∑_{k=1}^K p_k F_k(x, y),     (6)

where F is the cumulative distribution function of (X, Y) and, for all k, F_k is a distribution function on Laplace margins, p_k ∈ (0, 1], and ∑_{k=1}^K p_k = 1. We also assume that each F_k has a copula formulation such that the associated limit (2) holds for a single pair (α_k, β_k), with residual distribution G_k placing no mass at {−∞, +∞}. Thus, the asymmetric logistic copula (5) cannot be an F_k. As we assume the distributional form (6) only for x > u, this condition holds for standard mixture distributions, but can also apply for more complex models which, when in an extremal state (i.e., X > u), approximate a mixture form.
Central to the HTM model is the assumption that the HT model, with the intercept extension proposed in Equation (4), is applicable to each F_k, with parameters γ_k ∈ R, α_k ∈ [−1, 1], 0 ≤ β_k < 1, μ_k ∈ R and σ_k > 0, for all k. Using the notation of Equation (3), we define the K-component HTM model as follows: for large u,

Y | {X = x} =D γ_k + α_k x + x^{β_k} (μ_k + σ_k Z̃_k) with probability p_k, for x > u,     (7)

where Z̃_k, which only exists with probability p_k, follows a nondegenerate distribution for each k. Moreover, we assert that Z̃_k is independent of both X and Z̃_k̃ for k̃ ≠ k. We further impose that E[Z̃_k] = 0 and var(Z̃_k) = 1, which implies that the distribution function of Z_k := μ_k + σ_k Z̃_k does not put mass on {±∞}. For identifiability reasons, we require the pairs (α_k, β_k) to be distinct across components. The Heffernan-Tawn model is a special case of this model with K = 1. Distribution (5) corresponds to the case K = 2 with (α_1, β_1, p_1, α_2, β_2, p_2) = (1, 0, 1 − θ_1, 0, 0, θ_1). When K = 2, other classes of distributions that fall into this mixture formulation with (α_1, β_1, α_2, β_2) = (1, 0, 0, 0) are bivariate max-stable distributions with some mass of the spectral measure on the boundaries of the simplex, and max-linear models (De Haan and Ferreira 2006, chap. 6) where different innovation variables control each marginal variable.
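A short sketch of how one might simulate from the K-component form (7), using a Gaussian stand-in for the residuals Z̃_k. The function `simulate_htm` and the exponential stand-in for the Laplace tail above u are our own illustrative assumptions.

```python
import numpy as np

def simulate_htm(n, params, p, u=2.0, seed=0):
    """Simulate (X, Y) with X > u from the K-component HTM form (7):
    component k is chosen with probability p_k, and then
    Y = gamma_k + alpha_k * X + X**beta_k * (mu_k + sigma_k * Z),
    with Z standard Gaussian standing in for the residual Z~_k.
    `params` is a list of tuples (gamma_k, alpha_k, beta_k, mu_k, sigma_k)."""
    rng = np.random.default_rng(seed)
    x = u + rng.exponential(size=n)       # Laplace tail above u (stand-in)
    k = rng.choice(len(p), size=n, p=p)   # latent component labels
    gam, alp, bet, mu, sig = np.array(params)[k].T
    y = gam + alp * x + x**bet * (mu + sig * rng.normal(size=n))
    return x, y, k

# K = 2 version mimicking distribution (5): (alpha, beta) = (1, 0) and (0, 0)
params = [(0.0, 1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0, 0.5)]
x, y, k = simulate_htm(2000, params, p=[0.5, 0.5])
```

The two components reproduce the upper (y ≈ x) and lower (y ≈ 0) arms of the motivating example.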
Let (X, Y) be a bivariate random variable following model (7) and define Y_k, k = 1, . . . , K, as the random variable representing the kth mixture component of the model. The mixture components separate completely in the limit, so that as u → ∞ each conditional quantile of Y | {X = x}, x > u, is determined by a single component, and the conditional quantile function takes the form

q_τ(x) = γ(τ) + α(τ) x + x^{β(τ)} ζ_τ,     (8)

where the parameter functions γ(τ), α(τ) and β(τ) are piecewise constant in τ, taking the values (γ_k, α_k, β_k) of the component active at level τ. The parameter function ζ_τ is an increasing function of τ within each interval of τ on which γ(τ), α(τ) and β(τ) remain constant.
To consider multiple conditional quantiles jointly, let m ∈ N and 0 < τ_1 < · · · < τ_m < 1 be m non-exceedance probabilities. We assume that the vector conditional quantile function (q_{τ_1}(x), . . . , q_{τ_m}(x)) belongs to the parametric class of functions {(q(x | ω_{τ_1}), . . . , q(x | ω_{τ_m})) : (ω_{τ_1}, . . . , ω_{τ_m}) ∈ Ω_m}, where Ω_m is the joint parameter space, not necessarily equal to the Cartesian product of the marginal parameter spaces. In particular, we consider models with ω_τ = (ϕ, ζ_τ), where ϕ is common across ω_τ for all τ and ζ_τ is specific to a particular τ. As an illustration, consider model (7) with K = 1; then ϕ = (γ_1, α_1, β_1) and ζ_τ = F_{Z_1}^{−1}(τ), where F_{Z_1} is the distribution function of the residuals Z_1.

Inference for the Heffernan-Tawn Mixture Model
We focus our discussion on fitting the HTM model given a known number of mixture components K. Inference for K and accounting for its uncertainty are generic issues in mixture modeling, and a reversible jump MCMC approach could be applied, see Richardson and Green (1997). Instead, we adopt a simple pseudo-Bayesian inference approach. As the focus of this article is on the extremal dependence structure, we first estimate the marginal distributions and ignore their inference uncertainty subsequently. We describe a heuristic to select K, and also examine the sensitivity of inferences to this selection. We then proceed to make inferences using standard Bayesian methods. Consequently, unlike in reversible jump MCMC mixture models, we do not account for uncertainty in K in subsequent inference. Details of outline code for inference for model parameters and for the probabilities of extreme events are given in the supplementary material.
For ease of notation, we define θ^{(k)} = (γ_k, α_k, β_k, μ_k, σ_k, p_k) to be the vector of parameters of component 1 ≤ k ≤ K and θ = (θ^{(1)}, . . . , θ^{(K)}) the vector containing all parameters of the model. Henceforth, we assume that we have data {(x_i, y_i) : x_i > u} of size n generated by model (7) for some K ≪ n. Moreover, we assume that α_i < α_j for i < j.
In mixture modeling, it is common to introduce a latent random variable J ∈ {1, 2, . . . , K} denoting the mixture component associated with the random pair (X, Y). In performing Bayesian inference, we need to calculate the likelihood of the parameters given the data. So, we need to make a computationally convenient assumption on the HTM residual distributions Z_k, k = 1, . . . , K. We follow the same arguments as Heffernan and Tawn (2004), and assume for parameter estimation purposes that all residual distributions are Gaussian, that is, Z̃_k ∼ N(0, 1) for all k = 1, . . . , K. This technique, based on a (false) working assumption, is equivalent to estimating-equation methods, which are known to yield asymptotically consistent estimators (Zeger and Liang 1986). We stress that the Gaussian assumption is discarded once θ is estimated. So, we get for k = 1, . . . , K,

P(J = k | (x_i, y_i), θ) = p_k f_k(y_i | x_i, θ^{(k)}) / ∑_{k̃=1}^K p_k̃ f_k̃(y_i | x_i, θ^{(k̃)}),     (10)

where f_k(· | x, θ^{(k)}) denotes the density of the Gaussian working model for component k, with mean γ_k + α_k x + μ_k x^{β_k} and standard deviation σ_k x^{β_k}. We can now simulate an index J = j_i for each observation (x_i, y_i) using Equation (10), and calculate the augmented log-likelihood l^{(a)} = ∑_{k=1}^K l_k, where the log-likelihood l_k for data allocated to mixture component k is given, up to an additive constant, by

l_k(θ^{(k)}) = −∑_{i : j_i = k} [ log(σ_k x_i^{β_k}) + (y_i − γ_k − α_k x_i − μ_k x_i^{β_k})^2 / (2 σ_k^2 x_i^{2β_k}) ],

if σ_k > 0 and (α_k, β_k) satisfy the Keef constraints (Keef, Papastathopoulos, and Tawn 2013), and otherwise l_k = −∞. Since we now have an expression for the likelihood, we can infer the parameters of the model. In particular, we use an adaptive MCMC algorithm similar to that of Roberts and Rosenthal (2009). In this algorithm, we do not estimate the mixture probability parameters p_k, since estimates can be inferred from the estimates of the remaining parameters. A uniform prior over the whole parameter space is used to illustrate performance with no expert knowledge. Other prior choices may be more appropriate for some applications, see Section 3.2.
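The component-membership probabilities under the Gaussian working assumption can be sketched as follows. The name `component_probs` is illustrative, and the density expression reflects the Gaussian working model described above, not verbatim code from the paper.

```python
import numpy as np

def component_probs(x, y, theta, p):
    """P(J = k | x_i, y_i) under the Gaussian working assumption:
    proportional to p_k times the N(gamma_k + alpha_k*x + mu_k*x^beta_k,
    (sigma_k * x^beta_k)^2) density of y. `theta` has rows
    (gamma_k, alpha_k, beta_k, mu_k, sigma_k)."""
    dens = []
    for gam, alp, bet, mu, sig in theta:
        mean = gam + alp * x + mu * x**bet
        sd = sig * x**bet
        dens.append(np.exp(-0.5 * ((y - mean) / sd)**2) / (np.sqrt(2 * np.pi) * sd))
    w = np.array(p)[:, None] * np.array(dens)
    return w / w.sum(axis=0)    # normalize so each column sums to one

theta = np.array([[0.0, 1.0, 0.0, 0.0, 0.5],    # upper arm: alpha = 1
                  [0.0, 0.0, 0.0, 0.0, 0.5]])   # lower arm: alpha = 0
x = np.array([3.0, 3.0])
y = np.array([3.0, 0.0])                        # one point on each arm
w = component_probs(x, y, theta, p=[0.5, 0.5])
```

Sampling the latent labels j_i from these probabilities gives the allocations used in the augmented log-likelihood.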
For each draw θ̂ from the MCMC chain, we simulate {j_i}_{i=1}^n, each with probability (10), and define the residuals of the kth component as ẑ_{ki} := (y_i − γ̂_k − α̂_k x_i)/x_i^{β̂_k} for the observations with j_i = k. The distribution function H_k(z), defined in Section 2.3, is now estimated using the empirical distribution function of {ẑ_{ki} : 1 ≤ i ≤ n with j_i = k} and p̂_k := n_k/n, where n_k denotes the number of observations allocated to component k. The supplementary material details the estimation of probabilities of extreme sets for this model.
Selecting an optimal value of K is challenging. In the simulation studies discussed in the supplementary material, rather than attempt to fix K, we explore the sensitivity of our inference to the value of K. If an estimate of K were required, we suggest the following two-step heuristic: (i) fit the model as described above with K = 1; (ii) take the number of modes of the residual distribution, conditional on X > v for some v > u, as the heuristic estimator of K. We use v > u since the mixture components might not yet have separated at u. Simulations show that when the α_i are relatively distinct, this technique is reasonable.
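The mode-counting step of this heuristic can be sketched with a kernel density estimate. Counting local maxima of a KDE is one simple implementation choice among several; all names here are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def count_modes(z, grid_pts=512, bw=None):
    """Heuristic estimate of K: the number of local maxima of a kernel
    density estimate of the K = 1 fit's residuals, conditional on X > v."""
    kde = gaussian_kde(z, bw_method=bw)
    grid = np.linspace(z.min(), z.max(), grid_pts)
    d = kde(grid)
    # count interior grid points larger than both neighbours
    return int(np.sum((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])))

rng = np.random.default_rng(3)
# residuals from two well-separated components give two modes
z_bimodal = np.concatenate([rng.normal(-3, 0.5, 1000),
                            rng.normal(3, 0.5, 1000)])
```

The bandwidth matters in practice: undersmoothing inflates the mode count, so some sensitivity checking over `bw` is advisable.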

Incorporating Prior Probability on Asymptotic Dependence
Placing a uniform prior over the parameter space (Section 3.1) implies that the posterior does not put positive weight on the class of models corresponding to asymptotic dependence (i.e., α K = 1). This is because the subset of the parameter space which corresponds to asymptotic dependence is a null set with respect to the uniform prior. Hence, this procedure consistently underestimates the risk of extreme events occurring together, see the simulation study in the supplementary material and discussion in Coles and Pauli (2002).
Here, we show how to sample from the posterior distribution of the parameters of the Heffernan-Tawn mixture model using a prior which puts mass on both asymptotic dependence and asymptotic independence. We will not discuss how to calculate the likelihood given a set of parameters as this is similar to before. Instead, we focus on making good MCMC proposals and calculating the MCMC acceptance probability when the prior puts a positive mass on the event {α K = 1}. For brevity, we consider model (7) with K = 1.
We define the priors of the parameters γ_1, β_1, μ_1 and σ_1 to be (improper) uniform distributions on the parameter space with independent components. Let δ be the Dirac delta function; then the density function of the prior on α_1 is

π(α_1) = ω δ(α_1 − 1) + (1 − ω)/2 · 1{−1 < α_1 < 1},

that is, the prior is a mixture which puts weight ω ∈ [0, 1] on {α_1 = 1} and weight (1 − ω) on a uniform(−1, 1) distribution, similar to that in Coles and Pauli (2002). We choose a Metropolis-Hastings Gaussian random-walk update for the parameters γ_1, β_1, μ_1 and σ_1 with some fixed standard deviation h. We specify the proposal for α_1 at iteration t with current value α_1^{(t)} to be min{1, α_1^{(t)} + hZ} for a standard Gaussian random variable Z, such that there is a positive probability of proposing the asymptotically dependent model. The proposal density g(· | α_1^{(t)}) is thus given by g(x | α_1^{(t)}) = φ((x − α_1^{(t)})/h)/h for x < 1, with mass g(1 | α_1^{(t)}) = 1 − Φ((1 − α_1^{(t)})/h) at x = 1, where φ and Φ are the density and distribution function, respectively, of a standard Gaussian. The Metropolis-Hastings acceptance ratio α_MH given a proposal θ^{(t+1)} can now be derived using standard techniques, see the supplementary material.
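The boundary-respecting proposal and its density/mass function can be sketched as follows; the Monte Carlo check at the end verifies that the frequency of proposals landing exactly at α = 1 matches the atom 1 − Φ((1 − α^{(t)})/h). Function names are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def propose_alpha(alpha_t, h, rng):
    """Random-walk proposal min{1, alpha_t + h*Z}: with positive probability
    the chain lands exactly on the asymptotically dependent boundary alpha = 1."""
    return min(1.0, alpha_t + h * rng.normal())

def proposal_density(x, alpha_t, h):
    """Density/mass of the proposal: a Gaussian density below 1 and an
    atom of size 1 - Phi((1 - alpha_t)/h) at x = 1."""
    if x < 1.0:
        return norm.pdf((x - alpha_t) / h) / h
    return 1.0 - norm.cdf((1.0 - alpha_t) / h)

rng = np.random.default_rng(5)
draws = np.array([propose_alpha(0.9, 0.2, rng) for _ in range(20000)])
atom = proposal_density(1.0, 0.9, 0.2)   # theoretical mass at alpha = 1
```

Because the proposal has an atom at 1, the Metropolis-Hastings ratio must compare densities against densities and masses against masses, which is what makes the acceptance probability well defined on the mixed prior.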

Inference for the Quantile-Regression Model
We consider two data-generating frameworks, namely models (3) and (7), and we show how to use quantile-regression techniques to infer the parameters without distributional assumptions. In Section 2.4, we parameterized the conditional quantile function q_τ via the parameter vector ω_τ. First, consider a single value of τ; then ω_τ may be inferred as

ω̂_τ := argmin_{ω_τ} ∑_{i=1}^n ρ_τ(y_i − q(x_i | ω_τ)),     (11)

where the check function ρ_τ : R → R is defined as ρ_τ(z) = z(τ − 1{z < 0}), see Koenker and Hallock (2001) and Koenker (2005). The check function is locally linear, hence the estimator is robust to outliers. Now let m ∈ N and 0 < τ_1 < · · · < τ_m < 1. We infer the conditional quantile functions q_{τ_j}, j = 1, . . . , m, jointly using

(ω̂_{τ_1}, . . . , ω̂_{τ_m}) := argmin ∑_{j=1}^m c_j ∑_{i=1}^n ρ_{τ_j}(y_i − q(x_i | ω_{τ_j})),

for weights c_j > 0 specified by the user, see Bondell, Reich, and Wang (2010). We apply the methodology to equidistant τ_j on the range [τ_1, τ_m] ⊂ (0, 1), for which it is natural to let c_j := 1 for all j = 1, . . . , m. If the τ_j are not equidistant, then the choice of weights c_j can be adjusted to reflect this. If the parameter space Ω_m were equal to the Cartesian product of the marginal parameter spaces, then the joint estimation procedure would be equivalent to applying Equation (11) separately for each quantile. We only consider models with ω_τ = (ϕ, ζ_τ), where ϕ is common across ω_τ for all τ and ζ_τ is specific to τ. Now that we have jointly estimated the conditional quantiles q_{τ_1}(x), . . . , q_{τ_m}(x) for fixed quantile levels τ_i, i = 1, . . . , m, we give an estimator for q_τ(x) for any τ ∈ (0, 1) \ {τ_1, . . . , τ_m}. To do this, we only need to estimate ζ_τ, since ϕ̂ is already available. We estimate ζ_τ by

ζ̂_τ := argmin_{ζ} ∑_{i=1}^n ρ_τ(y_i − q(x_i | (ϕ̂, ζ))),

and so q̂_τ(x) := q(x | (ϕ̂, ζ̂_τ)) is our estimator of q_τ(x).
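A minimal sketch of check-function estimation for a single τ, using a generic optimizer in place of a dedicated quantile-regression routine. The linear toy quantile model and the name `fit_quantile` are illustrative assumptions, not the parametric class used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def rho(z, tau):
    """Check function rho_tau(z) = z * (tau - 1{z < 0})."""
    return z * (tau - (z < 0))

def fit_quantile(x, y, tau, q):
    """Estimate omega_tau = argmin sum_i rho_tau(y_i - q(x_i | omega)) for a
    user-supplied parametric conditional quantile q; Nelder-Mead is a
    generic sketch, not an optimized quantile-regression solver."""
    loss = lambda w: np.sum(rho(y - q(x, w), tau))
    return minimize(loss, x0=np.zeros(2), method="Nelder-Mead").x

# toy model: q(x | (a, b)) = a*x + b with symmetric noise about the median
rng = np.random.default_rng(6)
x = rng.uniform(1, 5, 3000)
y = 0.8 * x + rng.normal(size=x.size)
a_hat, b_hat = fit_quantile(x, y, tau=0.5, q=lambda x, w: w[0] * x + w[1])
```

For τ = 0.5 this is median regression, so the fitted slope should be close to the true value 0.8. The joint multi-τ estimator replaces the single loss by the weighted sum over τ_1, . . . , τ_m with the shared component ϕ tied across levels.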
The above framework requires adjustment for inferring mixture model (7) when the mixture probabilities are unknown. We discuss this for K = 2, but the approach generalizes to arbitrary K. For τ_1 < · · · < τ_m, we define m_0 as the index for which τ_{m_0} < p_1 < τ_{m_0+1}, where p_1 is the mixture probability corresponding to the first component, and separate our presentation into the cases where m_0 is assumed known and where it is unknown.
In fitting the model, we have chosen not to impose the condition that the inferred conditional quantile functions are increasing in τ for each x, for the following reasons. Simulation studies showed that conditional quantile functions usually do not cross if there is little to no overlap between the different components. When they do cross, the intersection is near the threshold u and not at large values. Hence, if noncrossing is of interest, then this can be achieved by increasing the threshold sufficiently. The tradeoff is that less data will be used for inference and the variance of the estimates will increase. Moreover, if there exists a significant amount of overlap, then requiring quantile functions not to cross whilst also requiring that they take the form in Equation (8) can result in arbitrarily large biases.

A Subasymptotic Conditional Mixture Model
The extension of the HT model developed in Section 2.3 assumes that the mixture probabilities of the conditional distribution Y|X are constant with respect to X when X takes an extreme value. This model adds significant flexibility over the HT model; however, it does not include the case where the mixture probabilities p_k(x), k = 1, . . . , K, vary with the level x. In particular, if there exists a component k with p_k(x) → 0 as x → ∞, then for model (7) to fit well, the threshold u needs to be raised until p_k(x) ≈ 0 for x > u. This is potentially very inefficient, given the subsequent loss of data available to fit the model. So, here we extend model (7) to a subasymptotic version of the Heffernan-Tawn mixture model, in which the mixture probabilities change with the level of extremity of X.
We are motivated by the following theoretical example. Let λ > 1 and 0 < t < 1, and construct a pair (X_L, Y_L) as a two-component mixture, where exp(λ) denotes the exponential distribution with rate λ, F_L^{−1} is the inverse cumulative distribution function of a standard Laplace, and F_X and F_Y are the distribution functions of X and Y.
Calculations in the supplementary material show that, for large X_L, the second component of this example corresponds to α_2 = λ > 1 in model (7), but its mixture probability decays with the level x > 0. Thus, we cannot model the distribution of (X_L, Y_L) well within the framework of model (7). We therefore extend model (7) by allowing p_k to be a function of X, that is,

Y | {X = x} =D γ_k + α_k x + x^{β_k} (μ_k + σ_k Z̃_k) with probability p_k(x), for x > u,     (16)

where α_1 < · · · < α_K. In contrast with the Heffernan-Tawn mixture model, there may exist 2 ≤ k_0 ≤ K such that α_{k_0 − 1} < 1 < α_{k_0}, as illustrated in the motivating example, where k_0 = K = 2. This model is equivalent to Equation (7) if p_k(x) is constant for all x > u and k = 1, . . . , K, but differs for generic p_k(x). With this new model form, we capture nonconstant mixture probabilities. It is not trivial to determine the valid parameter space of this new model; however, Theorem 1 provides some insight.
Theorem 1. To ensure that model (16) for Y|(X > u) does not violate the properties of the marginal distribution of Y, a constraint linking the decay of p_k(x) to the growth rate α_k is required for each k = 1, . . . , K. The proof can be found in Appendix A. If K = 2 and p_2(x) = e^{−(c−1)x} for c ≥ 1, then this result implies that a necessary condition is α_2 ≤ c. So α_2 can be larger than 1, as long as it is at most c. More generally, we consider a class of parametric functions p_k(x) with parameters a_k > 0, b_k ∈ R and c_k ≥ 0 for all k = 1, . . . , K.
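The example decay function p_2(x) = e^{−(c−1)x} can be sketched directly; this snippet only illustrates that the weight of the α_2 > 1 component vanishes as the level grows whenever c > 1, which is what makes the constant-probability model (7) inefficient here.

```python
import numpy as np

def p2(x, c):
    """Level-dependent weight p_2(x) = exp(-(c - 1) * x) of the component
    with alpha_2 > 1 in the subasymptotic model (16); for c > 1 it decays
    to zero as x -> infinity, so the component vanishes in the limit."""
    return np.exp(-(c - 1.0) * x)

# the weight is monotonically decreasing over any range of levels when c > 1
x = np.linspace(2.0, 10.0, 5)
weights = p2(x, c=1.5)
```

At a finite threshold u, however, p_2(u) can still be appreciable, which is precisely the subasymptotic regime the model targets.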

Oceanographic Data Analysis
We investigate the oceanographic variables T_2 and H_S from the NORA10 hindcast dataset of Reistad et al. (2011). These variables are 3-hourly summary statistics that characterize the ocean environment; T_2 is the wave period and H_S the significant wave height. We apply the methods introduced above to data from a site in the northern North Sea; see, for example, Figure 1 of Konzen, Neves, and Jonathan (2021). This site is scientifically interesting as it displays seasonal and directional variability in the ocean environment.
To avoid issues with temporal and directional dependence in the observed data, we preprocess the data by combining an established peak-picking method of Randell et al. (2015) with the retention of observations associated with storms originating from the Atlantic Ocean; see the supplementary material. This method identifies a subset of storm-peak observations of H_S, and associated values of T_2, that are approximately temporally independent. The main idea behind this method is the underlying assumption that consecutive storms are independent events. Through this preprocessing, we reduce 176,765 observations recorded over the period 1957–2018 to 1597, and we denote the observations corresponding to significant wave height and wave period by H_S,peak and T_2,ass, respectively. Figure 1 shows a scatterplot of the original and storm-peak samples. It shows that, conditional on T_2, H_S takes either relatively small or relatively large values, whereas intermediate values are rare, suggesting a mixture model with at least two components. We compare four mixture models applied to these data: the HT(K) and QR(K) models with K = 1 and K = 2, where QR(K) and HT(K) denote fits of the quantile-regression model and the Heffernan–Tawn mixture model, respectively, each with a fixed number K of components.
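The storm-peak idea can be sketched as follows. This is a minimal stand-in for, not a reimplementation of, the Randell et al. (2015) method; the `gap` parameter separating storms is a hypothetical simplification of the assumption that consecutive storms are independent.

```python
import numpy as np

# Minimal storm-peak sketch: threshold the H_S series, merge consecutive
# exceedances separated by fewer than `gap` time steps into one storm, and
# return the index of each storm's peak. Concomitant T_2 values can then be
# read off at the returned indices.
def storm_peaks(hs, thresh, gap=8):
    exc = np.flatnonzero(hs > thresh)
    if exc.size == 0:
        return np.array([], dtype=int)
    breaks = np.flatnonzero(np.diff(exc) > gap)   # large gaps start new storms
    starts = np.r_[0, breaks + 1]
    ends = np.r_[breaks, exc.size - 1]
    return np.array([exc[s:e + 1][np.argmax(hs[exc[s:e + 1]])]
                     for s, e in zip(starts, ends)])
```

Given a wave-period series `t2` aligned with `hs`, the paired storm-peak sample is then `hs[idx], t2[idx]` for `idx = storm_peaks(hs, thresh)`.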
We also consider two intuitive "partitioning" methods that accommodate the mixture structure by partitioning the data into swell waves and wind-generated waves. We allocate observations to either component using wave steepness, a quantity proportional to H_S,peak / T_2,ass^2. This approach is physically well motivated and often adopted by ocean-engineering practitioners. Observations with steepness below a threshold value are allocated to the swell component, and the remainder to the wind-generated component. For our data, the marginal distribution of steepness is bimodal, with the lower mode corresponding to swell and the upper to wind-generated waves. It is therefore natural to choose the threshold as the value that lies between the modes and minimizes the estimated density. After partitioning, we fit single-component models to the swell and wind-wave subsamples using the quantile-regression and Heffernan–Tawn models; we denote these models Part-QR(1) and Part-HT(1), respectively. Finally, the joint distribution is given by a mixture of the two inferred single-component distributions, with mixture weights determined empirically. Here, the context of the data allows us to undertake this method of partitioning; in general, the method is application dependent and requires expert knowledge.
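The steepness split can be sketched as below. This is an illustrative simplification: a histogram stands in for a proper density estimate, and the sample mean is assumed to fall between the two modes, neither of which is claimed by the article.

```python
import numpy as np

# Sketch of the steepness-based partition: compute a quantity proportional
# to H_S,peak / T_2,ass^2 and split at the density minimum between the two
# modes of its (assumed bimodal) distribution.
def steepness_split(hs_peak, t2_ass, bins=50):
    s = hs_peak / t2_ass**2                 # proportional to wave steepness
    counts, edges = np.histogram(s, bins=bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    m = np.searchsorted(mids, s.mean())     # crude separator between modes
    lo = np.argmax(counts[:m])              # lower (swell) mode
    hi = m + np.argmax(counts[m:])          # upper (wind-generated) mode
    split = mids[lo + np.argmin(counts[lo:hi + 1])]
    return split, s < split                 # threshold and swell indicator
```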
Since the models introduced in this article assume data on standard Laplace margins, we transform the data to the standard Laplace scale using standard extreme value techniques (Coles and Tawn 1994). Specifically, we proceed as follows: (i) select a marginal threshold; (ii) below the threshold, transform using the empirical probability integral transform; and (iii) above the threshold, use the generalized Pareto distribution. Here, we take the 70% quantiles of the marginal distributions of T_2,ass and H_S,peak, although the fitted generalized Pareto distribution was found not to be overly sensitive to the choice of threshold. Both estimated generalized Pareto shape parameters were negative, implying finite upper endpoints of the marginal distributions of T_2,ass (95% confidence interval [15.4 s, 21.3 s]) and H_S,peak (95% confidence interval [16.8 m, 24.4 m]) for our data. We also take the threshold u for the HT model as the 80% quantile of the standard Laplace distribution; inferences were found to be relatively insensitive to this choice. As in the simulation studies, we set γ_k = 0 for all k = 1, …, K in the HT mixture model. For trace plots of the model fits, we refer to the supplementary material.
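Steps (i)–(iii) can be sketched as a single transform. This is a minimal sketch: the generalized Pareto scale and shape are taken as given (in practice they are estimated from the threshold exceedances), and the empirical part uses a simple rank-based probability integral transform.

```python
import numpy as np

# Semiparametric transform to standard Laplace margins: empirical CDF below
# the marginal threshold, a generalized Pareto tail above it, then the
# standard Laplace quantile function.
def to_laplace(z, thresh_q=0.70, scale=1.0, shape=-0.1):
    u = np.quantile(z, thresh_q)
    n = z.size
    F = np.empty(n)
    below = z <= u
    ranks = np.searchsorted(np.sort(z), z[below], side="right")
    F[below] = ranks / (n + 1)                            # empirical PIT
    exc = (z[~below] - u) / scale
    F[~below] = 1 - (1 - thresh_q) * (1 + shape * exc) ** (-1 / shape)
    # standard Laplace inverse CDF
    return np.where(F < 0.5, np.log(2 * F), -np.log(2 * (1 - F)))
```

The transform is monotone by construction, so the ordering of the data is preserved on the Laplace scale.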
We fix the originally fitted marginal models to avoid introducing further marginal uncertainty into inferences on the dependence structure. The uncertainty within the procedures is estimated via the following semiparametric bootstrap. We simulate a dataset of the same size as the original data using the inferred HT(2) model; see the supplementary material for details. Next, we fit the HT(K) and QR(K) models for different values of K and estimate the distribution of the response variable by simulating a large number of observations from the inferred models. Then, we transform the generated sample back to the original margins using the original, fixed marginal models.
One of the assumptions of the HTM model is that the mixture probability is constant as a function of the conditioning random variable, although Section 4 shows this need not be the case. We therefore design a goodness-of-fit diagnostic, see Figure 3, in which we plot estimates of the mixture probability p_2 as a function of x. Both estimators in this plot are calculated using a sliding-window approach with a fixed number of 30 observations per window. The estimator p̂_2(x) is defined as the average of the allocation probabilities, see Equation (10), over the corresponding bin. Additionally, we plot Part-p̂_2(x), estimated under the partitioning method, for which allocation to groups is deterministic. Finally, we obtain confidence bounds for both estimators using a semiparametric bootstrap. From Figure 3, we argue that both estimators provide enough evidence to assume that the true mixture probability is constant as a function of x for x > u. Moreover, there is no evidence to suggest that lim_{x→∞} p_2(x) = 0. Hence, the HTM model should be a reasonable approach to modeling the dependence structure of significant wave height and wave period, and there is no need to use the subasymptotic extension of the HTM model.
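The sliding-window estimator can be sketched as follows. `alloc_p2` is a hypothetical array of per-observation allocation probabilities (as in Equation (10)) from a fitted model; non-overlapping windows are used here for simplicity, which may differ from the article's exact windowing.

```python
import numpy as np

# Sliding-window diagnostic for p_2(x): order the exceedances, then average
# the per-observation allocation probabilities over consecutive windows of a
# fixed number of observations, returning window centers and estimates.
def sliding_p2(x, alloc_p2, window=30):
    order = np.argsort(x)
    xs, ps = x[order], alloc_p2[order]
    centers, est = [], []
    for i in range(0, xs.size - window + 1, window):
        centers.append(xs[i:i + window].mean())
        est.append(ps[i:i + window].mean())
    return np.array(centers), np.array(est)
```

A roughly flat estimate across window centers supports the constant-probability assumption, as argued from Figure 3.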
We compare the different model fits via the inferred joint distribution functions and the distribution of a synthetic structure response variable R, illustrative of the response of floating offshore structures to wave loading, considered by Ross et al. (2020), where a, b, and c are structure-specific parameters, with c a resonant frequency. We take one response variable, parameterized using Equation (18) with (a, b, c) = (2, 1, 16).
This choice corresponds to structural responses with resonant frequencies in the near and far tail of the distribution of wave period.
Results are summarized in Figures 4 and 5, and parameter estimates are given in Tables 1 and 2. In Figure 4, we plot the inferred joint probabilities, see the supplementary material for details, for each rectangle on a discrete grid covering the (T_2,ass, H_S,peak) domain [12, 16.5] × [3.1, 15.1]. The partitioning methods and the QR(2) and HT(2) models generate similar estimates, showing two distinct arms in the dependence structure with increasing T_2,ass. In contrast, the HT(1) and QR(1) models do not capture the mixture structure as well. However, closer inspection of the HT(1) and QR(1) estimates also provides evidence for two arms in the dependence structure: the corresponding residual distributions are bimodal, itself suggesting the need for a two-component mixture model, that is, K ≥ 2; see the end of Section 3.1. We plot the results of the analysis for R in Figure 5. The top-left panel gives estimated return levels using the six inference methods; the remaining panels show the semiparametric bootstrap uncertainty of these estimates. We note that for return periods less than 1,000,000 years, the one-component models estimate a smaller return level of the response R than the two-component models, and a higher return level for longer return periods. Moreover, the one-component models tend to have wider confidence bounds. Finally, the gray curve (representing the distribution calculated using the data-generating HT(2) model) lies within the 95% confidence bounds of all of the methods.
From Tables 1 and 2, we note that the parameter estimates for the HT(K) and QR(K) models are similar across the different methods. The confidence intervals tend to be wider for the one-component models, which is explained by a combination of model misspecification and the Keef constraints. The model parameters for the Part-QR(1) and Part-HT(1) models differ markedly from those of the QR(2) and HT(2) models; this is expected, owing to the separate marginal transformations of the partitioned dataset. The parameter estimates for α and γ are strongly negatively correlated, so reparameterizations that induce orthogonality of these parameters may be helpful. We conclude that the two-component models should be used when applicable, since the confidence bounds on the return-level estimates are tighter. There seems to be little difference between QR(2) and HT(2), so either could be used. The estimates generated using the partitioning methods, especially Part-QR(1), have an even smaller variance than QR(2). However, the partitioning methods rely on considerable prior understanding of the underlying physical processes.

Supplementary Material
Supplementary material includes results of simulation studies, code and data, and other supporting material as referred to in the main text (.zip file).

Appendix A: Proof of Theorem 1
Let X, Y ∼ Laplace(1). We consider the following simplified version of model (16), with 0 ≤ α_k < α_{k′} for all 1 ≤ k < k′ ≤ K, and where p_k : R → [0, 1] are functions such that ∑_{k=1}^K p_k(x) = 1 for all x. We find a necessary condition on α_k, given p_k, such that the model formulation for Y does not contradict the assumption that Y is marginally distributed as a standard Laplace random variable. We need 2 · P(Y > y) = e^{−y} for all y > 0, as this implies that P(Y > y, X > u) ≤ (1/2) e^{−y} for all u ∈ R and y > 0. Using model (16), we simplify this probability by writing p_1(x) = 1 − ∑_{i=2}^K p_i(x). We note that if α_k ≤ 1 for all 1 ≤ k ≤ K, then by bounding p_k(x) ≤ 1 we get P(Y > y, X > u) ≤ e^{−y}/2. Moreover, if α_1 > 1, then we trivially have P(Y > y, X > u) ≥ e^{−y}/2. We assume from here onward that there is an index 1 ≤ j < K such that α_1 < α_2 < ⋯ < α_j ≤ 1 < α_{j+1} < ⋯ < α_K.
We are now in a position to prove the theorem by contradiction. To that end, assume that there exists an index 1 ≤ i_0 ≤ K for which the bound of Theorem 1 fails by a margin ε > 0. Note that this excludes i_0 ≤ j, since lim inf_{x→∞} (−log p_{i_0}(x))/x + 1 + ε > 1; hence, we deduce that i_0 > j. It follows that there exists an x′ such that A_{i_0}(x) > α_{i_0} for all x > x′. For y > max{α_K x′, α_K u}, Equation (A2) simplifies as follows:

Now, define
> (1/2) e^{−y/α_1} + ⋯, where in the last line we used that A_{i_0}(x) > α_{i_0} and that ⋃_{k=2}^{i_0} [y/α_k, y/α_{k−1}] = [y/α_{i_0}, y/α_1]. Since α_{i_0} > 1, there exists a y′ > max{α_K x′, α_K u} such that e^{−y/α_1} − e^{−α_{i_0} y/α_1} > 0 for all y > y′. Then for all y > y′ and u ∈ R, P(Y > y, X > u) > (1/2) e^{−y}. Thus, lim inf_{x→∞} A_{i_0}(x) > α_{i_0} contradicts the marginal distribution of Y. We conclude that for all i ≥ j + 1 we need lim inf_{x→∞} p_i(x) e^{(α_i − 1)x} ≤ α_i, which rearranged gives the bound of Theorem 1. A symmetric argument gives the same bound for −α_i, concluding the proof.