Do price trajectory data increase the efficiency of market impact estimation?

Market impact is an important problem faced by large institutional investors and active market participants. In this paper, we rigorously investigate whether price trajectory data from the metaorder increase the efficiency of estimation, from the viewpoint of the Fisher information, which is directly related to the asymptotic efficiency of statistical estimation. We show that, for popular market impact models, estimation methods based on partial price trajectory data, especially those containing early trade prices, can asymptotically outperform established estimation methods (e.g., VWAP-based). We discuss theoretical and empirical implications of this phenomenon, and how it can be readily incorporated into practice.


Introduction
Market impact is a crucial feature of the market microstructure faced by large traders, and it manifests itself through the adverse effect that the execution of an order generates on the price of the underlying asset. In other words, upon completion of the trade, aside from direct costs (i.e., commissions/fees), slippage from the effective bid-ask spread, or delay/timing risk, investors are also subject to the transaction cost generated by the price impact of their own actions [Robert et al., 2012]. For example, a trader who needs to liquidate a large number of shares will take liquidity from the Limit Order Book (LOB) and push the price down, resulting in an implementation shortfall (see [Perold, 1988]) or liquidation cost (see [Gatheral and Schied, 2013]), i.e., the difference between the realized revenue and the initial asset value. Besides the short-term correlation between price changes and trades, or the statistical effect of order flow fluctuations (see [Bouchaud, 2010]), one notable explanation for these market impact dynamics relates to the revelation of new, private information through trades, which dates back to the seminal work of [Kyle, 1985]. As shown in [Kyle, 1985], for an investor with private information, the optimal strategy to minimize execution cost or the revelation of information is to trade incrementally through time. In fact, in modern electronic markets, a decision to trade a large volume (i.e., the full order, also referred to as a metaorder) is typically implemented by a sequence of incremental executions of smaller orders (referred to as child orders) over a scheduled time window. As discussed in [Zarinelli et al., 2015, Curato et al., 2017], the optimal way to split metaorders depends on the cost criterion and the specific market impact model.
The functional form of the price impact, along with its corresponding parameters, critically characterizes these market impact models. First, a large fraction of the market microstructure literature is dedicated to expressing the optimal execution strategies under different risk criteria as a functional of the model parameters (analytically [Gatheral et al., 2012, Adrian et al., 2020] or implicitly [Dang, 2017, Curato et al., 2017]). Moreover, the functional forms/parameters also determine the viability/stability of a market impact model, as models with certain functional forms/parameters admit different types of price manipulation strategies (see [Huberman and Stanzl, 2004, Alfonsi et al., 2012, Gatheral and Schied, 2013]), which potentially lead to risky/unstable trading behavior and mathematically preclude the existence of unique optimal execution strategies [Gatheral et al., 2012]. Second, the estimation of model parameters has important implications for explaining and understanding various stylized facts/empirical findings, including the square-root impact law ([Grinold and Kahn, 2000, Bucci et al., 2019]), the logarithmic dependence in the market impact surface ([Zarinelli et al., 2015]), the power-law decay ([Bouchaud et al., 2003, Gatheral, 2010]), and so on. Finally, the design of trading strategies that minimize execution cost, as well as pre-trade analytic software that delivers a reliable pre-trade estimate of the expected trade cost, relies crucially on an accurate and efficient estimation of the model parameters. Indeed, following the prevalence of automated trading algorithms, both of the above have become standard considerations for active market participants, especially large institutional investors.
Despite its theoretical importance in the market microstructure literature and its practical importance among institutional traders, there are relatively few studies (both empirically and, to a greater extent, theoretically) on the parameter estimation problem for market impact models, and we aim to fill this gap in this paper. One main reason is the limited access to metaorder data for academics and practitioners, as these data are typically proprietary to brokerage firms or funds. Consequently, most empirical studies on market impact estimation are based on publicly available datasets, which collectively suffer from certain inherent drawbacks (e.g., limited information on trades being initiated by buyer vs. seller, an unknown number of metaorders, unknown trading duration [Almgren et al., 2005]) and can only provide a "partial view" of the market ([Zarinelli et al., 2015]). Another possible reason is the limited consensus on the appropriate model for price impact (e.g., linear vs. non-linear, permanent vs. temporary, transient vs. instantaneous, see [Bouchaud, 2010, Cont et al., 2014]). Notably, a few exceptions include ([Almgren et al., 2005, Moro et al., 2009, Zarinelli et al., 2015]), which conducted empirical investigations with large metaorder datasets, but the model fitting procedures typically relied only on some summary statistics of the execution. For example, in [Almgren et al., 2005], the authors tried to determine the exponent of the power-law functional form of the price impact, for which a nonlinear least squares regression was carried out with two statistics: the realized impact and the permanent impact (which we explore in detail later). In [Zarinelli et al., 2015], in order to fit the temporary market impact surface, the market impact measured at the moment the metaorder is completed was regressed on the metaorder duration and the metaorder participation rate. Moreover, it was also suggested in [Curato et al., 2017] that, as "one of the major attractions of the propagator model to practitioners", the historical execution data on the cost of VWAP executions1 (Volume Weighted Average Price; this cost can be seen as another summary statistic, similar to the realized impact in [Almgren et al., 2005]) can be easily used to estimate the functional forms/parameters of the instantaneous market impact function.
While these summary statistics contain important information about the price trajectory during the order (e.g., the cost of VWAP involves averaging the price along the trajectory), it seems uneconomical to simply discard a large part of the price trajectory data during model fitting, especially considering that, from an intuitive standpoint, besides the trade cost itself, the trajectory taken by the price to arrive at that cost (referred to as the "master curve" shape [Lillo et al., 2003]) could reveal extra information for model fitting (see also [Briere et al., 2020]). For example, it has been empirically observed that the market impact of metaorders is concave with respect to the order size [Zarinelli et al., 2015, Tóth et al., 2011], from which one might conjecture that the price movements around the early stages of the trade can be especially informative for predicting the total order cost or the market impact shape. Meanwhile, for the owner of metaorder data (i.e., asset managers or brokerage firms), compared with modeling approaches based on the LOB, modeling approaches based on the price dynamics would also be feasible, as the additional collection and storage of these extra price data during the life of the order should generally not come at a high cost. On the other hand, it is not unusual to see financial data intentionally discarded (due to structural noise or data corruption) for more accurate estimates. For example, it is common practice (see [Aït-Sahalia and Mykland, 2009, Zhang et al., 2005]) when estimating the volatility of an asset return process to throw away a large fraction of the high-frequency data as a way to avoid contamination by market microstructure noise (e.g., bid-ask spread). In particular, the realized volatility is typically computed as the sum of less frequently sampled squared returns, i.e., selected on time intervals much larger (e.g., 5 or 10 minutes) than the original ones on which data are collected (e.g., every couple of seconds or less), thus effectively discarding a substantial portion of the data. Although the bid-ask spread should not have a substantial effect on the market impact model, as trades within a metaorder mostly have the same sign (i.e., a large buy program usually does not contain many sell orders) and most market impact models in the literature do not include a bid-ask spread (see the detailed discussion in [Alfonsi and Schied, 2010] regarding a two-sided versus one-sided LOB model), it remains largely unclear/undiscussed whether the extra price trajectory information should benefit the estimation of market impact models.
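This sparse-sampling practice for realized volatility can be illustrated with a small simulation (a minimal sketch with synthetic data; the noise level and sampling grids are illustrative assumptions, not calibrated values):

```python
import numpy as np

rng = np.random.default_rng(0)

# One trading day of "efficient" log-prices observed every second (6.5 hours),
# contaminated by i.i.d. microstructure noise (a stylized bid-ask effect).
n = 23400                                # seconds in a 6.5-hour session
sigma2_daily = 0.02 ** 2                 # true integrated variance to recover
efficient = np.cumsum(rng.normal(0.0, np.sqrt(sigma2_daily / n), n))
observed = efficient + rng.normal(0.0, 5e-4, n)

def realized_variance(logp, step):
    """Sum of squared returns, keeping one observation every `step` ticks."""
    return float(np.sum(np.diff(logp[::step]) ** 2))

rv_dense = realized_variance(observed, 1)     # every second: noise-dominated
rv_sparse = realized_variance(observed, 300)  # every 5 minutes: ~99.7% of data discarded
print(rv_dense, rv_sparse, sigma2_daily)
```

Even though the sparse estimator discards almost all observations, the noise-induced bias (roughly twice the noise variance per sampled interval) shrinks with the number of intervals, so it lands far closer to the true integrated variance than the dense estimator.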
In this paper, using the Fisher information, which is directly related to the asymptotic efficiency of statistical estimation, we investigate whether additional price trajectory data increase the efficiency of market impact estimation. We compare the Fisher information (FI) matrices of different statistical experiments constructed from the same underlying stochastic process, and quantify the relative efficiency of their respective maximum likelihood estimators (MLE). The validity of this approach in assessing the optimality of experimental design is rooted in the asymptotic optimality of the MLE in regular parametric models among the class of regular estimators (see, e.g., [Van der Vaart, 2000] or the sections below). Specifically, among the popular existing market impact models, we compare different estimators by their respective Fisher information matrices and observe when one information matrix dominates another, as this implies an asymptotically smaller variance for estimating any quantity of interest under that parametric model, e.g., the impact of the metaorder or the cost of execution. To ensure the broad applicability of our findings, we separately investigate two broad categories of market impact models: the Almgren-Chriss model and the propagator models, which cover a large portion of the parametric models that are currently adopted or studied. Whether the price trajectory data could increase the efficiency of estimation is directly related to whether the current statistic is sufficient. Perhaps surprisingly (or even puzzlingly), we observe that, even when one does not have access to the full price trajectory data, it does not take many price points at all to achieve a more efficient estimation than well-established (also highly intuitive) methods, e.g., the VWAP-based estimation method. For example, we show that in the Almgren-Chriss model, even substituting the realized impact data (the terminology from [Almgren et al., 2005], equivalent to the cost of VWAP) with just two price points, one in a sufficiently early stage (within the first quarter of the trade duration) of the order and the other at the end of the order, yields a strictly more asymptotically efficient estimation. One possible, intuitive explanation could be the concavity of the market impact function (see, e.g., [Zarinelli et al., 2015]), where two carefully chosen points can leverage information more efficiently than VWAP. Similar results can also be observed for the family of propagator models, except that more price trajectory data are typically needed. We explore our findings in detail in Sections 3 and 4.
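The flavor of this comparison can be sketched with a direct Fisher information computation in a stylized version of the canonical model (assumptions: in-trade price P_t = θ1·t + θ2 + σW_t with unit volatility, a VWAP-cost statistic (1/T)∫₀ᵀ P_t dt whose Brownian average has variance σ²T/3, and illustrative sampling times; this is not the paper's Theorem 1, only a toy instance of the same dominance check):

```python
import numpy as np

def fisher_two_points(t1, t2, sigma=1.0):
    """FI of observing (P_t1, P_t2) under P_t = th1*t + th2 + sigma*W_t:
    Gaussian mean J @ (th1, th2) with Brownian covariance, FI = J' Sigma^-1 J."""
    J = np.array([[t1, 1.0], [t2, 1.0]])
    Sigma = sigma**2 * np.array([[t1, t1], [t1, t2]])
    return J.T @ np.linalg.inv(Sigma) @ J

def fisher_vwap(T, sigma=1.0):
    """FI of the scalar VWAP-cost statistic Y = (1/T) * int_0^T P_t dt:
    mean th1*T/2 + th2 and Var(Y) = sigma^2 * T / 3, so the FI is rank one."""
    grad = np.array([T / 2.0, 1.0])
    return np.outer(grad, grad) / (sigma**2 * T / 3.0)

I_vwap = fisher_vwap(T=1.0)
# Early first point (t1 = 0.2): the difference matrix is positive definite.
gap_early = np.linalg.eigvalsh(fisher_two_points(0.2, 1.0) - I_vwap)
# Late first point (t1 = 0.9): dominance fails (a negative eigenvalue appears).
gap_late = np.linalg.eigvalsh(fisher_two_points(0.9, 1.0) - I_vwap)
print(gap_early, gap_late)
```

In this toy setup, an early point plus the final point strictly dominates the VWAP statistic, while two late points do not, which mirrors the role of early trade prices described above.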

Market Impact Modeling
As one of the central themes in the quantitative finance and market microstructure literature, the modeling of market impact is of great interest to practitioners and academic researchers. There is a wide range of literature on market impact modeling, of which we only give a partial review here (for a more comprehensive review, see, e.g., [Gatheral and Schied, 2013, Zarinelli et al., 2015]). As one of the most well-known and widely adopted market impact models, the Almgren-Chriss model in the influential papers of [Almgren and Chriss, 1999, Almgren and Chriss, 2001] quantifies two distinctive components of price impact: a temporary impact induced by and affecting only an individual's ongoing trade, and a permanent impact affecting all current and future trades equally. Under the Almgren-Chriss model, [Almgren, 2003] extended the analysis in [Grinold and Kahn, 2000, Loeb, 1983] and solved the optimal execution problem by framing it as a mean-variance optimization between the expected execution cost and its variance (representing the uncertainty/liquidity risk during execution). The optimal execution problem was also investigated, from an optimal control perspective, by [Bertsimas and Lo, 1998, Forsyth et al., 2012, Forsyth, 2011, Gatheral and Schied, 2011] under the geometric Brownian motion assumption for the unaffected stock price process, rather than the arithmetic Brownian motion (ABM) assumption in the Almgren-Chriss model. The discrete- and continuous-time variants of these models ([Bank and Baum, 2004, Almgren and Lorenz, 2007, Brunnermeier and Pedersen, 2005, Carlin et al., 2007, Cetin et al., 2010]) are collectively termed by [Curato et al., 2017] "first-generation" market impact models, as opposed to the "second-generation" models ([Bouchaud et al., 2003, Bouchaud et al., 2006, Gatheral et al., 2012, Gatheral, 2010, Obizhaeva and Wang, 2013, Alfonsi et al., 2010]). The "second-generation" models, among which the propagator model is perhaps the most notable representative, postulate that the price impact should be neither permanent nor temporary, but transient, as it decays over time [Moro et al., 2009, Lehalle and Dang, 2010]. As one of the pioneering "second-generation" models, [Obizhaeva and Wang, 2013] proposed a model with linear transient price impact and exponential decay, by modeling an exponential resilience for the LOB. The model based on the dynamics of the LOB was further developed by [Alfonsi et al., 2010, Alfonsi et al., 2008, Gatheral et al., 2012, Curato et al., 2017, Avellaneda and Stoikov, 2008, Bayraktar and Ludkovski, 2014, Cont et al., 2014, Guéant et al., 2012] to include non-linear price impact, as well as LOBs with general shape functions and time-dependent resilience. On the other hand, instead of modeling the dynamics of the LOB, the discrete-time and continuous-time propagator models developed by [Bouchaud et al., 2003, Gatheral, 2010] directly model the dynamics of the price, using decay kernels to reflect the transient nature of the market impact. Detailed discussions on the connection and comparison between these two approaches, as well as further generalizations of propagator models, can be found in [Bacry et al., 2015, Gatheral et al., 2012, Alfonsi and Schied, 2010, Donier et al., 2015, Tóth et al., 2011, Curato et al., 2017]. Finally, aside from the aforementioned approaches focusing on the interactions between large trades and price changes, other approaches from alternative perspectives have also provided many valuable insights into the price impact dynamics. For example, [Cont et al., 2014] investigates how price changes are driven by order flow imbalance in the order book events (e.g., quote events); [Cardaliaguet and Lehalle, 2018] investigates the optimal execution problem in a mean-field game setting, where the trader strategically deals with the uncertainty in price and other market participants.

Stochastic Process and Asymptotic Statistical Inference
In this section, we briefly review some of the basic concepts from asymptotic statistical inference and discuss how they relate to our setting, where the underlying price evolves as a continuous-time stochastic process. In the market impact literature (see, e.g., [Gatheral et al., 2012, Almgren and Chriss, 2001]), it is typical to assume the "unaffected" stock price follows an arithmetic Brownian motion (ABM). The term unaffected price process, used in [Gatheral et al., 2012], refers to the price determined by market participants other than ourselves, i.e., the diffusion excluding the drift component. As mentioned in [Almgren and Chriss, 2001], while it could be beneficial to consider geometric Brownian motion (GBM) or correlation in price movements for longer investment horizons, ABM remains a suitable model for the unaffected stock price (no drift term or no information on the direction of movement) over the short-term horizon, and its difference from GBM in this regime is almost negligible. In fact, it was shown in [Gatheral and Schied, 2011] that, under the Almgren-Chriss model, the cost-risk efficient frontiers under the GBM and ABM assumptions are "almost identical", and in [Almgren and Chriss, 2001] that the improvement gained by incorporating short-term serial correlation in price movements is also small. Indeed, as spelled out in [Merton and Samuelson, 1992], most theoretical models in finance use a continuous-time diffusion process (or a general continuous-time Markov process) driven by stochastic differential equations (SDE). However, as documented in [Grenander, 1950], the extensive literature on stochastic processes "rarely touched upon" questions of inference. A notable reference for applying an asymptotic approach to statistical estimation/inference for SDEs is [Bishwal, 2007], where the author established the asymptotic properties of the MLE using the Radon-Nikodym derivative (likelihood) when the whole continuous sample path from the SDE can be observed and sampled. Yet, in practice, the observed data can only be discretely sampled in time, and most estimation methods also require discrete input [Aït-Sahalia and Mykland, 2004, Aït-Sahalia, 2008]. Thus, the parameter estimation and inference problem for discretely, or sometimes non-synchronously/randomly, observed diffusion processes (e.g., high-frequency trading) is of much more practical importance.
The estimation of drift parameters in diffusion processes based on discrete observations has been studied by a few authors (see [Bishwal, 2007] for a comprehensive review). Although a variety of estimation methods for such data exists, a large portion of these studies utilized (quasi-)likelihood-based estimation/inference [Bibby et al., 2010]. While maximum likelihood estimation (MLE) can be a natural choice, the key difficulty is that, although the Markovian nature of diffusion processes allows one to readily calculate the log-likelihood function of discretely sampled observations simply as the sum of the log-transition densities of successive pairs, the transition densities themselves from one time point to another generally do not have finite or known analytic forms, except in some special cases. In order to implement efficient MLE-based methods, many attempts have been made to approximate the likelihood function, leading to approximate maximum likelihood estimators (AMLE). Notably, the ground-breaking series of works [Aït-Sahalia, 2002, Aït-Sahalia and Mykland, 2004, Aït-Sahalia, 2008] proposed asymptotic expansions of the transition density of the diffusion process that can be used for approximation. Following this line of work by Aït-Sahalia, various subsequent analyses and noteworthy applications have been developed (see [Li, 2013, Chang and Chen, 2011] for a review), as well as other numerical, simulation-based approaches to approximate the likelihood (see, e.g., [Pedersen, 1995]).
Fortunately, in market impact models, by invoking the ABM assumption for the price process, one can evade the technical difficulty of transition density approximation for likelihood-based estimation of the impact (i.e., the drift term), because the joint distribution of discrete price observations follows a multivariate Gaussian. Within this canonical statistical model, a wide range of basic results in asymptotic inference, especially estimation for regular parametric models, is readily available. Precise definitions of regular experiments or regular parametric models can be found in [Bickel et al., 1993], which we also specify in the appendix for completeness. As termed in [Bickel et al., 1993], a regular parametric statistical experiment has a "nice" parameter space θ ∈ Θ ⊆ R^K and a density "smooth" in θ. Most importantly, a regular statistical experiment possesses a non-singular Fisher information matrix at each point θ ∈ Θ. As we shall see, the relationship between the MLE and the Fisher information matrix plays a crucial role in our understanding of asymptotic optimality.
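Since the increments of an ABM with deterministic drift are independent Gaussians, the exact log-likelihood of discretely sampled prices is available in closed form, with no transition-density approximation. A minimal sketch (the drift function and parameter values below are hypothetical illustrations, not the paper's calibrated models):

```python
import numpy as np

def loglik_abm(prices, times, drift_fn, theta, sigma):
    """Exact log-likelihood of discrete observations of P_t = drift_fn(t, theta) + sigma*W_t.
    Increments over disjoint intervals are independent N(dm_i, sigma^2 * dt_i),
    so the joint density factorizes into Gaussian increment densities."""
    t = np.concatenate(([0.0], np.asarray(times, float)))
    p = np.concatenate(([0.0], np.asarray(prices, float)))
    dt = np.diff(t)
    resid = np.diff(p) - np.diff(drift_fn(t, theta))
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2 * dt)
                        - resid**2 / (2 * sigma**2 * dt)))

# Hypothetical linear (permanent-only) impact: drift theta * t.
drift = lambda t, theta: theta * t
rng = np.random.default_rng(1)
times = np.linspace(0.01, 1.0, 100)
dt = np.diff(np.concatenate(([0.0], times)))
w = np.cumsum(rng.normal(0.0, 1.0, 100) * np.sqrt(dt))
prices = 1.0 * times + 0.1 * w            # simulated under theta = 1, sigma = 0.1
ll_true = loglik_abm(prices, times, drift, 1.0, 0.1)
ll_wrong = loglik_abm(prices, times, drift, 2.0, 0.1)
print(ll_true, ll_wrong)
```

The exact likelihood is what makes the Fisher information computations in the following sections tractable: the score and information matrix follow directly from this Gaussian form.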

Asymptotic Optimality and Experimental Design
The MLE is a ubiquitous method in statistical inference and has many desirable properties in terms of efficiency, feasibility, and generality. In fact, it has been argued that the MLE attains asymptotic optimality among the class of regular estimators (precise definitions of regular estimators can be found in [Van der Vaart, 2000], which we specify in the appendix for completeness; intuitively, a regular estimator admits considerable regularity in that a small change in the parameters does not change the distribution of the estimator too much). For example (see Sections 7 and 8 in [Van der Vaart, 2000] for the results stated below), in this regime, the local asymptotic normality (LAN) and Lipschitzness of the log-likelihood can be used to establish the √n-convergence of the MLE to the true parameter, with a Gaussian limiting distribution whose covariance matrix is the inverse of the Fisher information matrix. This limiting property attained by the MLE, as shown in the Hájek-LeCam convolution theorem and its variants, is the "best" limiting distribution asymptotically for any regular estimator, in the sense that it 1) is locally asymptotically minimax for any bowl-shaped loss function, i.e., a non-negative function with level sets convex and symmetric around the origin, 2) achieves the lowest possible variance (i.e., a quadratic form based on the inverse of the Fisher information matrix) among asymptotically regular sequences of estimators, and 3) can only be improved upon on a Lebesgue null set of parameters. The asymptotic efficiency of the MLE has also been discussed in the sense of Bahadur's asymptotic efficiency or C. R. Rao's efficiency (see [Ibragimov and Has'Minskii, 2013]). A more well-known result, regarded as a simpler version of the Hájek-LeCam convolution theorem, is the Cramér-Fréchet-Rao information lower bound, which also establishes the asymptotic variance lower bound as the inverse of the Fisher information under unbiasedness. Notice the various prerequisites one must declare before claiming the asymptotic optimality of the MLE. This is not a mere technicality because, aside from the fact that the optimality criterion is not singular in nature, various counter-examples exist outside the confines of such conditions. For example, it is well known that the James-Stein shrinkage estimator ([Lehmann and Casella, 2006]) achieves strictly smaller risk than the MLE (i.e., the sample mean) for estimating the mean of a K ≥ 3-dimensional multivariate Gaussian with identity covariance matrix under quadratic loss. However, the James-Stein estimator is not regular, and its improvement over the MLE holds in finite-sample scenarios, not asymptotic ones. For a counter-example with asymptotic improvement on a Lebesgue null set, one can check the famous Hodges' estimator [Van der Vaart, 2000].
The discussion above does not aim to debate whether one should necessarily use the MLE for the estimation of market impact models. Rather, one can make the observation that the asymptotic variance attained by the MLE, being the "best" (or "lowest") possible as the inverse of the Fisher information matrix, quantifies an upper limit on how efficiently one can learn the parameters from a given statistical experiment. As a result, one naturally asks whether one can, by designing statistical experiments that maximize the Fisher information in some sense (more on this below), reduce the uncertainty in parameter estimation. Indeed, this line of work is pursued extensively in the experimental design literature, where the Fisher information matrix has been used to measure the amount of information gained and to design optimal experiments. For example, recently [Durant et al., 2021] used the Fisher information to optimize the experimental design in neutron reflectometry. In this paper, we investigate whether statistical experiments based on the price trajectory are more efficient than ones based on certain summary statistics. In the experimental design literature (see, e.g., [Fedorov, 2010, Whittle, 1973, Wolkenhauer et al., 2008, Chaloner and Verdinelli, 1995]), unifying, single optimality criteria for designing experiments have been studied. Traditional methods include maximizing the expected trace, minimal eigenvalue, or determinant of the Fisher information, corresponding to so-called A-optimality, E-optimality, or D-optimality. However, we note that in the case of estimating multiple parameters, there is an inherent difficulty in estimating all of them accurately, as an optimal way to estimate one particular parameter may not be optimal for the others, especially in the context of market impact models, where the scales of the parameters (or their variances) are vastly different. As a consequence, the meaning of the traditional criteria becomes less clear, unless the Fisher information from one experiment strictly dominates the other (i.e., their difference matrix is positive semi-definite), which is what we focus on showing in this paper.
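For concreteness, the scalar criteria and the dominance relation can be sketched as follows (a minimal illustration with toy matrices; the values carry no empirical meaning):

```python
import numpy as np

def design_criteria(I):
    """Classical scalar summaries of a Fisher information matrix I."""
    return {"A": float(np.trace(I)),
            "E": float(np.linalg.eigvalsh(I).min()),
            "D": float(np.linalg.det(I))}

def dominates(I1, I2, tol=1e-10):
    """True iff I1 - I2 is positive semi-definite: experiment 1 is then at least
    as informative as experiment 2 for every smooth functional of theta."""
    return bool(np.linalg.eigvalsh(I1 - I2).min() >= -tol)

# Toy 2x2 information matrices for two hypothetical experiments.
I1 = np.diag([2.0, 1.0])
I2 = np.diag([1.0, 0.5])
print(design_criteria(I1), dominates(I1, I2), dominates(I2, I1))
```

Dominance in this positive semi-definite sense is strictly stronger than improving any single scalar criterion, which is why it is the notion adopted in this paper.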

Roadmap and Outline
In this section we clarify the roadmap and main contributions of this paper. As mentioned above, although we discuss and quantify the asymptotic efficiency of the MLE (based on which numerical simulation is conducted), the main discourse concentrates on comparing the Fisher information of different statistical experiments or designs. The MLE serves as a natural candidate for numerical verification, and its asymptotic optimality is quantified by the Fisher information; this also justifies the comparison among statistical experiments based on the dominance of Fisher information. To this end, we propose a method for sampling price trajectory data from the underlying continuous-time stochastic process assumed in the market impact models that outperforms methods based on common summary statistics, where efficiency is measured by the Fisher information. The sampling scheme and the analysis revolving around it do not rely on any discretization of the continuous process, although we do consider, for theoretical purposes, discretized price trajectories with fixed increments approaching 0 (similar to the canonical model in [Aït-Sahalia, 2002], with equally spaced discretization). In particular, in the discussion about maximizing the Fisher information, we borrow insight from the concept of sufficiency in an idealized discretization of the price trajectory. The flexibility regarding the sampling scheme is an important practical consideration because, although an experimental design with the number of samples along a single trajectory approaching infinity could theoretically improve the Fisher information, in reality the price trajectory at a scale finer than a certain threshold would not behave as a continuous-time diffusion process, and one would practically want to design robust experiments based on fewer representative trajectory data (e.g., 3 or 4 price points). We focus our discussion in this spirit.
The remainder of the paper is organized as follows. In Section 2, we present the basic market impact estimation framework given price trajectory data, including technical lemmas about the required conditions. In Section 3, we investigate two popular market impact models: Almgren-Chriss and the family of propagator models. The main result for the Almgren-Chriss model, connecting asymptotic efficiency and the sampling of three trajectory points, is presented in Theorem 1. The main result for propagator models is presented in Theorem 2. In sharp contrast with the Almgren-Chriss result, it states that the only sufficient statistic is the full trajectory data when considering general instantaneous and kernel functions. Section 3 also shows a numerical study comparing different sampling strategies against VWAP-based estimation methods, where the importance of early price data is explored. The last section concludes with discussions of some limitations of the theorems, specifically on both model misspecification and model selection. Simulation and empirical results are placed within each section. Reviews of basic concepts, proofs, and technical conditions are left to the Appendix.
Basic Setup and Framework

Background and Model
Throughout this paper, we assume the metaorder is a buy program so that we do not need to specify the sign of a trade. The case of a sell program can be derived analogously. We first focus our discussion on the VWAP (Volume Weighted Average Price) execution strategy, in which the trading rate ẋt = v is constant in volume time; as is standard in the market impact literature, time is measured by traded volume, or volume time, instead of physical time. Typically, volume time is scaled to adjust for different levels of trading activity during the day, but in this paper we do not actively distinguish the two times. Consequently, a VWAP execution strategy aims to trade equally/evenly in volume time, which implies trading at a constant proportion of the current traded volume of the stock. In this setting, VWAP strategies can be identified as strategies with a constant trading rate; see [Almgren et al., 2005, Gatheral and Schied, 2013]. Note that this is not a restriction on the order types, since we are considering the estimation rather than the optimal execution problem. In particular, different execution strategies can be approximated by sequences of interval VWAP strategies with different trading rates (see Remark 3.1 of [Curato et al., 2017]), and a provably reliable estimation procedure for VWAP executions provides insight for non-VWAP orders as well. Thus, the discussion of non-VWAP strategies is deferred to Section 3.3.
We consider a continuous-time model for the evolution of the underlying stock price during the execution of a metaorder over 0 ≤ t ≤ T:

St = S0 + µθ(t, v) + σWt, (1)

where t represents time. The trade duration T and trading rate v are given beforehand, i.e., the SDE in (1) is conditional on (v, T). For a buy program, the drift term µθ(t, v) in (1) is initialized with µθ(0, v) = 0 and is typically a concave function in t [Nadtochiy, 2022, Lillo et al., 2003] (various forms of µθ(t, v) will be discussed; see also [Almgren et al., 2005, Gatheral, 2010]), representing the price impact generated from trading at rate v for a period of t. Moreover, {µθ}θ∈Θ is a family of impact functions parameterized by θ ∈ Θ ⊆ R^K, with Θ the parameter space. Given fixed (v, T) and θ ∈ Θ, {St}t∈T is defined on some filtered probability space (Ω, {Ft}t∈T, Pθ), where Wt is a standard Brownian motion, σ is the volatility, and T is the time span (typically T = [0, T] for some T). As is typically the case for market impact models (and in practice), we only aim to calibrate the function µθ(t, v), while we assume σ is fixed or estimated separately. An execution strategy is represented by a continuous function of time {xt}t∈T (here we write xt instead of Xt, but in general xt can also be random and adaptive w.r.t. Ft) with x0 = 0 and xT = X indicating the number of shares bought by time t, where X is the metaorder size and T is the order duration measured in volume time. In general, the impact function µ depends on the entire trajectory of the trading rate {ẋt}0≤t≤T, but for a VWAP strategy with constant ẋt = v, we can reduce this dependence to the single variable v.
As in [Zarinelli et al., 2015], we characterize two metaorder features (or hyper-parameters): the participation rate v and the duration T. In particular, if VD is the daily traded volume and VM is the volume traded by the whole market during the order execution period, then we define where both quantities are unitless and lie in [0, 1]. Notice this definition would suggest that T · v = X/VD, making the order size X effectively scaled by a factor of 1/VD. For ease of notation, we henceforth assume VD = 1 and treat X as already scaled. Moreover, as we do not actively distinguish between physical time and volume time, we retain X = Tv. Below are our two canonical classes of models based on (1).
Example 1. As a first example, in the Almgren-Chriss model [Almgren et al., 2005], the price dynamics are: where, in view of (1), one can see µ(t, v) = S0(g(v)t + h(v)) and the volatility is σS0, i.e., σ scaled by S0. As we shall discuss in detail, g represents the permanent impact and h the temporary impact. As S0 can be viewed as fixed, and execution cost is typically expressed in basis points (bps), the affine transformation Pt = (St − S0)/S0 is carried out in [Almgren et al., 2005], which reduces the dynamics of Pt to the canonical form (1) with µ(t, v) = g(v)t + h(v) and volatility σ.

Statistical Experiment and Asymptotic Inference
In the context of this paper, rather than a statistical model, we speak of a statistical experiment constructed from discrete observations or statistics sampled from (1), as we cannot observe the whole path of the SDE. We briefly review some basic concepts of statistical experiments and asymptotic inference, with technical proofs and additional related material provided in the Appendix. We first lay out some assumptions.
Assumption 1 (Model Specification and Identifiability). Θ is an open subset of R^K and there exists a unique θ* ∈ Θ that is the true parameter of the SDE in (1) for all (v, T) almost surely (a.s.).
Assumption 2 (Metaorder Characteristics). The distribution of (v, T) in the metaorder data does not depend on θ and follows some exogenous distribution function G_order. We write (v, T) ∼ G_order with pdf (or p.m.f.) g_order, i.e., G_order(dv, dT) = g_order(v, T)dvdT. We further assume (v, T) ∼ G_order satisfies vL ≤ v ≤ vU and TL ≤ T ≤ TU a.s., for some constants 0 < vL ≤ vU < ∞ and 0 < TL ≤ TU < ∞.
Remark 1. We make a couple of remarks on the assumptions above. We first note that assumption 1 states that (1) is well-specified within the parametric family; the discussion of model misspecification is deferred to Section 5. The uniqueness of θ* implicitly requires certain identifiability conditions within the parametric family. This is relatively easy to satisfy when the support of G_order is not too narrow because, as long as one parametrizes µθ carefully, it would then be difficult to find θ1 ≠ θ2 such that µθ1(t, v) = µθ2(t, v) for all 0 ≤ t ≤ T and all (v, T) pairs almost surely. We shall also see later that such a "sufficient variability" condition on G_order is crucial in our discussion of the nonsingular Fisher information matrix. Such a condition on G_order is also not restrictive, as metaorder data typically contain various execution styles and order characteristics reflecting the demands or specifications of clients, resulting in a broad range of values for (v, T) in practice (see the figures on the empirical distribution of (v, T) from real metaorder data in [Zarinelli et al., 2015]). For assumption 2, the lower and upper bounds on (v, T) are reasonable, as one typically does not trade too slowly or too fast. On the other hand, the exogeneity of G_order in assumption 2 may not be as straightforward. Indeed, although the distribution G_order undoubtedly depends on many exogenous factors such as client requests or trading styles, it is not immediately clear whether one can claim G_order has no dependence on θ. For example, it is conceivable that traders would, over time, estimate the parameter up to some accuracy and adapt their trading strategy accordingly for all the exogenous requirements, to satisfy certain optimal trading schedules (e.g., [Almgren, 2003]), resulting in a "shift" of G_order towards their acquired knowledge of θ. In this paper, we assume such dependence is negligible, but we do note that assumption 2 could be a source of bias and should be a subject of future deliberation.

Statistical Experiment & Regularity Conditions
We are now ready to discuss the experiment design derived from (1). In this paper, we primarily consider statistical experiments consisting of discrete observations or summary statistics of the following form: where X ≜ {v, T} ∪ S ∪ J. Here S consists of N price trajectory observations (i.e., observations from a single trajectory) with τi ≜ ti/T fixed; J consists of M summary statistics proportional to the average cost (or price) over certain time windows, with τ1j ≜ t1j/T and τ2j ≜ t2j/T fixed. We simply write Pθ (and the density function pθ as well) when there is no ambiguity about the sample space in question. One can view a sample X under (4) as generated as follows: first sample (v, T) ∼ G_order, then generate a sample path from SDE (1), and finally record S and J.
Based on (1), it is straightforward to see that, conditional on (v, T), the sample S ∪ J follows an (N + M)-dimensional multivariate Gaussian distribution with a mean function µ(θ, T, v) and a covariance matrix Σ(T), both of which can be readily computed using elementary Itô calculus on standard Brownian motion (e.g., Cov(Ws, Wt) = min(s, t)). A notable consequence is that Σ(T) has no dependence on θ or v.
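The covariance structure described above can be built directly from Cov(W_s, W_t) = min(s, t). A minimal sketch for the trajectory part of the sample, with a Monte Carlo sanity check of the formula (the grid and function name are our own):

```python
import numpy as np

def trajectory_covariance(times, sigma):
    """Conditional on (v, T): Cov(S_s, S_t) = sigma^2 * min(s, t),
    which follows from Cov(W_s, W_t) = min(s, t)."""
    t = np.asarray(times, dtype=float)
    return sigma**2 * np.minimum.outer(t, t)

# Monte Carlo sanity check against simulated Brownian paths (sigma = 1)
times = np.array([0.05, 0.10, 0.15, 0.20])
rng = np.random.default_rng(0)
dt = np.diff(np.concatenate([[0.0], times]))
dW = rng.normal(0.0, np.sqrt(dt), size=(200_000, len(times)))
W = np.cumsum(dW, axis=1)
empirical = np.cov(W, rowvar=False)
```

Note that the resulting matrix depends on the sampling times and σ only, consistent with Σ(T) having no dependence on θ or v.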
For example, conditional on (v, T), the cost of VWAP execution C_VWAP = v∫₀ᵀ St dt − XS0 follows a Gaussian distribution as in (5), with mean c_vwap(θ, T, v). As another example, given a single sample corresponding to price trajectory data X = (v, T) ∪ {Sti}i∈[N], the log-likelihood (derived using the independent increments of St) is: Then, given n total order samples Xj, executed by VWAP strategies under (Tj, vj), with {Sj}1≤j≤n being the j-th price trajectory data Sj = {SτiTj}i∈[N], the MLE is the estimator that maximizes the log-likelihood: where the second equality follows from (10).
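A minimal sketch of this MLE construction over n orders, assuming a hypothetical scalar impact µ_θ(t, v) = θvt and the increment-based Gaussian log-likelihood described above (all numeric values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma, theta_star, N = 0.1, 1.5, 10     # hypothetical volatility, true parameter, grid size

def simulate_increments(v, T):
    """Price increments of one order under the hypothetical impact mu_theta(t, v) = theta*v*t."""
    dt = T / N
    return theta_star * v * dt + rng.normal(0.0, sigma * np.sqrt(dt), size=N)

orders = [(v, T, simulate_increments(v, T))
          for v, T in zip(rng.uniform(0.1, 1.0, 2000), rng.uniform(0.1, 0.3, 2000))]

def neg_log_lik(theta):
    # Sum of Gaussian increment log-densities over all n orders, up to constants
    total = 0.0
    for v, T, dS in orders:
        dt = T / N
        total += np.sum((dS - theta * v * dt) ** 2) / (2 * sigma**2 * dt)
    return total

theta_hat = minimize_scalar(neg_log_lik, bounds=(0.0, 5.0), method="bounded").x
```

With 2000 simulated orders the maximizer lands close to the true θ, illustrating the consistency discussed in Section 2.2.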
Remark 2. We make a couple of remarks on assumption 3. Assumption 3.1 is standard. Assumption 3.2 relates to the design of the Gaussian experiment, where one must choose S ∪ J so that they are not linearly dependent (otherwise one can simply delete redundant observations); this ensures Σ(T) ≻ 0. The uniform lower bound on its eigenvalues over all T, as we shall see, hinges on the lower and upper bounds on T in assumption 2. For assumption 3.3, one can simply check whether µθ(t, v) in (1) is continuously differentiable for all (v, T) in the bounded support of G_order. Assumption 3.4 ensures Lipschitz continuity of µθ(t, v) in (1) for all (v, T) in the bounded support of G_order. Assumption 3.5 differs from assumption 1, as Pθ is a projection of the probability measure induced by SDE (1) onto the finite-dimensional space (4). Intuitively, the higher the dimension of θ, the higher the dimension of X needed for model identifiability in (4). Definition 1. With assumption 3.3 in place, we write the derivative of µ(θ, T, v) w.r.t. θ as a Jacobian matrix. Consequently, given θ0 ∈ Θ, we can define the Fisher information matrix in R^{K×K} for the experiment (4): where the subscript in E_{θ0} denotes expectation under P_{θ0} and the subscript in I_X denotes the experiment design in (4) based on sample X.
Before we discuss the relevance and importance of the Fisher information, we first provide a known result which allows its convenient computation in a Gaussian experiment (5).
Based on Lemma 1, we have the following Proposition.
Proposition 1. Under assumptions 2 and 3, the statistical experiment (4) has Fisher information (13) given by Proof. Due to assumption 2, we can infer from (6) that Since g_order(v, T) does not depend on θ by assumption 2 and S ∪ J is multivariate Gaussian conditional on (v, T), we can invoke Lemma 1 and the tower property of conditional expectation to write where I_{X|(v,T)} is computed from the conditional likelihood, treating (v, T) as fixed.
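Proposition 1 suggests a direct numerical recipe: average the conditional Gaussian information J^T Σ^{-1} J over draws of (v, T) ∼ G_order. A sketch under a hypothetical scalar model µ_θ(t, v) = θvt, for which the conditional information reduces analytically to v²T/σ², so the Monte Carlo average can be cross-checked:

```python
import numpy as np

sigma = 0.1                        # hypothetical volatility
tau = np.linspace(0.1, 1.0, 10)    # fixed relative sampling times tau_i = t_i / T

def conditional_fisher(v, T):
    """I_{X|(v,T)} = J^T Sigma^{-1} J for the hypothetical scalar model mu_theta(t, v) = theta*v*t."""
    t = tau * T
    Sigma = sigma**2 * np.minimum.outer(t, t)   # Brownian covariance of the trajectory sample
    J = v * t                                   # d mu / d theta (a single column)
    return J @ np.linalg.solve(Sigma, J)

# Proposition 1: average the conditional information over (v, T) ~ G_order
rng = np.random.default_rng(2)
v_draw = rng.uniform(0.1, 1.0, 20_000)
T_draw = rng.uniform(0.1, 0.3, 20_000)
fisher_mc = np.mean([conditional_fisher(v, T) for v, T in zip(v_draw, T_draw)])
fisher_closed_form = np.mean(v_draw**2 * T_draw) / sigma**2   # v^2 T / sigma^2 for this model
```

For this model the increment decomposition gives I_{X|(v,T)} = v²T/σ² exactly, so the two quantities agree up to floating-point error.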
Lastly, we present our final assumption which is essential for establishing regular parametric models.
Assumption 4 (Nonsingular Fisher Information).We assume the Fisher information I X (θ) in (15) is well-defined, non-singular and continuous in θ.
As we shall see, assumption 4 is related to G_order. Simply put, it would be hard to determine a high-dimensional θ in a market impact model if the trading style (v, T) is too singular, as one might struggle to separate the different impact effects of v and T. One needs to check assumption 4 on a case-by-case basis. For example, a point mass distribution for G_order (i.e., a single pair (v, T)) is almost never capable of fitting a good market impact model.

Asymptotic Optimality and Fisher Information
Now we present some consequences of assumptions 1-4. The results discussed here relate to basic concepts in asymptotic inference, such as regular parametric models, local asymptotic normality (LAN), and regular estimators. We do not dive into the details of these established results and defer both the proofs and the references for these concepts to the appendix. The purpose of the following proposition is to establish that, under fairly reasonable criteria and within a fairly broad class of estimators, the MLE achieves asymptotic optimality under the aforementioned conditions, and that this optimality is quantified by the Fisher information.
Then, under assumptions 1-4, the MLE is consistent for θ*, i.e., θ̂n → θ* in probability, and √n(φ(θ̂n) − φ(θ*)) is asymptotically normal for any φ(θ) differentiable in θ. Moreover, for any bowl-shaped loss function l (i.e., {x : l(x) ≤ c} is convex and symmetric around the origin) and any estimator sequence {Tn}n≥0, the local asymptotic minimax bound (16) holds, where the supremum is taken over all finite subsets I of R^K. The proof is presented in the appendix. The argument follows Section 8.7 of [Van der Vaart, 2000], where the asymptotic optimality of the MLE and its Fisher information is argued from the local asymptotic minimax perspective. Many other optimality criteria exist and are discussed in the Appendix as well. The main takeaway is that, if (16) is taken as the basis for asymptotic optimality, one should aim to design an experiment that maximizes the Fisher information. To see why, consider two statistical experiments (4) based on different designs X and Y (e.g., price trajectory versus total cost), and suppose one wants to predict the average cost cvwap(θ*) = ∫ cvwap(θ*, T, v) G_order(dv, dT) based on θ̂n,X or θ̂n,Y. Then, if cvwap(θ) is differentiable w.r.t. θ with gradient ∇θcvwap, one can show, based on (16) and the delta method [Van der Vaart, 2000], that the asymptotic variances of the plug-in estimators satisfy (18), which implies θ̂n,X has lower asymptotic variance for estimating cvwap(θ*), or any other differentiable function of θ, simply by virtue of its greater Fisher information.
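The variance comparison in (18) can be checked numerically: if I_X ⪰ I_Y in the Loewner order, then the delta-method variance ∇c^T I^{-1} ∇c is smaller under I_X. A sketch with hypothetical 2×2 information matrices and a hypothetical gradient:

```python
import numpy as np

def asymptotic_variance(fisher, grad):
    """Delta-method asymptotic variance grad^T I^{-1} grad of a plug-in estimator."""
    return grad @ np.linalg.solve(fisher, grad)

# Hypothetical 2x2 Fisher information matrices with I_X >= I_Y in the Loewner order
I_Y = np.array([[4.0, 1.0], [1.0, 3.0]])
I_X = I_Y + np.array([[1.0, 0.5], [0.5, 1.0]])   # added matrix has eigenvalues 1.5 and 0.5 > 0

grad_c = np.array([0.7, -0.2])   # gradient of some differentiable functional c(theta)
var_X = asymptotic_variance(I_X, grad_c)
var_Y = asymptotic_variance(I_Y, grad_c)
```

The ordering var_X ≤ var_Y follows from the anti-monotonicity of matrix inversion in the Loewner order.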

Sufficiency and Other Lemmas
In this section, we introduce several technical lemmas that we shall use later. Following the previous discussion, one might want to maximize the Fisher information of an experiment (4). An important related concept is that of a sufficient statistic. Formally, given X in some statistical experiment (4) and a function φ, one can create another statistical experiment using the statistic φ(X). This typically incurs a loss of information in terms of the Fisher information, unless the statistic is sufficient. Here, a statistic φ(·) is sufficient for {Pθ}θ∈Θ if the conditional distribution of X given φ(X) is free of θ under Pθ for every θ ∈ Θ. We also give a formal characterization, which we shall use later.
Lemma 2 (Neyman-Fisher Factorization Theorem). Consider the statistical experiment (4) and its sample X. A statistic φ(X) is sufficient if and only if the likelihood function in (4) admits the factorization for some non-negative functions h(·), g(·).
The following lemmas relate a sufficient statistic with the Fisher information.
Lemma 3 (Data Processing Inequality [Zamir, 1998]). Consider the statistical experiment (4) and its sample X. Let φ(X) be a statistic of the data; then the statistical experiment based on φ(X) satisfies, for θ ∈ Θ, with equality if and only if φ(X) is a sufficient statistic for the original statistical experiment.
Lemma 4 (Data Refinement Inequality). Consider the statistical experiment (4) and its sample X. Let U and V be two statistics of the experiment; then the statistical experiments based on them satisfy, for θ ∈ Θ, Moreover, if V = φ(U) for some bijective mapping φ(·), then The proofs of Lemmas 3 and 4 are in the Appendix for completeness. In other words, the Fisher information of the experiment using φ(X) is generally smaller than that of the one using X, unless the statistic φ(X) is sufficient, in which case X provides no extra information over φ(X) for estimating θ (hence the equality in (19)). Note that the original sample X is always, trivially, a sufficient statistic. However, the whole of X is not always required. Example 4. Consider a simple model under (1), a random walk with drift, expressed by St = S0 + θf(v)t + σWt for some f(·). Let X = (v, T) ∪ S be the statistical experiment in (4), where S = {Sti}i∈[N]. It is straightforward to check that the log-likelihood in (10) can be reduced to h(S, T, v) + g(ST − S0, T, v, θ) for some h, g (i.e., by isolating the terms involving θ). Using Lemma 2, we see that (ST − S0, v, T) is sufficient. Thus, for this model, the price trajectory S is not needed for the estimation of θ; the difference between the end price and the start price suffices (in fact, ST suffices, since S0 is known).
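Example 4 can be verified numerically: for the random walk with drift, the full-trajectory MLE collapses to a function of S_T − S_0 alone. A sketch with hypothetical values of f(v), σ and the true θ:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, f_v, T, N, theta_star = 0.2, 0.8, 1.0, 50, 1.3   # hypothetical values

# One path of S_t - S_0 = theta * f(v) * t + sigma * W_t observed on a grid
t = np.linspace(0.0, T, N + 1)
dt = np.diff(t)
dS = theta_star * f_v * dt + rng.normal(0.0, sigma * np.sqrt(dt), size=N)

# Full-trajectory MLE: weighted least squares on the independent increments
theta_full = np.sum(f_v * dt * dS / dt) / np.sum((f_v * dt) ** 2 / dt)

# Estimator using only the sufficient statistic S_T - S_0
theta_suff = np.sum(dS) / (f_v * T)
```

The weighted-least-squares algebra collapses to Σᵢ dSᵢ / (f(v)T), so the two estimators coincide exactly, as sufficiency predicts.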
In the next two sections, we examine sampling schemes from the price trajectory in two popular market impact models: the Almgren-Chriss model and the propagator model. In particular, given Lemma 3, which establishes the connection between sufficient statistics and the Fisher information, we are able to gain insight into designing efficient statistical experiments.

The Almgren-Chriss Model
The Almgren-Chriss model remains one of the most popular and widely adopted market impact models since its introduction [Almgren and Chriss, 2001], and the paper [Almgren et al., 2005] is dedicated to the parameter estimation problem for the model: In the Almgren-Chriss model, g is the permanent impact function and h is the temporary impact function.
In [Almgren et al., 2005], h and g are parameterized by power laws, and several practical modifications are made: 1. In order to facilitate the estimation of the permanent impact in the Almgren-Chriss model, the price trajectory S includes an additional price point S_Tpost with Tpost > T; this can also be seen from the "piece-wise" dynamics in (20). [Almgren et al., 2005] indicated that choosing Tpost to be 30 minutes after T works well. Accordingly, in the experiment design (4), we fix a constant τ_delay so that Tpost = T + τ_delay, with τ_delay = 0.077 (i.e., a 30-minute delay in a day with 6.5 trading hours). In this manner, we remove the randomness in choosing Tpost given T, similar to the way we use fixed τi to choose ti = τiT.
2. The impact functions g and h are further scaled by the volatility σ. For a single stock, this can be subsumed in the coefficients γ, η in (21).
3. A liquidity factor, in the form of shares outstanding raised to a power-law exponent, is inserted to reflect the fact that market impact depends not only on (T, v) but also on the stock's liquidity condition. For a single stock, this can also be subsumed in γ, η in (21).
We will again discuss these modifications in detail in the simulation section, where we reproduce and compare with the simulation results of [Almgren et al., 2005]. In summary, these modifications facilitate a cross-sectional description of market impact, although we focus on single-stock analysis in this paper. Finally, for our discussion of (20), aside from the assumptions in Section 2.2, we do not restrict the exact forms of h(·; θ) and g(·; θ), i.e., we allow forms other than the power-law parametrization (21). With the affine transformation Pt = (St − S0)/S0, [Almgren et al., 2005] estimated θ using two quantities termed the permanent impact I and the realized impact J: where, under the model (20), conditional on (v, T), the joint distribution of (I, J) is Gaussian as in (5), with mean µI,J(θ, T, v) having components T g(v; θ) and (T/2) g(v; θ) + h(v; θ), and covariance matrix ΣI,J(T). A partial derivation of the above can be found in [Almgren et al., 2005], but we also include a full derivation in the Appendix for completeness. Then, in the parameter estimation phase (Section 4.2 of [Almgren et al., 2005]), the metaorder data are used to fit the normalized residuals of (I, J − I/2) via the Gauss-Newton optimization algorithm, since this is a non-linear least squares problem for the mean µ given by (21). This fits exactly into our framework (4), as maximum likelihood estimation (8) for the Gaussian experiment based on (I, J) in (5) is equivalent to solving the following non-linear least squares problem: based on n total order samples (Ii, Ji)i∈[n], executed under (Ti, vi)i∈[n]. This equivalence allows us to connect the parameter estimation method of [Almgren et al., 2005] with the asymptotic theory of statistical estimation in Section 2.2. It follows from Lemma 4, Lemma 1 and Proposition 1 that, if JI,J(θ, T, v) denotes the Jacobian of µI,J(θ, T, v) w.r.t. θ, then the Fisher information for the experiment based on (I, J) is II,J(θ) = E(v,T)∼G_order[JI,J(θ, T, v)ᵀ ΣI,J(T)⁻¹ JI,J(θ, T, v)].
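The equivalence between MLE for the Gaussian experiment based on (I, J) and a non-linear least squares problem can be sketched with scipy. The power-law parameter values below echo the paper's later IBM-style example, but the sampled (v, T) pairs and the noiseless observations are hypothetical simplifications (real data would add Gaussian noise and normalize the residuals):

```python
import numpy as np
from scipy.optimize import least_squares

# Power-law forms g(v) = gamma * v^alpha, h(v) = eta * v^beta; illustrative true values
gamma, eta, alpha, beta = 0.314, 0.142, 0.891, 0.600

rng = np.random.default_rng(4)
v = rng.uniform(0.05, 0.5, 400)   # hypothetical participation rates
T = rng.uniform(0.1, 0.4, 400)    # hypothetical durations

# Noiseless observations of E[I] = T*g(v) and E[J - I/2] = h(v)
I_obs = T * gamma * v**alpha
JmI2_obs = eta * v**beta

def residuals(theta):
    g_, e_, a_, b_ = theta
    return np.concatenate([I_obs - T * g_ * v**a_, JmI2_obs - e_ * v**b_])

fit = least_squares(residuals, x0=[0.2, 0.2, 1.0, 0.5])
```

With noiseless data the global minimum has zero residual, so the fit recovers the four parameters; with noise, this becomes the Gauss-Newton-style procedure described above.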
In a case-by-case fashion, we prove the following in the appendix, so that our results from previous sections can be applied.
Lemma 5 (Regularity Check).Consider the statistical experiment (4) based on Almgren-Chriss model with power law (21) impact and I, J in (23).Under Assumptions 1 and 2, Assumptions 3 and 4 are satisfied provided that the marginal distribution of G order on v has a support with infinite cardinality.
Lemma 5 provides a sufficient condition for Assumptions 3 (3.5 in particular) and 4 to hold, although it is not necessary. Note that the condition on G_order is satisfied by any distribution with a continuous density. By inspecting the proof, one can also establish the result for a discrete distribution G_order with carefully chosen support whose cardinality is larger than the dimension of θ.

Sufficient Statistic from a Discrete Price Trajectory
From Lemma 5, in view of Section 2.2 and the regularity check for the experiment based on (21), we want to see whether we can design experiments with Fisher information larger than II,J. We first investigate this for canonical price trajectory data of the form S_full = {Sti}i∈[N+1] on a given grid {∆t, 2∆t, ..., N∆t = T, Tpost} with ∆t = T/N, where N is fixed. Notably, this is also the setup of [Almgren and Chriss, 2001] for the optimal execution problem with N trading periods.
In this example, again applying the transform Pt = (St − S0)/S0, from (20) we see that the likelihood for the experiment based on S_full, conditional on (v, T), under the Almgren-Chriss model is: for some f1, f2. Using Lemma 2, we see that (Pt1, PT, PTpost) is a sufficient statistic for θ. In fact, (Pt1, PT, PTpost) is sufficient but not minimal sufficient, as (PT, Pt1/t1 − (PTpost − PT)/(Tpost − T)), a non-invertible mapping of (Pt1, PT, PTpost), is also sufficient. For the definition of minimal sufficiency, see e.g. [Keener, 2010] or Theorem 2 below; this observation is, however, somewhat artificial to this example. Consequently, from Lemma 3 we have: The construction of t1 is artificial and tied to the choice of N. However, since N can be arbitrarily large, it is reasonable to expect that the insight from (27) carries over to generic price trajectory data S. Moreover, since the post-order price is I = PTpost, can we then substitute the VWAP cost J with two price points Pt (for some t in the "early" stage) and PT along the price trajectory for a more asymptotically efficient estimate? Indeed, in the next section, we shall develop these insights in several important ways.

Sampling Strategy along the Price Trajectory
Based on the previous insight, given an experiment design based on partial trajectory data S = {Pti}i∈[N] ∪ {PTpost} with tN = T, one might ask: 1. Does sampling more than three points along the trajectory (i.e., N > 2) increase the Fisher information?
2. If sampling more than three points does not improve the asymptotic efficiency, is simply sampling the two points PT and PTpost good enough?
3. If we sample three points, in addition to PT and PTpost, how should we pick the extra point Pt for some t ∈ (0, T)?
To answer these questions, suppose we have the following grid of observations. Writing out the density pθ(S|v, T) as in (26), we have for some f1, f2. Following the same reasoning as in the previous section, we see that where t1 = min{t : Pt ∈ S}. Thus, we have the following corollary.
Corollary 1. For the Almgren-Chriss model and the experiment based on S = {Pti}i∈[N] ∪ {PTpost}, sampling more than three points on the trajectory does not increase the Fisher information: This answers the first question. To answer the second, one can calculate (the derivation is left to the Appendix): Comparing (30) with (25), we note that, in general, neither I_{PT,PTpost}(θ) ⪰ II,J(θ) nor II,J(θ) ⪰ I_{PT,PTpost}(θ) holds. Thus, one cannot definitively say that sampling two points is more efficient than using the VWAP cost J and the post-order price I, or vice versa.
So far, we have shown that sampling the two price points PT and PTpost alone is not optimal, while sampling more than one point along the interior trajectory {Pt}t∈(0,T) is unnecessary. This naturally brings us to the third question above. We summarize the answer in our first main theorem.
Theorem 1. Under assumptions 1-4, suppose we sample (Pt, PT, PTpost). As long as t is chosen early enough that t/T ≤ 1/4, then, under the Almgren-Chriss model (20), we have Moreover, (31) holds for generic forms of g and h and all distributions G_order of (v, T).
Proof. The proof of Theorem 1 is given in Appendix A.3.3.
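Theorem 1 can be checked numerically at a specific parameter point. The sketch below estimates (γ, η) with (α, β) held fixed, uses the piecewise means implied by (20), and builds both covariances from Cov(W_s, W_t) = min(s, t) (the explicit (I, J) covariance written here is our own derivation, not quoted from the paper); it then verifies that the Fisher information difference is positive definite for t/T = 0.1 ≤ 1/4:

```python
import numpy as np

# Hypothetical hyper-parameters in the spirit of the paper's IBM example
sigma, v, T = 0.0157, 0.5, 0.2
T_post = T + 0.077
alpha, beta = 0.891, 0.600     # exponents held fixed; we estimate (gamma, eta)
t = 0.1 * T                    # early sample point, t / T = 0.1 <= 1/4

def mean_path(theta):
    """Mean of (P_t, P_T, P_Tpost): impact g(v)s + h(v) while trading, g(v)T afterwards."""
    g = theta[0] * v**alpha
    h = theta[1] * v**beta
    return np.array([g * t + h, g * T + h, g * T])

def mean_IJ(theta):
    """Mean of (I, J): E[I] = T g(v), E[J] = T g(v)/2 + h(v)."""
    g = theta[0] * v**alpha
    h = theta[1] * v**beta
    return np.array([T * g, T * g / 2 + h])

# Covariances from Cov(W_s, W_t) = min(s, t)
obs_times = np.array([t, T, T_post])
Sigma_path = sigma**2 * np.minimum.outer(obs_times, obs_times)
Sigma_IJ = sigma**2 * np.array([[T_post, T / 2], [T / 2, T / 3]])

def fisher(mean_fn, Sigma, theta0, eps=1e-6):
    """Gaussian Fisher information J^T Sigma^{-1} J via a finite-difference Jacobian."""
    Jac = np.zeros((len(mean_fn(theta0)), len(theta0)))
    for k in range(len(theta0)):
        d = np.zeros(len(theta0)); d[k] = eps
        Jac[:, k] = (mean_fn(theta0 + d) - mean_fn(theta0 - d)) / (2 * eps)
    return Jac.T @ np.linalg.solve(Sigma, Jac)

theta0 = np.array([0.314, 0.142])   # (gamma, eta)
diff = fisher(mean_path, Sigma_path, theta0) - fisher(mean_IJ, Sigma_IJ, theta0)
```

Since the means are linear in (γ, η), the finite-difference Jacobians are exact here; the eigenvalues of `diff` being positive is one instance of the Loewner ordering (31).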
Remark 3. Notably, as also shown in the proof, the ratio 1/4 in Theorem 1 is somewhat tight, in the sense that for t/T > 1/4 there can exist forms of h and g for which (31) no longer holds. This could be of interest in its own right, as the significance of the number 1/4 is not immediately apparent from (20). Theorem 1 suggests that, at least under the Almgren-Chriss dynamics (20), as long as we sample Pt early enough (e.g., any price sampled within the first 25 percent of the filled order), the MLE estimated from (Pt, PT, PTpost) asymptotically outperforms the estimation method using I, J in [Almgren et al., 2005], for any distribution of (v, T), rendering Theorem 1 applicable to metaorders across different magnitudes and to generic forms of h and g (in particular the original power-law models of [Almgren et al., 2005]). Perhaps interestingly, this result suggests that the earlier stages of an order are more informative for calibrating impact models. Finally, we use the data from Sections 4.1-4.2 and Table 3 of [Almgren et al., 2005] to illustrate the level of improvement implied by Theorem 1. More extensive simulation studies verifying Theorem 1 are provided at the end of the section. Example 6. Consider estimating the power-law exponents of (21) in (20). Suppose the underlying true parameters are θ* = (γ*, η*, α*, β*) = (0.314, 0.142, 0.891, 0.600). We set the hyper-parameters of the metaorders to (X, v, T, Tpost, σ) = (0.1, 0.5, 0.2, 0.275, 0.0157). The data are from [Almgren et al., 2005], which can be viewed as one specific instance from G_order, i.e., a point mass. We set t = 0.1 · T and fix (γ*, η*) = (0.314, 0.142), so we only estimate (α, β); this avoids the singularity issues arising from G_order being a point mass. One can then calculate I_{Pt,PT,PTpost}(α*, β*) and II,J(α*, β*), which leads to (32). The closed-form expression for I_{Pt,PT,PTpost}(θ) is given in Appendix A.3.3. In view of (16) and (18), using (Pt, PT, PTpost) over (I, J) implies an asymptotic variance reduction (equivalent to relative efficiency [Van der Vaart, 2000]) of 21% for estimating α (calculated as (4.438 − 3.504)/4.438, or equivalently a 21% increase in sample efficiency), 51% for estimating β, and 18.5% for estimating the average VWAP cost in bps, given by cvwap(α, β) = (T/2)γv^α + ηv^β with ∇α,β cvwap(α*, β*) = [(T/2)γv^{α*} log(v), ηv^{β*} log(v)]. On the other hand, also following Remark 5, given (α*, β*) = (0.891, 0.600), suppose one wants to estimate (γ, η); similarly, one can show a 20.6% variance reduction for estimating γ and 51.5% for η. In the context of [Schied and Schöneborn, 2009], a model of the form (21) with α = β = 1 and a utility u(R) = −exp(−AR) for some A > 0 is assumed (e.g., we let A = 5 below). The optimal adaptive liquidation strategy (note this is a sell program, so x0 = X) is then of the form xt = X exp(−t σ²A/(2η)). One can check that if η is estimated with a 10% error, the optimal liquidation at T would be off-target by around 6.88%, whereas a 5% estimation error on η reduces that to around 3.36%, an improvement of close to 50%.

Non-VWAP Executions
In this section, we use the framework of [Almgren and Chriss, 2001] to briefly discuss estimation for non-VWAP orders under (20). In particular, consider full grid observations S_full = {Sti}i∈[N] and a sequence of trading rates {vi}i∈[N], where rate vi is executed during the interval (ti−1, ti] and (T/N) Σᵢ₌₁ᴺ vi = X. We assume the trading rates and trading periods are all fixed, and we only investigate the sufficiency question here. Thus, under (20) and Pt = (St − S0)/S0, we have Note that {vi}i∈[N] need not take N distinct values (i.e., |{vi}i| < N is allowed), as the same trading rate may prevail over several periods of length ∆t. To account for such strategies, suppose there are N′ distinct (N′ ≤ N) constant trading rates {v′i}i∈[N′] with v′i ≠ v′j for all i ≠ j, where each v′i corresponds to a union of trading intervals over which the rate is v′i. We represent such a union by a sequence of disjoint, non-adjacent intervals Ii = ∪k (t^{s,i}_k, t^{e,i}_k]. For example, if v′1 = 0.5 and the order is executed at a trading rate of 0.5 during (t0, t1], (t1, t2] and (t5, t6], then I1 = (t0, t2] ∪ (t5, t6]. The disjointness and non-adjacency ensure that 1) the representation of Ii is unique and 2) for any t equal to some t^{s,i}_k or t^{e,i}_k in the representation of Ii, either 1. t = 0 or t = T, 2. or the intervals (t − ∆t, t] and (t, t + ∆t] carry two different trading rates.
In other words, if we view the time before t = 0 and after t = T as having zero trading rate, then any t equal to some t^{s,i}_k or t^{e,i}_k in the representation of Ii connects two periods with different trading rates. In this setting, the likelihood can be written as for some f1, f2, fg, fh, where the indicator function 1A(x) = 1 if x ∈ A and 0 otherwise. In view of Lemma 2, a sufficient statistic for (34) consists of the prices Pt at times t that either connect two periods with different trading rates (nodes) or immediately follow such a node. In other words, the union of the Pt's at nodes and at the grid points immediately following nodes constitutes a sufficient statistic for S_full. This seems to suggest that the price points closely following trading rate changes are the most important for estimation. Using this characterization, we can also recover the sufficiency of Pt1, PT and PTpost in the VWAP case (P0 = 0 is known).
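The characterization above can be turned into a small routine that, given per-interval trading rates, returns the grid times whose prices form the sufficient statistic (nodes plus the point immediately after each node). The function name and interface are our own:

```python
def sufficient_sample_times(rates, times):
    """Given trading rate rates[k] on (times[k], times[k+1]], return the grid times whose
    prices form the sufficient statistic: every node t connecting two different trading
    rates (treating the rate as 0 before t = 0 and after t = T), plus the grid point
    immediately following each node."""
    N = len(rates)
    padded = [0.0] + list(rates) + [0.0]   # zero trading rate outside the execution window
    nodes = [k for k in range(N + 1) if padded[k] != padded[k + 1]]
    keep = sorted(set(nodes) | {k + 1 for k in nodes if k + 1 <= N})
    return [times[k] for k in keep]
```

For a constant-rate (VWAP) order this returns t0, t1 and T, matching the sufficiency of P_{t1} and P_T (with P_0 = 0 known) recovered above.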

Simulation
In this section, we verify the results of Corollary 1 and Theorem 1. For a fair comparison, we reproduce, to the best of our ability, the IBM stock simulation results in Table 3 of [Almgren et al., 2005]. Several practical modifications are carried out in [Almgren et al., 2005], so that impact equations (7) and (8) of [Almgren et al., 2005] become: whereas our equation based on (23) gives Here ΘA denotes the shares outstanding, VA the average daily volume, and ΘA/VA the inverse turnover. These concepts are used in [Almgren et al., 2005] to incorporate a liquidity factor into the market impact model, which is not necessary for a single-stock analysis. To avoid confusion, the parameters from Table 3 of [Almgren et al., 2005] that differ from ours are subscripted by A. Their transformation to our model parameters can be constructed as Following Table 3 of [Almgren et al., 2005], the initial hyper-parameters for IBM are V = 6,561,000, Θ = 1,728,000,000, σ = 0.0157, Tpost − T = 0.5/6.5 = 0.077 and T = 0.2.
With these transformations, we can reproduce the values in Table 3 of [Almgren et al., 2005] for IBM using equation (35) and our version of the parameters, up to a negligible error (i.e., no discrepancy up to the fourth digit).

The Propagator Models
As discussed in [Zarinelli et al., 2015], the linear permanent market impact predicted by the Almgren-Chriss model can deviate from the market impact trajectory for non-VWAP executions and fails to reproduce the concavity of the market impact curve [Zarinelli et al., 2015]. The propagator models [Bouchaud et al., 2003, Gatheral and Schied, 2013, Obizhaeva and Wang, 2013], on the other hand, can consistently recover features of the market impact curve, including the square-root law, by allowing a nonlinear instantaneous impact function f(·) and a decay kernel G(·) that captures the transient component of market impact. The word transient indicates that the trade impact is neither permanent nor temporary, but decays over time [Moro et al., 2009, Curato et al., 2017]. As discussed in [Bouchaud et al., 2003, Busseti and Lillo, 2012], such a transient assumption is consistent with real data and is in line with empirical findings on market efficiency and the strong concavity of impact. In the framework of (1), we have Notable functional forms of the instantaneous impact function f and decay kernel G in (36) include: 1. power-law f(v) ∝ v^δ and power-law decay G(s) ∝ s^{−γ}. As we shall see, this formulation recovers the square-root law when δ = γ = 1/2. When δ = 1, this corresponds to transient linear price impact, where the optimal execution strategy can be solved explicitly using a Fredholm integral equation of the first kind [Gatheral et al., 2012]. For non-linear f, the optimal execution problem is harder to solve and generally requires discretization and iterative optimization [Dang, 2017, Curato et al., 2017]. When G(s) = 1_{{0}}(s), this recovers the temporary impact function h of (20) in [Almgren and Chriss, 2001].
2. linear f(v) ∝ v and exponential decay G(s) ∝ e^{−ρs}. This is one of the first propagator models with transient impact, proposed by [Obizhaeva and Wang, 2013]. The original formulation focused on modeling the shape/dynamics of the limit order book (LOB), which was later extended by [Alfonsi et al., 2010] to generally-shaped LOBs. The connection/equivalence between price impact reversion and volume impact reversion is further discussed in [Alfonsi and Schied, 2010, Gatheral et al., 2011].
3. The power-law decay exponent is related to the exponent of the auto-correlation among trade signs studied in [Bouchaud et al., 2003]. The logarithmic impact function has also been supported empirically, fitting the data across different magnitudes better than the square-root function [Zarinelli et al., 2015].
There are many other forms of f and G (e.g., the Gaussian and trigonometric kernels in [Gatheral et al., 2012], as examples where one can solve the Fredholm equation), and propagator models have been extensively studied (e.g., [Gatheral, 2010, Gatheral et al., 2011, Gatheral et al., 2012, Curato et al., 2017, Bouchaud et al., 2003, Bouchaud et al., 2006, Alfonsi and Schied, 2010]) to characterize conditions on the model that prevent price manipulation strategies or guarantee the stability of trades [Gatheral, 2010] in optimal execution. The various forms of f and G are typically modeled separately. As a result, the regularity conditions in assumptions 1-4 need to be checked on a case-by-case basis, although they are satisfied for common propagator models.
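The square-root law mentioned in case 1 can be checked numerically: with f(v) = v^δ and G(s) = s^{−γ}, the peak impact µ(T, v) = f(v) ∫₀ᵀ G(s)ds scales as √(vT) = √X when δ = γ = 1/2. A sketch, assuming unit prefactors:

```python
import numpy as np
from scipy.integrate import quad

def peak_impact(v, T, delta=0.5, gamma_=0.5):
    """Peak impact mu(T, v) = f(v) * int_0^T G(s) ds for the propagator model (36),
    with f(v) = v^delta and G(s) = s^(-gamma), unit prefactors assumed."""
    integral, _ = quad(lambda s: s**(-gamma_), 0.0, T)
    return v**delta * integral

# With delta = gamma = 1/2 the peak impact scales as sqrt(X) = sqrt(v * T):
i1 = peak_impact(v=0.2, T=0.5)
i2 = peak_impact(v=0.4, T=1.0)   # order size X = v*T four times larger
```

Quadrupling the order size X doubles the peak impact, the hallmark of square-root scaling.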

Sufficient Statistics for a Discrete Price Trajectory
In this section, we fix (v, T) by letting G_order = δ_{v,T} be a point mass to simplify the discussion. In the same spirit as Section 3.1, we first study the experiment based on S_full = {S_{t_i}}_{i∈[N]} on a given grid {∆t, 2∆t, ..., N∆t = T}. Note that in this setting we no longer include an S_{T_post} with T_post > T to measure the permanent impact, i.e., the height of the plateau that the peak impact relaxes to, as its determination might be difficult [Zarinelli et al., 2015]. Formally, we need the following characterization of separability before we can state the theorem based on S_full, from a parameter estimation point of view as in Section 3.1.

Definition 2. The forms of f and G in (36) are separable if they are modeled by separate parameters, i.e., there exist parameter spaces Θ_f and Θ_G such that Θ = Θ_f × Θ_G.

Theorem 2. Fix (v, T) and consider (5). With full-grid observations S_full = {S_{t_i}}_{i∈[N]}, for a power-law kernel G(s) ∝ s^{−γ} or G(s) ∝ l_0(l_0 + s)^{−γ}, where γ ∈ Γ ⊆ (0, 1) contains some open set, or an exponential kernel G(s) ∝ e^{−ρs}, where ρ ∈ Γ ⊆ R_+ contains some open set, assume the instantaneous impact f is positive (i.e., f(v; θ) ≠ 0 for v > 0) and is separable from the kernel G. Then the unique sufficient statistic (up to a one-to-one transformation) for the propagator model (36) is the full trajectory data S_full.
Proof. A sufficient statistic is minimal if it can be represented as a function of every other sufficient statistic, i.e., X is minimally sufficient if for every other sufficient statistic Y, there exists some function f such that X = f(Y) (a.e. for all θ). Thus, if we can show S_full is minimal sufficient, then for any sufficient statistic X, we know both that X is a function of S_full (by the definition of a statistic) and that S_full is a function of X (by the minimal sufficiency of S_full), thus proving S_full is the unique sufficient statistic, up to a one-to-one mapping. Under (1), we have µ(t, v; θ) = f(v; θ) ∫_0^t G(s; θ) ds. Then we can write, as in (26), the density in the exponential-family form p(x | θ) = f_1(x) f_2(θ) exp(η(θ) • T(x)) for some f_1, f_2. It is known (see e.g., [Lehmann and Casella, 2006] or Theorem 3.19 of [Keener, 2010]) that, for an exponential family with such a density, where η(θ), T(x) ∈ R^k and η(θ) • T(x) denotes their dot product, the statistic T(x) ∈ R^k is minimal sufficient if (1) {T_i(x)}_{1≤i≤k} is linearly independent and (2) the convex hull of {η(θ)}_{θ∈Θ} has dimension k. The convex hull of a set S is {Σ_{i=1}^n t_i x_i : t_i ≥ 0, x_i ∈ S for all i, Σ_{i=1}^n t_i = 1, n > 0}. An equivalent condition for (2) is: there exist {θ_0, θ_1, ..., θ_k} ⊆ Θ such that {η(θ_i) − η(θ_0)}_{1≤i≤k} is linearly independent. Now, comparing (37) with (38), the components of T are linearly independent, as there are no linear constraints among the price increments S_{t_i} − S_{t_{i−1}}; thus (1) is satisfied. For (2), since f(v; θ) ≠ 0 and is separable from G, it suffices to show the convex hull of {η(θ_G)}_{θ_G∈Θ_G} ⊆ R^N has dimension N for the various kernels G(•) mentioned above, where we redefine η_i(θ_G) = ∫_{t_{i−1}}^{t_i} G(s; θ_G) ds using G only. This is left to Appendix A.3.4. Finally, the minimal sufficiency of T(S_full) = {S_{t_i} − S_{t_{i−1}}}_{i∈[N]} is equivalent to the minimal sufficiency of S_full, since S_0 is given.
Theorem 2 shows that, when fitting a general propagator model, only the entire price trajectory carries all the information; no summary statistic (e.g., the cost of VWAP, or the three points as in Theorem 1 for the Almgren-Chriss model) is sufficient. To see more clearly why this should be the case, consider a simplified example.
Example 7. Suppose we are estimating a one-dimensional parameter θ ∈ Θ ⊆ R from (1) and (v, T) is fixed. We can estimate θ using either the average cost of VWAP (up to a constant factor v) or the entire path S_full. Using Lemma 1, we obtain the Fisher information I_J(θ); using Lemma 1 and the independence of {S_{t_i} − S_{t_{i−1}}}_{i∈[N]} (multivariate Gaussian), we obtain the Fisher information I_{S_full}(θ) (letting ∆t ↓ 0 and assuming the boundary conditions ∂µ_θ(0, v)/∂θ = 0 and h(0; θ, v) = 0). Then lim_{∆t→0} I_{S_full}(θ) ≥ I_J(θ) immediately follows from an inequality in which the second equality follows from the Fubini-Tonelli theorem and the last inequality follows from the Cauchy-Schwarz inequality.
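Example 7 can be checked numerically for the power-law case. Writing h(t) = ∂µ_θ(t, v)/∂θ, the two informations reduce to I_J = 3(∫_0^T h dt)²/(σ²T³) and lim I_{S_full} = (1/σ²)∫_0^T h'(t)² dt (finite for γ < 1/2). The sketch below uses the parameter values of the later numerical study (T = 0.1, v = 0.3, σ = 1, γ = 0.4, δ = 0.6, with θ = δ); the closed-form integrals are our own evaluation of these expressions:

```python
import math

T, v, sigma, gamma, delta = 0.1, 0.3, 1.0, 0.4, 0.6
# h(t) = d mu/d delta = v**delta * ln(v) * t**(1-gamma)/(1-gamma), so h(0) = 0.
k = v ** delta * math.log(v)

# VWAP-cost information: I_J = 3 * (int_0^T h dt)^2 / (sigma^2 * T^3)
int_h = k * T ** (2 - gamma) / ((1 - gamma) * (2 - gamma))
i_vwap = 3 * int_h ** 2 / (sigma ** 2 * T ** 3)

# Full-path limit: I_Sfull = (1/sigma^2) * int_0^T h'(t)^2 dt, with h'(t) = k*t**(-gamma)
i_full = k ** 2 * T ** (1 - 2 * gamma) / ((1 - 2 * gamma) * sigma ** 2)

print(i_vwap, i_full, i_vwap / i_full)  # ~0.702, ~1.078, ~0.651
assert i_full >= i_vwap                 # the Cauchy-Schwarz inequality of Example 7
```

The printed values reproduce the 0.702, 1.078 and 65.1% figures quoted in the numerical study of Section 4.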

Estimation of Impact Function f
In the previous section, Theorem 2 addressed the joint estimation of f and G, and the proof also works for the estimation of G given f. However, Theorem 2 can be adjusted for the estimation of the impact function f given G, which typically can be calibrated by other methods [Bouchaud et al., 2003, Busseti and Lillo, 2012]. As discussed in Section 2.1 of [Curato et al., 2017], one of the "major attractions of the propagator models to practitioners" is that, given G(•) and a large collection of "VWAP-like executions", the expected cost of VWAP can be used to estimate f(v) and reflect the performance of execution algorithms. In this section, based on Theorem 2, we quantify the efficiency of using the VWAP cost to calibrate f, as well as that of using trajectory data, in the limiting case ∆t → 0.
For fixed (T, v), the information matrix based on the VWAP cost alone only has rank 1, so for simplicity we first assume we are calibrating an impact function f(v; θ) characterized by a one-dimensional parameter θ ∈ Θ ⊆ R. As in Remark 1, the general case could depend on the distribution of (T, v) ∼ G_order, and we avoid this dependence to isolate the effect. For concreteness, one can suppose a power-law impact f(v) = v^δ, where we want to estimate δ ∈ Γ ⊆ (0, 1]. Interestingly, although considered unrealistic in certain cases [Gatheral, 2010], it was recently argued in [Jusselin and Rosenbaum, 2020] that power-law impact is the only impact function consistent with the no-arbitrage condition.

Definition 3. Let S_full = {S_{t_i}}_{i∈[N]} be the price trajectory data and let S_partial = {S_t}_{t∈K} be partial price trajectory data on a set of sampling times K. Let Ĵ = Σ_{i=1}^N S_{t_i} ∆t be the discrete VWAP cost, while J = ∫_0^T S_t dt is the true VWAP cost.

Theorem 3. In the same setting as Theorem 2, suppose the decay kernel G(•) is fixed and the impact function f(v; θ) is parameterized by some θ ∈ Γ ⊆ R, where Γ contains some open set. Then the sufficient statistics for estimating f in (36) are given by (42). Moreover, in the limit as ∆t → 0, the corresponding Fisher informations are given by (43)-(45).

Proof. The proof is left to Appendix A.3.5.
As a straightforward consequence of Theorem 3, we can compare the Fisher information for calibrating f using the two points (S_t, S_T) versus using J.

Corollary 2. In the setting of Theorem 3, we have I_{S_t,S_T}(θ) − I_J(θ) ≥ 0 when condition (47) holds.

For calibrating f, using both Theorem 3 and Corollary 2, we can compute optimal sampling points S that maximize I_S, or compare their efficiency with an estimation based on the VWAP cost. In Theorem 1, we saw that, for the Almgren-Chriss model, a sample S_t taken early enough, with t/T ≤ 1/4, can outperform the VWAP-based method. For the propagator model, the comparison between (S_t, S_T) and J depends on the specific decay kernel G, as in (47). However, as we shall see in the following examples, the result for the propagator model bears a resemblance to the result for the Almgren-Chriss model in Theorem 1.

Example 8. For the decay kernel G(s) = s^{−γ} with γ = 0.4 [Bouchaud et al., 2003], based on (47), we have I_{S_t,S_T}(θ) ≥ I_J(θ) when 2.11 • 10^{−4} ≤ t/T ≤ 0.279. For different values of γ, one can check that I_{S_t,S_T}(θ) ≥ I_J(θ) when the condition in Table 3 is satisfied.

Table 3: The range of values of τ = t/T where I_{S_t,S_T}(θ) ≥ I_J(θ).

As shown in Table 3, similar to Theorem 1, the comparison of I_{S_t,S_T}(θ) and I_J(θ) relies only on τ = t/T. In general, a smaller value of τ helps (S_t, S_T) outperform J, as long as it is not unreasonably/unrealistically small (i.e., on the order of 10^{−4} or less for certain γ).

Example 9. For G(s) = e^{−ρs}, the comparison in (47) depends on the specific values of t and T, not just their ratio τ. However, in the case where t, T → ∞ but t/T → τ, we have I_{S_t,S_T}(θ) ≥ I_J(θ) as long as τ satisfies a condition that does not involve ρ. The derivation is left to Appendix A.3.6.
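The boundary values in Example 8 can be reproduced numerically. With f separable and G(s) = s^{−γ}, one has h(t) ∝ t^{1−γ}, so, up to a common factor, the two-point information is h(t)²/t + (h(T)−h(t))²/(T−t) and the VWAP information is 3(∫_0^T h dt)²/T³; both depend on τ = t/T only (set T = 1 below). The grid resolution and normalization are our own choices:

```python
def info_two_points(tau, gamma):
    """Fisher information (up to a common factor) from (S_t, S_T), tau = t/T,
    for power-law kernel G(s) = s**(-gamma); T = sigma = 1."""
    return tau ** (1 - 2 * gamma) + (1 - tau ** (1 - gamma)) ** 2 / (1 - tau)

def info_vwap(gamma):
    """Fisher information (same common factor) from the VWAP cost J."""
    return 3.0 / (2.0 - gamma) ** 2

gamma = 0.4
# Scan tau to recover the interval where the two-point design beats VWAP.
taus = [i / 100000 for i in range(1, 100000)]
good = [t for t in taus if info_two_points(t, gamma) >= info_vwap(gamma)]
print(min(good), max(good))  # approximately 2.11e-4 and 0.279, as in Example 8
```

Repeating the scan for other γ values reproduces the corresponding rows of Table 3.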

Sampling Strategy for Propagator Models: A Numerical Study
Theorem 2 shows that the full price trajectory S_full is the only sufficient statistic for a general propagator model. However, it does not specify how much information is lost when we only have partial observations S, nor does it suggest a sampling strategy that consistently outperforms the cost-of-VWAP-based estimation, as in Section 3.2. Unfortunately, unlike in Theorem 1, the optimal sampling strategy for (36) depends on the specific values of t and T (not just their ratios) and on the specific forms of G and f. Thus, in this section, we aim to gain some insights into questions (1) and (2) through numerical examples. We conduct numerical studies to compare different sampling strategies against VWAP-cost-based estimation methods, in the flavor of Section 3.2.

Example 1: Power Law
In the first example, we take the power-law kernel G(s) = s^{−γ} with γ = 0.4 [Bouchaud et al., 2003, Busseti and Lillo, 2012] and power-law impact f(v) = v^δ with δ = 0.6 [Almgren et al., 2005] (hence µ(v, t; θ) = v^δ t^{1−γ}/(1 − γ)). We shall compare the asymptotic variance for calibrating δ under different estimation methods. In this example, we assume T = 0.1, v = 0.3, σ = 1, γ = 0.4 and δ = 0.6. Assuming we are only estimating δ (i.e., all other parameters are given), then, as computed in (40), as ∆t ↓ 0, one can check that the entry for δ in the Fisher information matrix I_{S_full} is 1.078. We compare estimation methods based on the VWAP cost with methods based on price points S_{t_i} sampled from the trajectory. Again, as in (39), the Fisher information based on J is 0.702, which recovers just over 65.1% of the full information for estimating δ (i.e., 0.651 ≈ 0.702/1.078), implying that the asymptotic variance for estimating δ using J is approximately 1.536 ≈ 1/0.651 times larger than that using S_full. Next, based on the results of Theorem 1, we pick the representative sampling times t_1 = T/8, t_2 = T/4, t_3 = 5T/8 in addition to T and compare their Fisher information ratios [I_S/I_{S_full}]_{δ,δ}. The results are summarized in Table 4. The entries in bold indicate that the method performs better for estimating δ than using J (i.e., a ratio greater than 0.651). As expected, we see that adding more points always increases the accuracy of estimation, no matter which points are already included. In this example, similar to Theorem 1, the inclusion of the earliest sample point S_{t_1} brings the most improvement (just the two points S_{t_1} and S_T can outperform J), and any combination of three points can outperform the VWAP cost J. The method based on four points recovers about 70% of I_{S_full}. However, we note that I_{S_full} here is computed in the ideal, limiting case where ∆t ↓ 0 and N → ∞ (the number of observations approaches infinity). Moreover, this depends on the particular choice of points. For example, if we let t_1 = T/4, t_2 = T/2, t_3 = 3T/4 and t_4 = T, then the ratio [I_S/I_{S_full}]_{δ,δ} = 0.662, whereas if we set t_1 = T/20, t_2 = T/10, t_3 = T/3 and t_4 = T, the ratio is 0.749 (here we again see the improvement from including earlier sample points).
We again verify our results by simulation; the results are summarized in Table 5. In the simulation study, we use settings similar to the theoretical analysis above, with the power-law kernel G(s) = s^{−γ} and power-law impact f(v) = v^δ. We set T = 0.1, v = 0.3, σ = 0.2, γ = 0.4 and δ = 0.6. The diffusion process is discretized into 1024 time bins. The data-generating model and the estimation model have the same parametric form, and only δ is estimated. The first model is based on the average cost of VWAP as in (39); the other models rely on the price trajectory as discussed in Section 3.2, with a subset or all of the points at t_1 = T/8, t_2 = T/2, t_3 = 5T/8 and t_4 = T. Each estimation uses 300 samples, and the simulation is repeated 40,000 times. The performance is evaluated using the squared error between the true cost and the cost implied by the estimated δ, with parentheses showing the standard error of the mean error. Bold numbers in Table 5 indicate that the method performs better than using J, according to a standard two-sample t-test. The conclusion agrees with the theoretical analysis: the cost-based method is inferior to the price-trajectory-based methods if earlier trajectory points are selected (for example, t_1 or t_2). In addition, the more trajectory points are chosen, or the earlier the points are selected, the better the performance.

Table 5: Numerical comparison of models with power-law impact using simulation.
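A condensed version of this Monte Carlo study can be sketched as follows (coarser grid and fewer repetitions than the paper's 1024 bins and 40,000 runs, purely for illustration; estimator constructions are our own). The two-point estimator exploits the fact that the model mean is linear in c = v^δ, so its weighted-least-squares solution is closed form:

```python
import math, random

random.seed(7)
T, v, sigma, gamma, true_delta = 0.1, 0.3, 0.2, 0.4, 0.6
N = 64                                  # time bins per path (illustrative)
dt = T / N
grid = [(i + 1) * dt for i in range(N)]
a = [t ** (1 - gamma) / (1 - gamma) for t in grid]  # mu(t, v) = v**delta * a(t)

def simulate(n_paths):
    """Discretized trajectories of S_t - S_0 = v**delta * a(t) + sigma * W_t."""
    out = []
    for _ in range(n_paths):
        w, path = 0.0, []
        for i in range(N):
            w += random.gauss(0.0, sigma * math.sqrt(dt))
            path.append(v ** true_delta * a[i] + w)
        out.append(path)
    return out

def delta_from_vwap(paths):
    """Moment estimator matching the sample-mean VWAP cost to v**delta * mean(a)."""
    jbar = sum(sum(p) / N for p in paths) / len(paths)
    return math.log(jbar / (sum(a) / N)) / math.log(v)

def delta_from_points(paths, i1):
    """Gaussian MLE of delta from (S_{t1}, S_T): the mean is linear in c = v**delta,
    so the weighted-least-squares solution for c is closed form."""
    t1, a1, aT = grid[i1], a[i1], a[-1]
    x1 = sum(p[i1] for p in paths) / len(paths)
    x2 = sum(p[-1] - p[i1] for p in paths) / len(paths)
    num = x1 * a1 / t1 + x2 * (aT - a1) / (T - t1)
    den = a1 ** 2 / t1 + (aT - a1) ** 2 / (T - t1)
    return math.log(num / den) / math.log(v)

reps, i1 = 100, N // 8 - 1              # early sample at t1 = T/8
err_v, err_p = [], []
for _ in range(reps):
    paths = simulate(50)
    err_v.append((delta_from_vwap(paths) - true_delta) ** 2)
    err_p.append((delta_from_points(paths, i1) - true_delta) ** 2)
print(sum(err_v) / reps, sum(err_p) / reps)  # MSE of delta: VWAP vs (S_{T/8}, S_T)
```

Both estimators are approximately unbiased here; with the paper's larger sample sizes, the ordering of the mean squared errors matches Table 5.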

Example 2: Exponential Kernel
Next, we consider the setting of [Obizhaeva and Wang, 2013], where f(v) ∝ v and G(s) ∝ e^{−ρs}, so that µ(t, v; θ) = (cv/ρ)(1 − e^{−ρt}) for some c. For this example, we assume T = 0.1, v = 0.3, σ = 1, c = 0.01 and ρ = 0.15. In this example, we estimate c and ρ jointly, and we can compute I_{S_full} (again letting ∆t ↓ 0). For joint estimation, we no longer simply compare the ratio of specific entries of I_S against I_{S_full} as in the last example. Instead, we compute the matrix 2-norm (i.e., the spectral norm) of I_{S_full}^{1/2} I_X^{−1} I_{S_full}^{1/2}, which, in view of (18), implies that the asymptotic variance for estimating any function of θ using X cannot be more than (1 + ε) times the asymptotic variance estimated using S_full (i.e., a smaller value of the norm indicates a more accurate estimate; the optimal estimation based on S_full attains the value 1). Moreover, as in the last example, we also compare with the VWAP cost J on the asymptotic variance for estimating c (i.e., calibrating the impact function f); here, as before, we use the ratio [I_J/I_{S_full}]_{c,c} = 0.754. We do not compute the joint estimation of (c, ρ) under the VWAP cost J, as the Fisher information based on J alone has rank one. Again, we pick the representative sampling times t_1 = T/8, t_2 = T/4, t_3 = 5T/8 in addition to T. The results are summarized in Table 6.

Table 6: Comparison of Fisher information for calibrating the model of [Obizhaeva and Wang, 2013]. The [•]_{c,c} row reads: (S_{t_1}, S_T): 1−1.258e-5; (S_{t_2}, S_T): 1−8.184e-6; (S_{t_3}, S_T): 1−5.590e-6; (S_{t_1}, S_{t_2}, S_T): 1−7.962e-6; (S_{t_1}, S_{t_3}, S_T): 1−3.376e-6; (S_{t_2}, S_{t_3}, S_T): 1−2.274e-6; (S_{t_1}, S_{t_2}, S_{t_3}, S_T): 1−2.051e-6.
The numerical results in Table 6 are in line with the findings from the first example: adding more points always increases the accuracy of estimation. However, for this model, the inclusion of the earlier point S_{t_1} does not improve the estimation as much as the inclusion of the later point S_{t_3}. This is due to the fact that we are not just calibrating the impact function f, but also the decay kernel G. As we can see, in this example the asymptotic variance of estimation (for any function of θ) using the four-point method is within 12.2% of the optimal one. On the other hand, for calibrating the impact function f, all sampling-based methods in the table are close to optimal (i.e., close to 1) and considerably outperform J (i.e., 75.4%).

Remark 4. Although we have shown that the full price trajectory data S_full is most efficient and that adding more price points S_t increases the efficiency, there seems to be, based on the previous examples, a "diminishing return" effect in adding price points, as the gain in efficiency tends to decrease with each addition. However, the quantification of such an effect, or the characterization of efficiency for n given points, requires further research.
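The 0.754 ratio quoted above can be reproduced in a few lines. For the exponential kernel, the full-path information for c and the VWAP-cost information for c have closed forms (our own evaluation of the limiting expressions, with the parameter values of this example):

```python
import math

T, v, sigma, c, rho = 0.1, 0.3, 1.0, 0.01, 0.15

# Full-trajectory information for c: mu(t) = (c*v/rho)*(1 - exp(-rho*t)), so
# d(mu')/dc = v*exp(-rho*t) and [I_Sfull]_cc = (1/sigma^2) int_0^T (v*e^{-rho t})^2 dt.
i_full_cc = v ** 2 * (1 - math.exp(-2 * rho * T)) / (2 * rho * sigma ** 2)

# VWAP-cost information for c: J = (1/T) int_0^T S_t dt has Var(J) = sigma^2*T/3 and
# dE[J]/dc = (v/(rho*T)) * (T - (1 - exp(-rho*T))/rho).
dEJ_dc = (v / (rho * T)) * (T - (1 - math.exp(-rho * T)) / rho)
i_vwap_cc = dEJ_dc ** 2 / (sigma ** 2 * T / 3)

print(i_vwap_cc / i_full_cc)  # ~0.754, the ratio quoted for this example
```

The small ρT = 0.015 makes the kernel nearly flat over [0, T], which is why the VWAP cost retains as much as 75% of the full-path information for c here.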

Square-root Law: A Special Case
A well-known, widely-used rule of thumb to produce a pre-trade estimate of cost is the square-root law. As stated in [Tóth et al., 2011], the square-root law provides a remarkably good fit to empirical data (although it is argued in [Zarinelli et al., 2015] that the logarithmic function provides a better fit across more orders of magnitude than the square-root function). It is suggested that the square-root law remains a robust statistical phenomenon across a spectrum of traded instruments/markets and is roughly independent of trading period, order type, trade duration/rate, and stock capitalization [Zarinelli et al., 2015]. In particular, under the framework (1), the square-root law can be stated in terms of the impact, as in (48) [Zarinelli et al., 2015], or the average cost per share, as in (49) [Gatheral, 2010]. The original statement uses X/V_D, but we have scaled X by V_D (i.e., V_D = 1; see the remarks following (1)). As noted in [Zarinelli et al., 2015], under (48) the market impact no longer depends on the specific trading rate v or trade duration T, but only on their product X = vT. This phenomenon can be seen as a special case of the propagator model under the power-law impact f(v) ∝ v^δ and power-law decay G(s) = s^{−γ}, with the constraint that δ + γ = 1 (see [Gatheral, 2010]). Under such a model, (48) is consistent with (50) and (49) is consistent with (51) when δ = γ = 1/2, as a special case of γ + δ = 1. From an estimation point of view, if we want to calibrate such a propagator model, where the impact only depends on the product vt, by estimating δ under the constraint γ + δ = 1, we can compare the estimation method using the sampled points (S_t, S_T) based on (50) with the method using the cost of VWAP J = v ∫_0^T S_t dt − XS_0 based on (51). Leveraging techniques from the previous sections, we obtain the following Corollary. Again we observe the power of an early sample, around t/T ≈ 20%. Yet, just as in Table 3, sampling too early is not sufficiently helpful.

Limitations
In this section, we discuss some limitations of the aforementioned theorems and their practical implications. First, the theorems in Section 3 rely on Assumption 1, which states that the stochastic law (or the underlying "true" structure) governing the process of market impact can be correctly specified within the parameterized model. It is natural to ask, when this assumption fails, to what extent the results of the previous theorems still hold and what types of adjustments need to be made in quantifying the "efficiency" of estimation.

Model Misspecification
Model misspecification occurs when the "true" underlying process of market impact cannot be correctly specified by (1) (or a specific variant of (1), e.g., the Almgren-Chriss model or the propagator model). In other words, given (T, v) ∼ G_order, the joint distribution of the discretized price trajectory S ∈ R^N follows a distribution F*; however, there does not exist θ ∈ Θ such that F(θ) equals F*, where F(θ) is characterized by S_t = S_0 + µ_θ(t, v) + σW_t for all t ∈ [0, T] and (T, v) ∼ G_order. In this setting, under regularity conditions, it is shown in [White, 1982, Bishwal, 2007] that the MLE (or QMLE, the quasi-maximum likelihood estimator in this setting [White, 1982]) θ̂_MLE = arg max_{θ∈Θ} Σ_j l(S_j | θ) estimates, with consistency and asymptotic normality, the parameter θ*_KL that minimizes the Kullback-Leibler (KL) divergence. As a widely used statistical distance, the KL divergence (also known as relative entropy) D_KL(P ‖ Q) between two probability measures P and Q is defined through the Radon-Nikodym derivative dP/dQ of P w.r.t. Q. In other words, under model misspecification, the MLE estimates the parameters whose corresponding price trajectory distribution is closest to the true one, in terms of the statistical distance measured by the KL divergence. Since F* is unknown, to gain some intuition about the properties of θ*_KL, we consider the following idealized example, where we observe the entire continuous path {S_t}_{0≤t≤T}.

Example 10. Given (v, T), suppose the true price process {S_t}_{0≤t≤T} is governed by a diffusion process with drift µ*(•) and volatility σ*(•) which cannot be described by (1) for any θ ∈ Θ. Then, using the Girsanov theorem, [McKeague, 1984] showed that the log-likelihood function of the process {S_t}_{0≤t≤T}, as an element of C[0, T] (the space of continuous functions on [0, T]), can be written explicitly. Given sufficient metaorder data, the sum of the log-likelihoods can be maximized to obtain an optimal value θ̂_MLE (if it exists), which can be seen as an estimator of the corresponding population minimizer θ*.
Thus, in this example, under model misspecification, θ* corresponds to the model whose drift term (i.e., ∂µ_θ(t, v)/∂t, the gradient of the impact function µ_θ(t, v) w.r.t. t), on average, best matches the true drift term µ*(S_t; t, v) of S_t in mean squared error over [0, T]. This example suggests that the MLE from (1) seeks to recover an impact function that best describes the shape of the impact by matching gradients (or how the gradient diminishes, given the concavity of the impact function [Zarinelli et al., 2015]), rather than the direct values of the impact function.
More importantly, how does model misspecification affect our discussion, largely based on the Fisher information, of the (asymptotic) "efficiency" of the MLE? When the model is correctly specified, as mentioned in (16), the asymptotic covariance matrix of θ̂_MLE, scaling with 1/√n (where n is the number of metaorders), is the inverse of the Fisher information matrix evaluated at the true model parameter θ*, which can be equivalently defined in Hessian form, [B(θ*)]_{ij} = −E[∂²l(θ|S)/∂θ_i ∂θ_j], by the information matrix equivalence theorem [White, 1982], as in (54). Note that both expectations above are taken w.r.t. the true probability measure (E_{θ*} when the model is correctly specified; the measure is generally unknown under model misspecification). However, as shown in [White, 1982], under model misspecification, (54) no longer holds, and the (scaled) asymptotic variance as θ̂_MLE approaches θ*_KL becomes the "sandwich" combination of the Hessian-form matrix B(θ*_KL) and the score-covariance matrix A(θ*_KL). Moreover, [White, 1982] shows that the discrepancy between the empirical estimates of A(θ*_KL) and B(θ*_KL) can be used to test model misspecification. This implies that the conclusions drawn from our previous theorems no longer hold, as they rely on comparisons of the Fisher information of correctly specified models. In this case, it becomes less clear how to choose appropriate models or assess the estimation efficiency of the MLE through experiment design, as the theoretical properties of A(θ*_KL) and B(θ*_KL) depend on the unknown, true distribution of the price trajectory.
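A toy illustration (not the paper's model) of why the sandwich combination of the Hessian-form and outer-product-form matrices, rather than the inverse of a single information matrix, is the right variance under misspecification: fit the mean of Gaussian data under a deliberately wrong noise scale. The variable names below are generic to avoid clashing with the text's A/B notation:

```python
import math, random

random.seed(0)
n = 20000
true_sigma, model_sigma = 2.0, 1.0     # the model understates the noise scale
data = [random.gauss(0.0, true_sigma) for _ in range(n)]

# QMLE of the mean under the (misspecified) N(theta, model_sigma^2) model:
theta_hat = sum(data) / n

# Per-observation Hessian-form and outer-product-form information:
# score(theta|x) = (x - theta)/model_sigma^2, and -hessian = 1/model_sigma^2.
hess_form = 1.0 / model_sigma ** 2
outer_form = sum(((x - theta_hat) / model_sigma ** 2) ** 2 for x in data) / n

naive_var = 1.0 / hess_form                   # = 1.0, wrong under misspecification
sandwich_var = outer_form / hess_form ** 2    # ~ true_sigma**2 = 4.0, the correct scale
print(naive_var, sandwich_var)
```

The QMLE here is still the sample mean, whose true (scaled) variance is true_sigma² = 4; only the sandwich form recovers it, and the gap between the two forms is exactly the misspecification signal exploited by White's test.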
Thus, the presence of model misspecification poses a limitation on the results of the previous theorems. To better discuss the extent of such limitations and possible remedies, we revisit relevant concepts of model selection from statistical learning/machine learning.

Simulation
In this section, we perform a simulation study for the model misspecification case. The estimation model follows the power-law kernel G(s) = s^{−γ} and power-law impact f, where δ is the only parameter to be estimated. However, the sample-generating model does not have the same parametric form: it has G(s) = s^{−γ} and f*(v) = 1.5 ln(1 + 0.5v² + 0.7v), so µ*(t, v) = 1.5 ln(1 + 0.5v² + 0.7v) · t^{1−γ}/(1 − γ). We set σ = 0.2 and γ = 0.4. Each estimation uses 300 samples, and each sample trajectory is assigned a random length T ∼ Uniform(0.1, 0.15) and a random v ∼ Uniform(0.3, 0.4). The simulation is repeated 40,000 times for each scenario, and the diffusion process is discretized into 1024 time bins. The first model is based on the average cost of VWAP as in (39); the other models rely on the price trajectory as discussed in Section 3.2, with a subset or all of the points at t_1 = T/8, t_2 = T/2, t_3 = 5T/8 and t_4 = T, relative to the corresponding path length T. The error of estimation is the mean squared error of the cost across all sample paths. Results are shown in Table 7, where parentheses show the standard error of the mean error; J denotes the model based on the average cost of VWAP, and the others are based on points along the price trajectory. Bold numbers indicate that the method performs statistically significantly better than using J, based on two-sample t-tests. The conclusion is similar to that for the well-specified model in Section 4.3 and the simulation study therein: the cost-based method is worse than the price-trajectory-based methods if earlier trajectory points are selected (for example, t_1 or t_2), and the more trajectory points are chosen, or the earlier the points are selected, the better the performance.

Model Selection
As in (52), given (T, v) ∼ G_order, we use F* to denote the true distribution of the price trajectory S and F(θ) the one characterized by S_t = S_0 + µ_θ(t, v) + σW_t for S_t ∈ S and (T, v) ∼ G_order. As discussed in the previous section, we can measure the risk by the discrepancy between the distribution fitted by the model and the truth. The model selection problem then decomposes the risk into an approximation error and an estimation error. In a correctly specified model, θ*_KL = θ* and the approximation error is 0; the estimation error is directly related to the efficiency of estimating θ, and the results of our previous theorems, based on the Fisher information matrix, can be applied. When model misspecification occurs, different factors/concerns come into play and the landscape is less clear. For example:

• If one is concerned with an accurate estimation of the price trajectory for various values of (T, v), then estimation using the price trajectory increases the effective sample size compared to using summary statistics (e.g., VWAP), which can have a positive effect in reducing the estimation error.
On the other hand, depending on how the distribution is misspecified (e.g., heavy tail versus light tail), the method using summary statistics could turn out to be more robust than using the entire trajectory. In this case, the previous comparisons between different summary statistics (VWAP or sampled price points) may or may not hold.
• If one is only concerned with an accurate estimation of a summary statistic, e.g., VWAP, then this can lead to a reduced approximation error, because the KL divergence monotonically decreases under non-invertible transformations. (It is well known that the KL divergence is invariant under invertible transformations, i.e., given random variables X, Y and an invertible transformation g(•), we have D_KL(X ‖ Y) = D_KL(g(X) ‖ g(Y)); however, D_KL(X ‖ Y) ≥ D_KL(g(X) ‖ g(Y)) when g is non-invertible (see [Mena et al., 2018]), and the transformation from the price trajectory S to a summary statistic is typically non-invertible.) However, it could also lead to an increase in estimation error due to the reduced sample size or inefficiency of estimation.
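The monotonicity of the KL divergence under a non-invertible summary can be seen in a toy discrete example (the distributions below are arbitrary illustrations, not market data):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (lists of probabilities)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

# Non-invertible map g merging outcomes 1 and 2 -- a "summary statistic":
gP = [P[0], P[1] + P[2]]
gQ = [Q[0], Q[1] + Q[2]]

assert kl(gP, gQ) <= kl(P, Q)   # summarizing can only lose divergence
print(kl(P, Q), kl(gP, gQ))
```

This is the data-processing inequality: matching a model to the distribution of a summary is a weaker (hence more achievable) requirement than matching the full trajectory distribution, which is why the approximation error can shrink.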
In practical implementation, we should take these issues into consideration. Typical remedies include cross-validation across different methods (or hybrids of them) under the specified loss function, or Bayesian inference based on a carefully calibrated prior from previous data (or even online data, where one can update the posterior as the order is executed and adjust the execution accordingly). This is left for discussion and future research.

A.2.1 The Proof of Lemma 3
Proof of Lemma 3. The rigorous proof can be found in Theorem 7.2 of [Ibragimov and Has'minskii, 2013]; it can be checked that the assumptions of that theorem are all satisfied. For a more concise proof, similar to the proof in [Zamir, 1998], we can use Lemma 6 to obtain (56). Since φ(X) is a statistic of X, it is clear that the conditional distribution of φ(X) given X is independent of θ, implying that I_{φ(X)|X}(θ) = 0. Since I_{X|φ(X)}(θ) ⪰ 0 (any Fisher information is positive semi-definite, as it is the covariance matrix of the score function), we have shown that I_X(θ) ⪰ I_{φ(X)}(θ). Now, if φ(X) is a sufficient statistic, then the conditional distribution of X given φ(X) is independent of θ, which leads to I_{X|φ(X)}(θ) = 0 and the equality I_X(θ) = I_{φ(X)}(θ). On the other hand, if I_X(θ) = I_{φ(X)}(θ), we must have E_{φ(X)}[I_{X|φ(X)=φ(s)}(θ)] = 0, but this does not necessarily imply that φ(X) is a sufficient statistic; an interesting counter-example where an insufficient statistic preserves Fisher information can be found in [Kagan and Shepp, 2005]. To complete the proof, an additional condition on the continuous differentiability of the density is enough to establish the sufficiency of φ(X); the detailed proof is again given in [Kagan and Shepp, 2005].

A.3 The Joint Distribution of (I, J) in the Almgren-Chriss Model

In this derivation, we fix (v, T) as constants and derive the distribution of (I, J) conditional on (v, T). Notice that, after the affine transformation P_t = (S_t − S_0)/S_0, we have P_t = g(v)t + h(v) + σW_t when t ≤ T, and P_t = g(v)T + σW_t when t > T.
Following the notation in [Almgren et al., 2005] and using Itô's lemma, we have Var(∫_0^T W_t dt) = T³/3 and Var(W_{T_post}) = T_post. Thus, it can be checked that Corr(∫_0^T W_t dt, W_{T_post}) = √(3T/(4T_post)). This characterizes the joint distribution of I and J completely, equivalently to equation (2) of [Almgren et al., 2005], by noting that there exist two independent standard normal random variables ξ_1, ξ_2 representing the pair, which can then be shown to be equivalent to (23).
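The correlation value Corr(∫_0^T W_t dt, W_{T_post}) = √(3T/(4T_post)) (it follows from Cov = T²/2 together with the two variances above) can be checked by direct simulation; the discretization parameters below are illustrative:

```python
import math, random

random.seed(1)
T, T_post = 1.0, 2.0
N = 100                      # steps over [0, T]; same step size is used beyond T
dt = T / N
total = int(round(T_post / dt))
reps = 5000

ints, ends = [], []
for _ in range(reps):
    w, integral = 0.0, 0.0
    for i in range(total):
        w += random.gauss(0.0, math.sqrt(dt))
        if i < N:
            integral += w * dt   # Riemann sum for int_0^T W_t dt
    ints.append(integral)
    ends.append(w)               # W_{T_post}

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((a - my) ** 2 for a in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

print(corr(ints, ends), math.sqrt(3 * T / (4 * T_post)))  # both ~0.61
```

With T_post = T the formula gives √3/2 ≈ 0.866, and the correlation decays like T_post^{-1/2} as the post-trade horizon grows.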
In order to show I_U(θ) ⪯ I_{I,J}(θ) for generic forms of h and g, it suffices to show the corresponding matrix inequality. The expression for I_J(θ) can be calculated directly using Proposition 1, noting by the Fubini theorem that ∫_0^T ∫_0^t G(s) ds dt = ∫_0^T G(t)(T − t) dt. Finally, to prove (45), we write S_partial as S for ease of notation. It follows from Lemma 4 that the stated identity holds for 1 ≤ i ≤ N_K; since ∆S_{K_i} is independent of ∆S_{K_j} for i ≠ j, (80) becomes a sum over increments. On the other hand, fix 1 ≤ i ≤ N_K and define T_i = t_{K_i} − t_{K_{i−1}}, ∆_i = T_i/N, and the refined-grid increments ∆S_{ij}; defining ∆G_{ij} similarly, one can show, as in (77), that the corresponding limits hold as N → ∞. Now, fixing 1 ≤ i ≤ n, as in (79) and based on Lemma 6, we calculate the joint distribution, which in turn gives the conditional distribution.

A.3.7 Proof of Corollary 3
Under the setting of Corollary 3, we have the stated expression for some c; similarly, for S_t and S_T, we have the corresponding expressions.

Proof of Lemma 4. The first claim follows directly from Lemma 6, since I_{Y|X=x}(θ) ⪰ 0 and E_X[I_{Y|X=x}(θ)] ⪰ 0. For the second claim, notice by Lemma 6 that I_Y(θ) + E_Y[I_{X|Y=y}(θ)] = I_{X,Y}(θ) = I_X(θ) + E_X[I_{Y|X=x}(θ)]. Since Y = φ(X) and φ(•) is a bijective mapping, it is easy to check that E_X[I_{Y|X=x}(θ)] = E_Y[I_{X|Y=y}(θ)] = 0, which implies I_X(θ) = I_Y(θ).

Table 4: Comparison of Fisher information for calibrating power-law impact. Columns: (S_{t_1}, S_T); (S_{t_2}, S_T); (S_{t_3}, S_T); (S_{t_1}, S_{t_2}, S_T); (S_{t_1}, S_{t_3}, S_T); (S_{t_2}, S_{t_3}, S_T); (S_{t_1}, S_{t_2}, S_{t_3}, S_T).

Table 7: Numerical comparison of models with misspecification.