Efficiency of judicial systems: model definition and output estimation

Focusing on the Italian judicial system as our case study, we use Data Envelopment Analysis to estimate technical efficiency scores and reference values for policy makers. In detail, this work presents a comparative analysis of different model definitions to identify the most appropriate one, emphasizing the key role of case matters in this production process. According to our results, the North of Italy emerges as more efficient than the other Italian macro areas, although the gap significantly decreases when case matters are considered in the output estimation. The reference values that policy makers might adopt to reform the judicial system also differ significantly across model definitions, to an extent that could affect the reorganization of courts. Taking the proposed case study into account, it seems that improvements in court performance could be achieved by reforming civil procedures, which are the technologies applied by judges in their production process.
ARTICLE HISTORY: Received 20 March 2020; Accepted 27 May 2020

interesting case study. 1 Moreover, the judiciary is regularly mentioned in debates around the Italian economy, with a view to determining whether the nation's current economic difficulties are related to international trends or to structural problems in the institutions, such as, for example, the judiciary (Lanau, Esposito, & Pompe, 2014). Without entering into the Italian debate, this work tries to shed new light on the estimation of judicial efficiency, by identifying the most appropriate model definition and by offering policy makers some additional insights. On the one hand, we emphasize the need to analyze courts according to their different production lines (i.e., case matters) and the related technologies applied by judges (i.e., civil procedures). On the other hand, we try to understand whether the composition of the demand for justice can affect the benchmarking analysis and potential reference values to be used by policy makers in the reform process. These are precisely the goals of this study, that is to say, to identify the most appropriate model definition for the estimation of judicial efficiency and to establish whether an incorrect approach can have a significant impact on the policy makers' decision-making process. Moreover, focusing on the specific case study, our results might point to the need to reform the technologies applied to the production lines of this key sector (i.e., Italian civil procedures).
The paper is organized as follows. Section 2 offers a review of the current literature on judicial efficiency and court productivity, highlighting the model definitions proposed and inputs/outputs adopted. Section 3 introduces the implemented methodology (i.e., Data Envelopment Analysis) and some data regarding the case study (i.e., Italian civil justice in 2011). Section 4 illustrates the main results of the comparative analysis, presenting the estimated technical efficiency scores and potential reference values. Finally, some conclusions and policy implications are discussed in Section 5.
This work proposes Data Envelopment Analysis (DEA) to measure judicial efficiency, estimating a technical efficiency score for every judicial district. DEA has been successfully adopted in judicial analysis, both in its one-stage form (e.g., Kittelsen & Førsund, 1992; Pedraja Chaparro & Salinas-Jimenez, 1996; Santos & Amado, 2014) and in its two-stage form (Deyneli, 2012; Schneider, 2005). 2 Even though this technique is widely accepted and used by academia to analyse the judiciary, a key open question remains: which are the most appropriate inputs and outputs of the justice production process?
This is a critical issue since, depending on the model definition, policy makers might use different reference values to implement structural reforms of the national judicial system. For example, the last main reform of the Italian judicial system, which was aimed at redefining the territorial competence of the courts (i.e., reform of Italy's judicial geography), was based on national reference values (Ippoliti, 2015). Obviously, if the model definition is incorrect, policy makers might be misled by the results obtained, adopt imprecise reference values, and ultimately introduce inappropriate reforms. For this reason, input selection and output definition are crucial and, considering the current heterogeneity in the literature, there is a great need to shed new light on this issue by identifying the most appropriate model definition. Table 1 presents a review of the current literature, showing the inputs and outputs adopted, as well as the judicial systems analysed and the mathematical programming techniques used. As readers can observe in the table, the number of settled cases is identified as the main output, although it is presented as an aggregate measure. Only a few studies have tried to adopt a more precise output measure by disaggregating the supply of justice according to case matters (i.e., Kittelsen & Førsund, 1992; Santos & Amado, 2014). At the same time, even greater heterogeneity can be observed when inputs are considered. Some authors have exclusively used judges and staff as inputs (e.g., Deyneli, 2012; Pedraja Chaparro & Salinas-Jimenez, 1996), while other researchers have also included pending and/or incoming cases (e.g., Falavigna et al., 2015; Finocchiaro Castro & Guccio, 2014; Ippoliti, 2015; Ippoliti & Vatiero, 2014; Schneider, 2005), suggesting that the demand for justice might affect court productivity. Therefore, there is no common and clear model definition to estimate judicial efficiency.
However, from a general point of view, we cannot treat in the same manner factors that can be regarded as actual inputs (e.g., judges or staff), which are under the control of Decision Making Units (DMUs), and factors beyond the control of DMUs (e.g., demand for justice). The production function represents the technical relationship between chosen inputs and outputs, while the other factors can affect it parametrically or through non-parametric shifting factors. This is the main reason for adopting a two-stage analysis or other techniques aiming to bypass influences not directly depending on DMUs (i.e., environmental variables). 3 A first attempt to investigate this relevant issue is made by Finocchiaro Castro and Guccio (2015), who regard the caseload as a non-discretionary input related to the environment in which the courts operate. In this way, they distinguish between managerial inefficiency and inefficiency due to uncontrollable inputs (i.e., pending and incoming cases). Might backlog affect the production process?
The work of the judiciary can be considered a case of service production (supply of justice), in which production transforms each of the items that enter the process (demand for justice). Each client (i.e., a person or a firm, as well as their lawyers) starts with a case that requires a decision, and the number of clients entering the transformation process is exactly the same as the number of people leaving with a decision. Are clients (or their needs) an input? Can the number of clients and the potential waiting times affect this transformation process? If we assume that a court can deal with the same number of cases, even when new clients arrive and a long line of waiting people forms, the number of pending and/or incoming cases is not relevant to the transformation process. Conversely, if we assume that the negative externality created by the backlog, i.e., the delay in receiving justice, might affect the judges' efforts and decisions, then the demand for justice should be included in the production process. This is the only way to accept the workload as an uncontrollable input of the courts' productive process, which leads to the assumption that pending and incoming cases put pressure on judges, driving them to increase their performance. This is exactly the hypothesis proposed by Beenstock and Haitovsky (2004), according to which, in order to reduce the negative externalities caused by delay, judges adapt their efforts proportionally to the workload. This proposition is coherent with the current literature, which suggests using environmental variables as potential inputs, whether or not they actually affect the production process.
2 According to Simar and Wilson (2007), the one-stage DEA procedure aims to estimate and analyze efficiency, while the two-stage DEA procedure uses the estimated scores to study the determinants of inefficiency.
3 For a survey, see Muniz (2002).
However, what might be the most appropriate way to handle these uncontrollable environmental variables? An alternative approach might be the use of a resolution index as output, as put forward by Yeung and Azevedo (2011). Consistent with the hypothesis that the demand for justice (and the related long line of waiting people) might be a determinant of court productivity, they suggest including the workload within the output. In other words, they introduce a resolution index as output, normalizing the number of settled cases by the demand for justice. However, Yeung and Azevedo (2011) do not consider the judicial case matters, as suggested by Kittelsen and Førsund (1992) and by Santos and Amado (2014). How can we compare the performance of two courts with different amounts of demand for justice? In other words, assuming that each case matter is a production line with its own technology (i.e., a specific judicial procedure), how can we compare courts displaying significant differences on the demand side? In order to account for differences in demand, researchers should disaggregate the supply of justice according to its production lines. Doing so would provide a more realistic estimation of court performance. The key idea behind this approach is that every case matter has a different civil procedure, that is to say, a different technology to produce the expected output (i.e., justice). For example, there are very large differences in the procedures followed to settle a litigious divorce and a bankruptcy case. Without accounting for these differences, we cannot properly estimate the efficiency of courts and we might even identify incorrect reference values for a judicial reform. Such misestimation might, in turn, lead policy makers and/or public managers to the wrong conclusions and, ultimately, to the implementation of the wrong reforms.
Therefore, it is essential to properly define the output of this productive process, as well as the role played by case matters and caseload in the estimation of court efficiency. From a methodological point of view, these are exactly the goals of our research.

Methodology and data
The methodology applied in this work to estimate court performance is Data Envelopment Analysis (DEA). This section presents a technical overview of this methodology, highlighting the inputs and outputs adopted, as well as the model definitions.

Data Envelopment Analysis (DEA)
DEA has been applied extensively in the last 40 years (Emrouznejad & Yang, 2017). It has been adopted to study the performance of public institutions such as health care (e.g., Mitropoulos, Talias, & Mitropoulos, 2015; Pulina, Detotto, & Paba, 2010), police forces (e.g., Drake & Simper, 2004), universities (e.g., Fandel, 2007), and the judiciary (e.g., Peyrache & Zago, 2016; Santos & Amado, 2014). It is a non-parametric technique that allows efficiency to be measured as a score (Cook & Seiford, 2009) by means of a benchmark analysis. Indeed, the DEA approach lets researchers build a deterministic, non-parametric production frontier comparing the performance of several Decision Making Units (DMUs), which in our case are the courts of first instance. Technical efficiency scores are computed based on the radial distance of every DMU from the frontier (Charnes, Cooper, & Rhodes, 1978; Coelli, Rao Prasada, & Battese, 1998; Färe & Grosskopf, 1996). Here we use the output-oriented model, as proposed by Farrell (1957), assuming Variable Returns to Scale (VRS) (Banker et al., 1984). As explained in Ippoliti and Falavigna (2012), a technical efficiency score TE_i is computed for each first instance court i (i.e., each of our DMUs), where n is the number of DMUs and 1 ≤ TE_i ≤ +∞. The TE_i scores are obtained by solving a dual linear programming problem, on the basis of the output-oriented DEA approach (Farrell, 1957), in which z is a scalar ≥ 1, λ is an n×1 vector of weights allowing for convex combinations of inputs and outputs, Y is an s×n output matrix, X is an input matrix, and N1 is an n×1 vector of ones. Furthermore, z − 1 indicates the proportional output increment attainable while keeping the input level constant. The results of the DEA methodology are technical efficiency scores referring to each court and representing its position in relation to the frontier (i.e., the benchmark).
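For readers who want the programme written out, the following is a sketch of the output-oriented VRS problem in the form commonly given in DEA textbooks, using the symbols defined above. The vectors y_i and x_i (outputs and inputs of the DMU under evaluation) are names assumed here for illustration, and reading TE_i as the optimal value of z is likewise an assumption rather than a formula recovered from the source.

```latex
\begin{aligned}
\max_{z,\,\lambda}\quad & z \\
\text{s.t.}\quad & -z\,y_i + Y\lambda \ge 0, \\
& x_i - X\lambda \ge 0, \\
& N1^{\top}\lambda = 1, \\
& \lambda \ge 0 .
\end{aligned}
```

The constraint N1'λ = 1 is what imposes Variable Returns to Scale; dropping it would give the constant-returns version of the same problem.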
In detail, the scores indicate the ability of each first instance court to maximize the proposed output, given the available resources. Inputs and outputs are defined based on our model definition (see Sections 3.2 and 3.3).
Note that, according to Simar and Wilson (2007), in order to compare the results of different model definitions, we calculate the reciprocal of the estimated scores (i.e., 1/ technical efficiency score).
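The computation described in this sub-section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes SciPy's `linprog` as the solver, stores DMUs as columns, and uses a toy data set with one input and one output that is entirely hypothetical. The function name `dea_output_vrs` is also an assumption.

```python
import numpy as np
from scipy.optimize import linprog


def dea_output_vrs(X, Y):
    """Output-oriented VRS DEA scores (z >= 1), one per DMU.

    X: (m, n) input matrix, Y: (s, n) output matrix; columns are DMUs.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m, n = X.shape
    s = Y.shape[0]
    scores = np.empty(n)
    for i in range(n):
        # Decision vector is [z, lambda_1, ..., lambda_n].
        # linprog minimises, so maximising z means minimising -z.
        c = np.concatenate(([-1.0], np.zeros(n)))
        # Output constraints: z * y_i - Y @ lambda <= 0
        a_out = np.hstack((Y[:, [i]], -Y))
        # Input constraints: X @ lambda <= x_i
        a_in = np.hstack((np.zeros((m, 1)), X))
        a_ub = np.vstack((a_out, a_in))
        b_ub = np.concatenate((np.zeros(s), X[:, i]))
        # VRS convexity constraint: the weights sum to one.
        a_eq = np.concatenate(([0.0], np.ones(n))).reshape(1, -1)
        bounds = [(None, None)] + [(0, None)] * n
        res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=[1.0],
                      bounds=bounds, method="highs")
        if not res.success:
            raise RuntimeError(f"LP failed for DMU {i}: {res.message}")
        scores[i] = res.x[0]
    return scores


# Hypothetical toy data: 4 DMUs, 1 input (e.g., judges), 1 output (e.g., settled cases).
X_toy = [[2, 4, 6, 4]]
Y_toy = [[2, 6, 7, 3]]
z_scores = dea_output_vrs(X_toy, Y_toy)
# Reciprocal scores lie in (0, 1], which eases comparison across model definitions.
reciprocal_scores = 1.0 / z_scores
```

In this toy example the fourth DMU uses the same input as the second but produces half its output, so its score is 2 and its reciprocal is 0.5; the other three lie on the frontier.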

Output estimation
Our approach includes two outputs: the number of cases settled and a resolution index. In both cases, we estimate the outputs considering the aggregate supply of justice (all case matters together), as well as its disaggregated supply (one output per case matter). As highlighted in Section 2, the number of settled cases is the most common output currently found in the literature (e.g., Finocchiaro Castro & Guccio, 2015; Peyrache & Zago, 2016), whereas the resolution index has been proposed only by Yeung and Azevedo (2011).
The resolution index is estimated as follows:
$$\text{Resolution index}_{i,t} = \frac{\text{settled cases}_{i,t}}{\text{workload}_{i,t}} \times 100$$
where i represents the i-th first instance judicial district considered at year(s) t, while the workload is given by pending cases (at the beginning of the year) plus incoming cases (during the year), with the ratio normalized by 100 (Yeung & Azevedo, 2011). The resolution index is an evolution of the clearance rate since, in this case, the denominator is given by the workload instead of the incoming cases alone. The novelty of this index is that it can estimate court performance without treating the demand for justice as an uncontrollable input.
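The difference between the resolution index and the clearance rate can be illustrated with a short sketch. The function names and the case counts below are hypothetical, and the formula follows the verbal definition given above (settled cases over pending plus incoming cases, scaled by 100).

```python
def resolution_index(settled, pending_start, incoming):
    """Settled cases as a share of the workload (pending + incoming), times 100."""
    workload = pending_start + incoming
    if workload <= 0:
        raise ValueError("workload must be positive")
    return 100 * settled / workload


def clearance_rate(settled, incoming):
    """Settled cases as a share of incoming cases only, times 100."""
    if incoming <= 0:
        raise ValueError("incoming cases must be positive")
    return 100 * settled / incoming


# Hypothetical district: 1,200 cases pending on 1 January, 800 filed, 900 settled.
ri_aggregate = resolution_index(settled=900, pending_start=1200, incoming=800)
cr_aggregate = clearance_rate(settled=900, incoming=800)

# Disaggregated version: one index per case matter (hypothetical figures).
by_matter = {
    "litigious divorce": {"settled": 100, "pending_start": 300, "incoming": 100},
    "default": {"settled": 800, "pending_start": 900, "incoming": 700},
}
ri_by_matter = {name: resolution_index(**d) for name, d in by_matter.items()}
```

Here the district clears more cases than it receives (clearance rate above 100) while still resolving less than half of its workload, which is the backlog effect the resolution index is meant to capture.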

Model definition
Consistent with the previous sub-section, we propose several model definitions (Table 2). On the one hand, models A and B examine the difference between treating the number of settled cases as a single aggregate output (model A) and as a disaggregated series of outputs, one per case matter (model B), with the aggregate demand for justice adopted as an uncontrollable input. On the other hand, models C and D examine the difference between treating the resolution index as a single aggregate output (model C) and as a disaggregated series of outputs, one per case matter (model D), so that both the demand and the supply of justice are included in the estimated index. By following this approach, which relies on comparing two series of outputs, we can collect more robust results.
Focusing on the Italian case study, we have identified 13 civil case matters for our output estimation: pension, default application, default, regular execution, real estate execution, consensual separation, litigious separation, consensual divorce, litigious divorce, special procedure, private and public labour, ordinary jurisdiction, and other. As for the inputs, we have collected data on judges and three administrative levels of staff, as well as the aggregated demand for justice (i.e., workload) for models A and B.

Data: the Italian judicial system
The Italian Ministry of Justice is in charge of administering civil and criminal justice, which is divided into two main tiers and one lowest level. At the lowest level are the so-called Justices of the Peace (i.e., Giudici di Pace), with specific civil and criminal competences. At a higher level, the first tier includes first instance courts (i.e., Tribunali Ordinari), which, gathering together the aforementioned justices of the peace, are part of the first instance districts (i.e., Circondari Giudiziari). In the period considered (i.e., 2011), there were 165 first instance districts, which represent the observations of our study. The second tier comprises 26 second instance districts (i.e., Distretti di Corte di Appello), each with a variable number of first instance districts and responsible for appeals against first instance judgments. Finally, there is also a court of last resort (i.e., Corte Suprema di Cassazione), seated in Rome and acting as the highest appellate court in all civil and criminal cases. Considering 2011, Table 3 illustrates the heterogeneity of first instance courts, according to Italy's five macro areas (i.e., North-West, North-East, Centre, South, and Islands) and second instance districts. More precisely, the table highlights both the demand and supply of justice, as well as the human resources involved in the production process.
Looking at the numbers, we can observe the extent of the phenomena under investigation. On the one hand, pending civil cases amount to more than 3 million, while, on the other hand, the number of incoming cases is also close to 3 million. These figures are even more significant if we consider that there are only 20 thousand workers tasked with processing the whole caseload (i.e., around 4 thousand judges and 16 thousand staff).
Tables 4 and 5 present some other descriptive statistics about inputs and outputs based on the selected case study (i.e., the Italian judicial system) and the four model definitions proposed. In detail, the data refer to Italian civil justice in 2011, considering 164 first instance courts (see Figure A.1 in the Annex for the judicial geography and the competence of the DMUs analysed). The staff is disaggregated into three levels, depending on professional position: the third level comprises executives with the highest responsibilities, the second level includes the [...]
Table 6 shows the time needed to settle a case according to case matter in 2011, which is a good proxy for the technologies used by judges along their production lines. For example, considering litigious and non-litigious household dissolutions, significant differences clearly emerge among case matters. On average, focusing on litigious dissolutions in 2011, 663 days were necessary for the first step (i.e., litigious separation) and another 702 days for the second step (i.e., litigious divorce), which adds up to a total period of almost 4 years. As for non-litigious household dissolutions in the same year, on average, only 218 days were necessary. These long settlement times can be ascribed to litigiousness between parties and/or the lawyers' opportunistic behaviour (Felli, Londoño-Bedoya, Solferino, & Tria, 2007), but the current procedures undoubtedly play a key role.
Table A.2 shows the percentage of workload by case matter in 2011, highlighting the different amounts of demand for justice dealt with by our DMUs, according to case matters and judicial districts of second instance, as well as geographical macro areas. Finally, further information is presented in Figure A.2 in the Annex, which includes maps regarding justice demand and supply with respect to the available human resources.

Results
Table 7 shows the technical efficiency scores according to Italy's five macro areas (i.e., North-West, North-East, Centre, South, and Islands) and second instance districts. On average, the technical efficiency score in model A is equal to 0.7417, with the North of Italy as the most efficient area (i.e., 0.8351 for the North-West and 0.8500 for the North-East). However, the gap between the North and the South of Italy decreases if we consider the disaggregated supply of justice. On average, the technical efficiency score rises by 17.16 percentage points when model B is adopted. These improvements are greater in the South of Italy (i.e., 25.56 percentage points) and Islands (i.e., 21.22 percentage points), while they are significantly smaller in the North of Italy (i.e., 10.06 percentage points in the North-West and 10.08 in the North-East).
Looking at models C and D, a similar scenario emerges. On average, the technical efficiency score in model C is equal to 0.6611, with the North of Italy as the most efficient area (i.e., 0.7840 for the North-West and 0.7741 for the North-East). Again, the gap between the North and the South of Italy decreases if we consider the disaggregated supply of justice. Adopting model D, the average technical efficiency score rises by 26.69 percentage points. These improvements are greater in the South (i.e., 36.89 percentage points) and Islands (i.e., 35.87 percentage points), while they are significantly smaller in the North of Italy (i.e., 15.93 and 18.21 percentage points in the North-West and North-East, respectively).
What about models B and D? Analyzing the results presented in Table 7, we can identify a significant difference between the specifications of models B and D in only one case. On average, the technical efficiency score rises by 1.47 percentage points when model D is adopted, with a considerable improvement only in the Islands macro area (i.e., 7.73 percentage points). In the other macro areas, the average scores collected using the two model specifications are almost the same (i.e., differences equal to 0.76 percentage points in the North-West, 0.54 in the North-East, 0.04 in the Centre and 0.24 in the South). Accordingly, only the gap between the North of Italy and the Islands decreases if we include the workload in the resolution indexes.
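A quick arithmetic check helps in reading the reported increases. The sketch below uses only figures quoted in the text (the averages of models A and C, the three average increases, and the model D average of 0.9280 reported alongside Figure 1). It assumes that the increases are additive differences in score, i.e., percentage points, and shows that under this reading the figures reconcile, whereas a relative reading does not. Variable names are illustrative.

```python
# Figures quoted in the text.
avg_a = 0.7417          # national average, model A
avg_c = 0.6611          # national average, model C
avg_d_reported = 0.9280  # national average, model D (reported with Figure 1)

gain_a_to_b = 0.1716    # "17.16" read as percentage points
gain_c_to_d = 0.2669    # "26.69" read as percentage points
gain_b_to_d = 0.0147    # "1.47" read as percentage points

# Additive reading: both routes should arrive at the model D average.
avg_b = avg_a + gain_a_to_b
avg_d_via_c = avg_c + gain_c_to_d
avg_d_via_b = avg_b + gain_b_to_d

# Relative reading for comparison: C increased by 26.69% of its own value.
avg_d_relative = avg_c * (1 + gain_c_to_d)

print(round(avg_b, 4), round(avg_d_via_c, 4), round(avg_d_via_b, 4), round(avg_d_relative, 4))
```

Both additive routes give the same model D average, which matches the reported reference value; the relative reading falls well short of it.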
These results become even more important if we compare the DMUs in relation to potential reference values that policy makers may use to reorganize the judicial system, based on the technical efficiency of the courts.
Using the national average value of model D (i.e., 0.9280) as vertical axis and the national average value of model C (i.e., 0.6611) as horizontal axis, Figure 1 highlights the efficiency gap between the reference values and the TE scores of DMUs located in the South and in the North of Italy (i.e., North-West and North-East). Two cases appear to be particularly interesting: the quadrant with DMUs having TE scores that are under the national average in model C but over the reference value in model D, as well as the quadrant with DMUs having TE scores that are over the average in model C but under the average in model D. Readers can observe the prevalence of DMUs located in the South of Italy in the former quadrant (e.g., courts of Ariano Irpino and Avellino) and the prevalence of DMUs located in the North of Italy in the latter quadrant (e.g., courts of Parma and Varese). The second instance district of Campobasso is an even more significant example since, according to model C, it is among the worst performing second instance districts in Italy, while, adopting model D, this district is close to the efficiency frontier (i.e., 0.9661).
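The quadrant reading of Figure 1 amounts to comparing each court's two scores with the two national averages. A minimal sketch follows; the two reference values are those reported above, while the court names and scores are hypothetical placeholders, not data from the study.

```python
AVG_MODEL_C = 0.6611  # national average reported for model C
AVG_MODEL_D = 0.9280  # national average reported for model D


def quadrant(te_c, te_d, ref_c=AVG_MODEL_C, ref_d=AVG_MODEL_D):
    """Locate a court relative to the two reference values."""
    side_c = "over" if te_c >= ref_c else "under"
    side_d = "over" if te_d >= ref_d else "under"
    return f"{side_c} average in model C, {side_d} average in model D"


# Hypothetical courts: (score in model C, score in model D).
example_courts = {
    "court_1": (0.60, 0.95),
    "court_2": (0.70, 0.90),
    "court_3": (0.80, 0.97),
    "court_4": (0.55, 0.85),
}
quadrants = {name: quadrant(c, d) for name, (c, d) in example_courts.items()}
```

Courts like `court_1` and `court_2` fall in the two off-diagonal quadrants, where the aggregate and disaggregated models disagree about whether the court is above or below the national benchmark.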
Finally, several t-tests are performed to reject the hypothesis (H0) that there are no statistically significant differences among the models and the technical efficiency scores collected. Based on our results, we can reject H0 both considering model A versus model B (t equal to 16.279) and model C versus model D (t equal to 15.5208). These results are , while in the other three macro areas (i.e., North-East, Centre, and South) there are no statistically significant differences between these two model definitions.
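Since each court is scored under every model, one natural way to compare two model definitions is a paired t-test on the per-court score differences. The text does not state which variant of the t-test was used, so the paired form below is an assumption, and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_1, scores_2):
    """t statistic for the mean per-DMU difference (scores_2 minus scores_1)."""
    if len(scores_1) != len(scores_2):
        raise ValueError("both models must score the same DMUs")
    diffs = [b - a for a, b in zip(scores_1, scores_2)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two DMUs are needed")
    # Standard error of the mean difference uses the sample standard deviation.
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Hypothetical scores for four courts under an aggregate and a disaggregated model.
aggregate_scores = [0.60, 0.70, 0.50, 0.80]
disaggregated_scores = [0.70, 0.90, 0.80, 0.90]
t_stat = paired_t_statistic(aggregate_scores, disaggregated_scores)
```

The statistic would then be compared with the critical value of a t distribution with n − 1 degrees of freedom.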

Discussion
The last major reform of the Italian judicial system, aimed at reorganizing the competence of first instance courts, was based on reference values identified by technical commissions (Ippoliti, 2015). As a consequence of that reform, more than 30 courts were suppressed in 2013. What would have happened if the decision-making process followed by policy makers to design that reform had been based on our models? More precisely, adopting the average technical efficiency scores of models C and D as reference values, what would be the sensitivity and specificity of our stratification rule? Based on these questions, we can explore the policy implications of our work. Using model D as the real efficiency value of our DMUs and the average national technical efficiency score of model C as reference value, as highlighted in Figure 1, we collect 11 false positives (i.e., courts which are incorrectly classified as more efficient than the average) and 18 false negatives (i.e., courts which are incorrectly classified as less efficient than the average), with expected sensitivity equal to 74.29% and specificity equal to 72.50%. Referring to the last judicial reform in Italy, based on average reference values, the false negatives would represent all the courts that were incorrectly suppressed, while the false positives would represent all the courts that were incorrectly preserved. According to our results, the negative predictive value would be equal to 61.70%, which translates into a false omission rate of 38.30%. This means that we cannot reject the hypothesis that either considering or disregarding the disaggregated supply of justice in the models affects the policy makers' decision-making process based on evidence and reference values. Obviously, the policy implications of these classifications would be significant, since 38.30% of the courts classified as less efficient than the average would have been incorrectly suppressed.
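The classification rates quoted above follow from a standard confusion matrix. The text reports the false positives (11) and false negatives (18); the true positive and true negative counts used below (52 and 29) are not stated in the text and are back-solved here from the reported sensitivity and specificity, so they should be read as an assumption.

```python
def classification_rates(tp, fn, tn, fp):
    """Standard rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # efficient courts correctly flagged
        "specificity": tn / (tn + fp),          # inefficient courts correctly flagged
        "negative_predictive_value": tn / (tn + fn),
        "false_omission_rate": fn / (tn + fn),  # share of "negatives" that are wrong
    }


# fp and fn are reported in the text; tp and tn are inferred (assumption).
rates = classification_rates(tp=52, fn=18, tn=29, fp=11)
rates_percent = {name: round(100 * value, 2) for name, value in rates.items()}
```

Note that the false omission rate is a share of the courts classified as below average (47 under these counts), not of all DMUs in the comparison.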
Note that these results might hint at the presence of data aggregation bias, that is to say, a model specification with aggregated cases as output might bias the collected results and the consequent benchmark analysis. Accordingly, it is paramount to work at the highest level of detail whenever possible (i.e., with disaggregated cases as output), so that the results are not affected by this type of bias, which could mislead policy makers in developing a reform process of the judicial system through benchmarks.
Focusing on our specific case study, the most relevant result lies in the gap between the North and the South of Italy, that is to say, differences among courts in terms of efficiency. Indeed, all the models suggest that the courts located in the North of Italy perform better than those located in the South. However, taking the disaggregated supply of justice into account (models B and D), this gap decreases dramatically, as illustrated in Table 7. Moreover, by contrasting the models with and without case matters, policy makers can estimate the inefficiency linked to the adopted technologies (i.e., civil procedures). This means that, by working on these civil procedures, the Italian government would be able to implement appropriate interventions to reduce the judiciary gap among macro areas. There is only one final open issue: which might be the most appropriate model?
Both models B and D are major improvements on the current approaches, since they consider the various technologies applied in the production process, which are characterized by a high degree of diversity. However, there are different assumptions behind the proposed models. On the one hand, we can regard the demand for justice as a nondiscretionary input related to the environment in which the DMUs operate (i.e., uncontrollable input), assuming a general pressure effect due to caseload. On the other hand, we can incorporate the demand for justice into the supply to ensure a more precise estimation of court efficiency. Obviously, the second approach is more sophisticated, as it has to do with the ability of judges and staff to satisfy the demand for justice with respect to the case matters. Based on the assumptions behind the models (i.e., whether or not a pressure effect on judges does actually exist), either model definition can be adopted (i.e., B or D).

Conclusions
An appropriate policy decision-making process based on evidence requires correct model definition in order to implement a successful reform aimed at increasing court efficiency. Unless the model is suitably defined in the benchmark analysis, policy makers may be misled by false evidence into carrying out a wrong reorganization of the courts. These policy implications are even more relevant if the reform is based on reference values, as seen in the recent overhaul of Italy's judicial geography (Ippoliti, 2015).
The results of the case study analysed here highlight the need to reform the current civil procedures, which represent the technologies adopted by judges and staff in producing the expected output (i.e., justice). Our results are also coherent with demands for procedural reform put forth by scholars (Lanau et al., 2014), and the Italian government is indeed working in that direction by discussing whether to update some civil procedures (e.g., household dissolutions). The advantages of a progressive increase in court performance, by restyling the Italian civil procedures, are even more significant if we consider the current age of austerity, since policy makers would have the opportunity to improve the performance of this key sector without spending public resources.
Table A1. Disaggregated descriptive statistics on average number of days needed to settle a case by civil case matter (Italy, 2005-2010).
Figure A1. Italian judicial geography of first and second instance (Italy, 2011). First instance districts (i.e., Circondari Giudiziari). Second instance districts (i.e., Distretti di Corte di Appello).