Selection on performance and tracking

ABSTRACT Tracking is widely used in secondary schools around the world. Some countries place more emphasis on performance when assigning students to tracks (e.g. the Netherlands), while in others parents have more influence on the track their child attends (e.g. Germany). This article examines whether selecting students into tracks on the basis of performance affects the relation between tracking and student performance and between tracking and educational opportunities. Using data from the Programme for International Student Assessment on around 185 000 students in 31 countries, we compare different estimation models. The results indicate that a highly differentiated system yields the best performance when schools always consider prior performance in deciding on student admission. In systems with few tracks, there is no such effect. Equality of opportunity is also best served by a system with many tracks in which schools always consider prior performance.


I. Introduction
Tracking students in secondary school into distinct educational programmes of different ability levels influences students in a number of ways: in tracked systems, students are separated into tracks with different peers, different curricula and different teachers, whereas in non-tracked systems all students in principle share the same learning environment. Previous research does not show consistent positive or negative effects of tracking on student performance (e.g. Hanushek and Woessmann 2006; Ariga and Brunello 2007; Elk, van Der Steeg, and Webbink 2011). Regarding inequality of outcomes between students with different parental backgrounds (PB, which captures, among other things, parental education and income), tracking is often found to reinforce inequality (e.g. Ammermueller 2005; Schuetz, Ursprung, and Woessmann 2008), although on inequality, too, there is no consensus (e.g. Walldinger 2006; Brunello and Checchi 2007).
A possible explanation for the mixed results in the literature is that none of the previous papers has looked at how tracking is implemented. Tracking could, for instance, have different effects on student performance and inequality depending on how students are selected into tracks. Both the possible unwanted increase in the effect of PB and the possible positive or negative effect on student learning depend on how track placement is done. If track placement is based purely on ability, the effect of PB is reduced because parents cannot influence track choice directly, while tracks are homogeneous in their ability composition, which could increase learning. If track placement is not based on ability but, for instance, on PB, increased learning might not occur because the tracks are not homogeneous in ability, and the effect of PB naturally increases.
The aim of this article is to investigate whether using prior performance to select students into tracks affects the relation between tracking and student performance and between tracking and educational opportunities. The number of tracks does not vary within an education system. In most countries, however, school principals have considerable freedom in how they select students into tracks. To investigate our question, we use data from the Programme for International Student Assessment (PISA) 2009 for 31 countries on whether school principals consider prior performance when admitting students to the school. We find that students who attend schools whose principals consider prior performance in a highly differentiated system have higher test scores and a lower impact of PB than students in a comprehensive system. Comparing different estimation models, we find it unlikely that our results are driven by selection, that is, by more able students attending schools whose principals consider prior performance when admitting students.

Some countries have national policies on tracking (e.g. Germany, the Netherlands), while others let schools decide whether and how to implement informal tracking (e.g. the United States, Sweden). The manner in which students are placed into tracks also differs widely across countries. In the Netherlands, for instance, all elementary school students take an obligatory exit test, and the elementary school teacher gives an obligatory recommendation on the most suitable track; the secondary school admits students to specific tracks mainly on the basis of that test and that recommendation. In most northern German states, however, parents have the right to persuade schools to admit their child to the highest tracks, there is no exit test and the teacher's recommendation is only optional (Dollmann 2011). Higher-educated parents are often more determined to ensure that their child attends a higher track, which gives their children an advantage. Of these two countries, only the Netherlands places students into tracks based on performance (a proxy for ability that combines ability with motivation). In Germany, by contrast, one would expect a strong effect of PB on track placement, and consequently on student performance, and this is indeed the case (Dustmann 2004).
These two examples show that the method of selecting students into tracks could influence the relation between tracking and student performance or educational opportunities.
To take into account that school principals who consider prior performance may do so in order to admit only the best students to their school, we compare the results of models that exploit between-school between-country variation, between-school variation only, and between-country variation only.¹ In the first model, we control for the best available internationally comparable measure of the individual student's track level. In the second model, we use only the between-school variation by including country fixed effects, which addresses country heterogeneity, for instance (but not exclusively) in track placement policies. In the third model, we use the national percentage of schools that consider prior performance and thus exploit only between-country variation, in order to eliminate the bias due to selective student admission. All three models show that tracking has a positive effect on student performance once the method of track placement is taken into account. The coefficients are smallest in the within-country model and largest, although insignificant, in the between-country model. If the relation were driven purely by selection on the part of school principals, this pattern would not be expected, since most of the variation would then be between schools within a given country.
The findings indicate that it is important to consider how school principals select students into tracks when studying the effects of tracking on performance or educational opportunities or, more generally, that school characteristics need to be taken into account when analysing education systems. This insight may also help explain why the literature finds mixed effects of tracking on both student performance and educational opportunities.
The structure of this article is as follows: Section II discusses theoretical insights into tracking and how schools can affect tracking. Section III describes the data used in the analysis, and Section IV lays out the empirical strategy. Section V presents the results. The last section concludes.

II. Tracking
Tracking is widely used in secondary school systems around the world. The most important differences between countries in the implementation of tracking concern the number of tracks available to 15-year-old students (most frequently ranging from two to five) and the age of selection into tracks (from 10 to 16). In countries that have tracking, tracks are institutionalized in different school types, often located in different buildings and administrative units, while in countries without tracking, non-institutionalized tracking can occur within schools, either by ability grouping (different classes within schools) or seating (different curricula within classes).²

¹ Another reason why schools that consider prior performance might not use it for track placement is that students' entry into secondary school might not coincide with the start of tracking. In most countries, however, the start of secondary school and the start of tracking coincide: only 7 of the 21 tracking countries do not start tracking at the beginning of secondary school. Russia's secondary school starts at age 10, but tracking starts at 14.5. Students in Luxembourg start secondary school at age 12, but tracking starts at 13. In Lithuania, the ages are 11 versus 15; in Italy, 11 versus 14; and in Israel, Ireland and Greece, 12 versus 15. Even for schools in these 7 countries, however, performance obtained earlier may still help schools with track placement.

Tracking is a form of imposing peer homogeneity on students, while it also offers students a more targeted curriculum. Each track aims to consist of a more or less homogeneous student population, depending on the number of tracks available. However, the effect of imposing peer homogeneity is not theoretically straightforward. First, removing the better-performing students from the lower tracks decreases the mean performance of the lower tracks, and the resulting lower level of peer performance can harm the performance of lower-ability students. In contrast, the performance of high-ability students, who are now surrounded by more high-ability peers, improves through positive spillovers. If peer effects work linearly through mean performance, as described above, we would expect no nationwide effect of tracking, since the positive and negative peer effects on performance cancel each other out. When peer effects are non-linear, on the other hand, tracking can have nationwide effects. Theoretical models of non-linear peer effects support either positive or negative effects of tracking (see Sacerdote 2011). For instance, when high-ability students in particular benefit from high-ability peers, tracking has a positive overall effect; when low-ability students in particular benefit from high-ability peers, tracking has a negative effect.
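The aggregation argument can be checked with a small simulation. The sketch below assumes a hypothetical score equation, score = a + gamma * m + delta * a * m, where a is a student's own ability and m the mean ability in the student's class; gamma is the linear-in-means peer effect and delta lets the benefit of able peers depend on own ability. The functional form, the ability values and both parameters are illustrative assumptions, not estimates from this article. It compares one mixed class with a split into a low and a high track.

```python
from statistics import mean


def class_scores(abilities, gamma, delta):
    """Scores in one class under a hypothetical peer-effect specification.

    score_i = a_i + gamma * m + delta * a_i * m, where m is the class mean
    ability. delta > 0: high-ability students gain more from able peers;
    delta < 0: low-ability students gain more.
    """
    m = mean(abilities)
    return [a + gamma * m + delta * a * m for a in abilities]


def national_mean(classes, gamma, delta):
    """Student-weighted mean score over all classes in the system."""
    return mean(s for c in classes for s in class_scores(c, gamma, delta))


def tracking_gain(abilities, cut, gamma, delta):
    """National mean with two tracks minus the mean with one mixed class."""
    ranked = sorted(abilities)
    tracked = [ranked[:cut], ranked[cut:]]  # low track, high track
    return national_mean(tracked, gamma, delta) - national_mean([ranked], gamma, delta)


abilities = [-2.0, -1.0, 0.5, 1.0, 1.5, 3.0]  # illustrative values only
for delta in (0.0, 0.2, -0.2):
    gain = tracking_gain(abilities, 3, gamma=0.5, delta=delta)
    # round(...) + 0.0 avoids printing "-0.000" for floating-point noise
    print(f"delta = {delta:+.1f}: change in national mean = {round(gain, 3) + 0.0:+.3f}")
```

With delta = 0 the change is zero whatever the value of gamma, because the student-weighted average of the class means always equals the overall mean; a positive delta makes tracking raise the national mean and a negative delta makes it lower it, matching the two cases in the paragraph above.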
Second, peer homogeneity in tracks can benefit both high- and low-ability students when teachers target their teaching at the average performance of the class. In highly differentiated systems (i.e. systems with a large number of tracks), the top and bottom pupils are closer to the class average and can thus benefit from peer homogeneity when it allows them to learn more from the teacher.
In addition to imposing peer homogeneity on students, tracking exposes students to specialized curricula, which means that students in different tracks are taught at different levels of difficulty. As long as the specialized curricula are optimally designed for the average characteristics of the students in the track, they should increase performance.
Overall, we would expect a positive effect of tracking due to improved teaching strategies and adjusted curricula, while the effect of tracking through more homogeneous peer groups is theoretically uncertain. However, peer effects in the classroom are found to be very context-specific and, where they exist, modest in size (Sacerdote 2014), so we set them aside in forming our expectations. In sum, we therefore expect a positive effect of tracking because more homogeneous classes allow better-targeted teaching and curricula. Disentangling the separate effects of peers, adjusted curricula and adjusted teaching strategies is not possible in our set-up; we therefore look at all three effects combined.
The arguments in the paragraphs above assume that track placement is based on ability, for instance proxied by prior performance. However, as already mentioned, track placement is not always based on performance. Dustmann (2004) shows for Germany that PB is a strong predictor of track choice and that there is strong intergenerational immobility in track choice. When parents are free to send their child to any of the available tracks, they may choose the track they attended and/or the track they are familiar with. Schools could also select students based on artistic performance or on other criteria such as religion or residential area. When this happens, tracks are no longer homogeneous in performance.
We expect that schools using an objective measure to place students into tracks achieve greater performance homogeneity within tracks, which induces the positive effects of tracking described above. Schools that select students into tracks without knowledge of their observed abilities, or that base their selection on non-academic criteria, can severely limit these expected positive effects. Schools that have information on prior performance when they place students into tracks may be better able to ensure homogeneous classes than schools without such information.
We also expect that when schools use a performance measure to select students into tracks, the influence of parents is lower. Naturally, we do not assume that parents have no influence on school choice or performance when schools consider prior performance. Parents always influence a child's ability, directly through genes and/or indirectly through the environment they create for their children. We do assume, however, that as long as observed ability limits non-ability-related parental influence on track choice, the effect of PB is reduced.

III. Data
The student- and school-level data used in this article are from the 2009 wave of PISA, conducted by the OECD. These data include internationally comparable test scores in reading, mathematics and science, as well as information on students and schools. The country-level data are from the OECD and the World Bank.
The first wave of PISA was conducted in 2000; since then, a representative sample of students from all participating countries has been tested in reading, mathematics and science every 3 years. The test results are standardized to a mean of 500 and a standard deviation of 100 on the PISA 2000 reading test for the OECD countries.³ In addition to the tests, students and school principals are surveyed. A total of 75 countries participated in PISA 2009. Since these countries are diverse in their economic development, we use a selection of comparable Western countries to limit country heterogeneity. All 31 countries in this analysis have a Gross Domestic Product (GDP) per capita above the lowest GDP per capita among OECD member countries, as well as available data on national tracking policies.⁴ These sample restrictions are imposed to ensure that country differences do not drive the results; in addition, country fixed effects models are used to account further for country differences.
The OECD obtains a representative sample from each participating country in two stages: first, schools are selected, and then students of the target age are selected within these schools. The target age range is 15 years and 3 months to 16 years and 2 months (OECD 2010). Since not all selected schools and students were willing to participate, and some schools and students were oversampled to obtain extra information on these groups, the OECD provides weights to ensure that the sample is representative. The student sample in this analysis consists of all native students in (pre-)vocational or general education in schools where more than five students participated in PISA 2009.⁵ This amounts to 187 768 students in 7489 schools in 31 countries.
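As a minimal illustration of two sample steps described above, the sketch below drops students in schools with five or fewer PISA participants and then computes a weighted mean with student sampling weights. The record layout (`school`, `weight`, `score`) and the toy values are hypothetical; the sketch does not reproduce the OECD's weight construction or its variance estimation.

```python
from collections import Counter


def restrict_sample(students, min_school_size=6):
    """Keep students in schools with more than five participating students."""
    per_school = Counter(s["school"] for s in students)
    return [s for s in students if per_school[s["school"]] >= min_school_size]


def weighted_mean(students, key):
    """Mean of `key`, weighting each student by their sampling weight."""
    total_weight = sum(s["weight"] for s in students)
    return sum(s["weight"] * s[key] for s in students) / total_weight


# Hypothetical records: school A has six participants, school B only three.
students = (
    [{"school": "A", "weight": 1.0, "score": 500.0} for _ in range(4)]
    + [{"school": "A", "weight": 3.0, "score": 420.0} for _ in range(2)]
    + [{"school": "B", "weight": 2.0, "score": 600.0} for _ in range(3)]
)
sample = restrict_sample(students)
print(len(sample), weighted_mean(sample, "score"))
```

School B is dropped, and the two heavily weighted students in school A pull the weighted mean (452) below the unweighted mean of the remaining six students (about 473).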

Tracking and selection by schools
We define tracking as the separation of students into tracks that differ in academic orientation and curricula. The extent of tracking is measured by the 'number of school types or distinct educational programmes available to 15-year-olds' (OECD 2007, Table 5.2), shown in the first column of Table 1. This measure of tracking differs from those used in some other papers. For instance, Fuchs and Woessmann (2007) and Elk, van Der Steeg, and Webbink (2011) use the age at which a student is first selected into a track, while Hanushek and Woessmann (2006) and Schuetz, Ursprung, and Woessmann (2008) divide countries into early and late trackers. Our measure, however, is more consistent with the theoretical expectations of Section II: the more tracks there are, the more homogeneous each track is, and, as discussed in Section II, we expect the greatest effect of tracking to come from this imposed homogeneity. A disadvantage of using the number of tracks as our tracking measure is that it does not take into account the amount of time students spend in the tracks. Section VI addresses this issue.
³ The OECD provides five plausible values for the test scores, estimated using item response theory, since students do not answer all questions. Here only one plausible value is used.

⁴ The excluded countries are Australia, Canada, France, Mexico, Turkey and the United Kingdom. France is excluded because we have no data on its schools, including the school data needed on whether principals consider prior performance. Turkey is excluded because we have no information on the language spoken at home, which is an important individual-level background characteristic. The United Kingdom and Australia are excluded because the proxy for the student's track is missing for a large number of observations. Canada is excluded because all students are coded as being in neither vocational nor general education. Mexico is excluded because the values of many variables are outliers: the mean and/or maximum values of the parental background variable and of a number of school variables lie more than half a standard deviation from the mean and/or maximum values of the other countries.

⁵ We include only native students since the literature shows that native and migrant students respond differently to system characteristics (e.g. Dronkers, Van Der Velden, and Dunne 2012).

Besides a measure of class homogeneity due to tracking (the number of tracks), our framework in Section II requires a school-level variable capturing the mechanism by which students are selected into tracks. PISA 2009 contains a proxy for such a variable: an index based on a question to school principals on how often a student's record of academic performance (including placement tests) and recommendations of feeder schools are considered when admitting students to the school. Schools are divided into three categories: (1) schools where neither factor is considered, (2) schools where at least one of the factors is sometimes used to decide on admission, and (3) schools where at least one of the factors is always considered. In this article, whether schools consider prior performance for admission to the school is used synonymously with whether schools have performance criteria for placing students into tracks. This assumes that secondary schools with information on prior performance use it to decide on students' track placement. Table 1 gives an overview of the percentage of schools per country that consider prior performance. Countries differ substantially in this respect, ranging from Spain, where 79.6% of students are in schools that never consider prior performance, to Croatia, where 93.5% are in schools that always consider it. The type of school that considers prior performance also differs across countries: in the Czech Republic and Hungary, for instance, such schools are often upper secondary schools, while in Austria and Poland, they are mostly schools attended by students with a high PB. In general, village schools and schools without neighbouring schools are less likely to consider prior performance, whereas schools that serve more girls, vocational students or upper secondary students, schools with a higher average PB and schools with more teacher shortages are more likely to do so.
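The three-category index can be sketched as a small function. This sketch assumes, hypothetically, that the principal's answer for each of the two factors (academic record including placement tests, and feeder-school recommendation) is coded as 'never', 'sometimes' or 'always'; the actual PISA response codes may differ. The category is then the most frequent use of either factor, and the regression dummies take 'never' as the reference category.

```python
LEVELS = ("never", "sometimes", "always")


def entr_req(academic_record, feeder_recommendation):
    """Three-category index from the two admission factors.

    'always' if at least one factor is always considered, 'sometimes' if at
    least one is sometimes considered (and none always), otherwise 'never'.
    """
    for answer in (academic_record, feeder_recommendation):
        if answer not in LEVELS:
            raise ValueError(f"unexpected answer: {answer!r}")
    return max(academic_record, feeder_recommendation, key=LEVELS.index)


def entr_req_dummies(category):
    """Dummies for the regression, with 'never' as the reference category."""
    return {"sometimes": int(category == "sometimes"),
            "always": int(category == "always")}


print(entr_req("sometimes", "always"), entr_req_dummies(entr_req("never", "sometimes")))
```

Taking the maximum over an ordered scale is one way to encode the "at least one of the two factors" rule; the function names here are illustrative, not PISA variable names.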
As Table 1 shows, almost every country has schools in all three categories, that is, schools that never, sometimes and always consider prior performance. Thus, even in systems with a high number of tracks, some schools do not use prior academic performance or teacher recommendations to decide on admission. In the seven countries with four tracks, only 55% of students attend schools that always consider prior performance; in the two countries with five tracks, 18% of students attend schools that never consider it. Perhaps more surprisingly, even in comprehensive systems some schools consider prior performance when admitting students: in the 10 countries with only one track, 45% of students attended schools that consider prior academic performance. One reason could be that non-institutionalized forms of tracking (ability grouping or seating) also exist in countries without tracking, which could induce schools to select students based on prior performance. The mechanisms in those countries could work in ways similar to those described here. For instance, Lucas (1999) has shown that different methods of placement into non-institutionalized tracks in US schools can produce variation in the strength of the effect of prior performance and PB on student performance.

Control variables
The control variables are a standard set used in the literature. All student variables are collected through student surveys. This study controls for gender, age, PB,⁶ a dummy for (pre-)vocational as opposed to general education and a dummy for upper secondary as opposed to lower secondary education. PB is captured by a widely used OECD index of the student's economic, social and cultural status. The division between upper and lower secondary schools is based on the International Standard Classification of Education (ISCED) level, which provides internationally comparable standards for education levels.
The school-level variables are collected through a survey completed by school principals. School composition is captured by the school average and standard deviation of the PB of all students in the school, and by the percentage of students who speak a language other than the test language at home.⁷ School inputs are captured by the student-teacher ratio, an index of possible teacher shortages, dummy variables indicating whether the school is hindered by a shortage of instructional material, and an index indicating whether the school is responsible for curriculum and assessment. Other school characteristics, all obtained from the school principal survey, indicate whether school achievement is tracked by an educational authority; whether the school is public, private government-dependent or private government-independent; whether the school competes with no, one, or two or more other schools for students; the school location; school size; and whether the school uses ability grouping. We also control for GDP per capita in PPP terms (constant 2005 international dollars) for 2008, taken from the World Bank (2012). Table 2 provides descriptive statistics for all variables.

IV. Empirical strategy
The aim of this article is to investigate whether using performance to select students into tracks changes the relation between tracking and student performance and educational opportunities. To answer this question we make use of three models: a between-schools between-countries model depicted in Equation (1), a within-country model depicted in Equation (2) and a between-countries model depicted in Equation (3).
⁶ PB is an important control variable since it is well established that parents have a large influence on student performance. Among our 31 countries, the correlation between PB and the reading score ranges from 0.23 to 0.49, with the lowest correlation in Iceland and the highest in Hungary. Countries without tracking at age 15 have a correlation of 0.30, while countries with tracking have a correlation of 0.37.

⁷ To calculate the school PB composition, both native and immigrant students are used.
In these equations, $\text{Test}_{isc}$ is the PISA test score in reading, mathematics or science of student $i$ in school $s$ in country $c$. $\text{Student}_{isc}$ is a matrix of student variables, $\text{School}_{sc}$ is a matrix of school variables, and $\text{EntrReq}_{sc}$ is a matrix containing the dummies for whether schools consider prior performance. $\text{No. of Tracks}_c$ is a vector containing the number of tracks available to students in each country, and $\text{GDPpc}_c$ is a vector containing GDP per capita. Our main coefficients of interest in Equation (1) are $\beta_3$, $\beta_4$ and $\beta_5$, which capture the effect of the number of tracks combined with whether the school principal considers prior performance.⁸ Compared with the main model in Equation (1), model (2) adds country fixed effects, $C_c$, while model (3) uses the national percentage of schools whose principals sometimes or always consider prior performance when admitting students, $\text{Nat\% EntrReq}_c$, instead of the school-level variable. We use random effects models, estimated by maximum likelihood, with separate error terms for countries, schools and individuals, since students are nested within schools within countries. Weighting is used to ensure representative samples.⁹ The control variables are as described in Section III.
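The country-level variable used in model (3) can be sketched as an aggregation of the school-level categories. This is a minimal sketch under two assumptions not stated in the text: schools are counted unweighted, and separate national percentages are kept for the 'sometimes' and 'always' categories, mirroring the school-level dummies. Country labels and data are hypothetical.

```python
from collections import defaultdict

CATEGORIES = ("never", "sometimes", "always")


def national_shares(schools):
    """Percentage of schools in each EntrReq category, per country.

    `schools` is an iterable of (country, category) pairs, one per school.
    Schools are counted unweighted (an assumption of this sketch).
    """
    counts = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0))
    for country, category in schools:
        counts[country][category] += 1
    return {
        country: {cat: 100.0 * n / sum(c.values()) for cat, n in c.items()}
        for country, c in counts.items()
    }


schools = [("A", "always"), ("A", "always"), ("A", "sometimes"), ("A", "never"),
           ("B", "never"), ("B", "sometimes")]
shares = national_shares(schools)
print(shares["A"], shares["B"])
```

Every student in a country would then receive that country's percentages in place of the school-level dummies, so the coefficients on these variables are identified only from differences between countries.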
We analyse three models because we are concerned with multiple selection issues. First, in addition to our standard model (Equation (1)), we estimate a country fixed effects model (Equation (2)) to rule out that country heterogeneity biases the results. With country fixed effects, we compare only outcomes of students in schools whose principals do or do not consider prior performance, given the country and hence given the number of tracks available in that country. The interaction between the number of tracks and whether principals consider prior performance gives the estimates we are interested in; the main tracking effect drops out and is absorbed by the country fixed effects. Second, we also use between-country analyses (Equation (3)) to rule out an alternative explanation for why school principals would consider prior performance, namely to select the best students, and thus the best performance, for their school ('cherry picking'). Even given our alternative models, it could still be that all countries with many tracks and with schools that select students differ from all countries that do not. Our country fixed effects models would not be able to control for this. Furthermore, although we control for a large number of school characteristics and also check for sorting into schools using our between-country analyses, there could exist unobservables that are related to student performance, tracking and whether schools select students based on performance, and that would cancel out at the country level.
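Why the main tracking effect drops out with country fixed effects can be seen from the within transformation: in a linear regression, including country dummies yields the same slope estimates as demeaning every variable within country. The sketch below applies that transformation to hypothetical school records; it only illustrates which regressors keep within-country variation and is not the article's maximum-likelihood estimator. It assumes the interaction uses the number of tracks minus one.

```python
from collections import defaultdict
from statistics import mean


def demean_within(rows, group_key, var):
    """Within transformation: subtract the group mean of `var` from each row."""
    values = defaultdict(list)
    for row in rows:
        values[row[group_key]].append(row[var])
    group_means = {g: mean(v) for g, v in values.items()}
    return [row[var] - group_means[row[group_key]] for row in rows]


# Hypothetical schools in two countries with different numbers of tracks.
schools = [
    {"country": "A", "tracks": 5, "always": 1},
    {"country": "A", "tracks": 5, "always": 0},
    {"country": "B", "tracks": 1, "always": 1},
    {"country": "B", "tracks": 1, "always": 0},
]
for s in schools:
    # Interaction with the number of tracks minus one (0 for a one-track system)
    s["always_x_tracks"] = s["always"] * (s["tracks"] - 1)

print(demean_within(schools, "country", "tracks"))           # all zeros
print(demean_within(schools, "country", "always_x_tracks"))  # variation remains
```

Because the number of tracks is constant within a country, its demeaned values are all zero and its coefficient cannot be estimated, whereas the interaction still varies between schools within country A.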
To study the hypotheses on student performance, the main focus lies on the interaction between tracking and whether schools consider prior performance, $\text{EntrReq}_{sc} \times \text{No. of Tracks}'_c$. Our hypothesis is that this interaction is positive: school principals who consider prior performance in selecting students into tracks are more likely to achieve positive tracking effects because of more homogeneous classes.
To examine whether the effect of PB is lower when track placement is based on prior performance, we use Equations (4) and (5). Equation (4) contains an interaction between PB and the number of tracks in a country and shows whether the effect of PB is larger in countries with more tracks. Equation (5) adds the interaction between the number of tracks, PB and whether school principals consider prior performance. This triple interaction shows whether the effect of PB is lower when track placement is based on ability. The PB of students is included in all models in the vector $\text{Student}_{isc}$.
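In the specification of Equation (5), the marginal effect of PB on test scores is the coefficient on PB within rho_1, plus rho_5 times Tracks', plus rho_6 times Tracks' times EntrReq; Equation (5) contains no separate PB × EntrReq term. The sketch below evaluates this gradient. It assumes that Tracks' is the number of tracks minus one (zero to four) and that EntrReq enters as 'sometimes' and 'always' dummies with 'never' as reference; all coefficient values are placeholders, not estimates from this article.

```python
def pb_gradient(rho_pb, rho5, rho6, tracks_prime, entr_req):
    """Marginal effect of PB on the test score implied by Equation (5).

    rho_pb:       coefficient on PB within rho_1
    rho5:         coefficient on PB x Tracks'
    rho6:         coefficients on PB x Tracks' x EntrReq, keyed by category;
                  'never' is the reference category and has no entry
    tracks_prime: number of tracks minus one (0 to 4)
    entr_req:     'never', 'sometimes' or 'always'
    """
    return rho_pb + rho5 * tracks_prime + rho6.get(entr_req, 0.0) * tracks_prime


# Placeholder coefficients for illustration; not estimates from this article.
rho6 = {"sometimes": -0.5, "always": -2.0}
for category in ("never", "sometimes", "always"):
    print(category, pb_gradient(30.0, 3.0, rho6, tracks_prime=4, entr_req=category))
```

A negative rho_6 for 'always', as in these placeholder values, would mean that the PB gradient in a five-track system is flatter in schools that always consider prior performance, which is the pattern the hypothesis predicts.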
$$
\begin{aligned}
\text{Test}_{isc} = {}& \rho_0 + \text{Student}_{isc}\,\rho_1 + \text{School}_{sc}\,\rho_2 + \text{EntrReq}_{sc}\,\rho_3 + \text{No. of Tracks}_c\,\rho_4 \\
&+ \left(\text{PB}_{isc} \times \text{No. of Tracks}'_c\right)\rho_5 + \left(\text{PB}_{isc} \times \text{No. of Tracks}'_c \times \text{EntrReq}_{sc}\right)\rho_6 \\
&+ \text{GDPpc}_c\,\rho_7 + t_c + t_{sc} + \pi_{isc}
\end{aligned}
\tag{5}
$$

where $t_c$, $t_{sc}$ and $\pi_{isc}$ are the country-, school- and student-level error terms.

Almost all student- and school-level variables have some missing observations (see the last column of Table 2). Although most variables have fewer than 3% missing values, deleting all observations with missing values would reduce the sample from around 185 000 to around 130 000 observations, and it would assume that the values are missing at random, which is a questionable assumption. Moreover, deleting observations would distort the weighting. The missing values are therefore replaced by group averages.¹⁰ To control for possible bias introduced by this replacement method, imputation dummies and imputation interactions are included in all models.¹¹
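The replacement of missing values by group averages, together with an imputation dummy, can be sketched as follows. Following note 10, missing student variables are filled with the school average and missing school variables with the country average. The field names and data are hypothetical, and the imputation interaction terms used in the models are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def impute_group_mean(rows, var, group_key):
    """Fill missing values (None) of `var` with the mean of the same group.

    Adds a dummy `<var>_imp` equal to 1 for imputed rows and 0 otherwise.
    If every value in a group is missing, the value stays None.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[var] is not None:
            observed[row[group_key]].append(row[var])
    group_means = {g: mean(values) for g, values in observed.items()}
    for row in rows:
        was_missing = row[var] is None
        row[var + "_imp"] = int(was_missing)
        if was_missing:
            row[var] = group_means.get(row[group_key])
    return rows


# Student variable: impute from the same school (hypothetical data).
students = [
    {"school": "s1", "age": 15.4},
    {"school": "s1", "age": 15.8},
    {"school": "s1", "age": None},
    {"school": "s2", "age": 16.0},
]
# School variable: impute from the same country (hypothetical data).
schools = [
    {"country": "A", "st_ratio": 12.0},
    {"country": "A", "st_ratio": 14.0},
    {"country": "A", "st_ratio": None},
]
impute_group_mean(students, "age", "school")
impute_group_mean(schools, "st_ratio", "country")
print(students[2], schools[2])
```

The dummy lets the regression absorb any systematic difference between imputed and observed cases instead of treating imputed group means as genuine observations.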

V. Results
First, we replicate the standard cross-country analysis of the effect of tracking on student performance. Second, we investigate whether using performance to select students into tracks affects the relation between tracking and student performance; we first present the results of the between-country between-school models and subsequently show the between-country and within-country results. Third, we focus on whether using performance to select students into tracks affects the relation between tracking and educational opportunities.

Direct relation between tracking and student performance
The analysis starts by investigating whether the number of tracks has a direct and significant relation with student performance. Since we include a wide variety of school background variables, which may capture part of the tracking effect, we do not expect a large coefficient for tracking. Table 3 confirms our expectations: the association between the number of tracks and performance is insignificant, with a negative point estimate for reading and positive ones for mathematics and science. For all three test subjects, attending a school that sometimes considers prior performance is negatively related to student performance, whereas attending a school that always considers prior performance is positively and significantly related to it.

¹⁰ Missing student variables are replaced by the average value for students in the same school, and missing school variables by the country average. Country variables are never missing.

¹¹ The results are robust to excluding the imputation variable interaction terms and the imputation dummies.
All control variables, which are omitted from the table, have the expected sign. The results of the full models are available upon request.

Tracking and performance
To test whether considering prior performance when selecting students into tracks affects the relation between tracking and student performance, we include interactions between whether schools consider prior performance and the number of tracks, as in Equation (1). The results are shown in Table 4. More tracks in an education system are positive for students' performance if students attend schools whose principals consider prior performance when admitting students. For reading, there is still a significant negative effect of more tracks (−5.74**), but this is compensated when schools always consider prior performance and there are four or five tracks. Since for mathematics and science the coefficient on tracking is not significantly negative (−0.88 and −2.89), more tracks are even better for students who attend schools that always consider prior performance.
To facilitate the interpretation of the interaction terms, Figure 1 shows the combined coefficients for the three models. For each combination of the number of tracks and whether schools consider prior performance, Figure 1 shows the relation with student performance, relative to students in a one-track system attending schools that never consider prior performance. The same pattern emerges for all three subjects (reading, mathematics and science): schools in multiple-track systems do better the more often they consider prior performance, while schools in comprehensive systems perform better when they do not consider prior performance. When only the significant differences in the graphs are considered, it becomes clear that for two or more tracks whether schools consider prior performance only changes the results when schools always consider prior performance: the coefficients for Never and Sometimes are not significantly different from each other when the number of tracks is two or more.
12 The models in Table 4 use the number of tracks in a country as a continuous variable ranging from zero to four. Appendix A, found in the online Supplemental material, contains non-linear models which include the number of tracks in a country as dummy variables. The results are qualitatively similar.
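To make the combined coefficients behind Figure 1 concrete, the block below gives a simplified sketch of an interaction specification of the kind described above. It is our illustration, not a reproduction of Equation (1), and all symbols are assumptions: $T_c$ is the number of tracks in country $c$ coded 0 to 4 (number of tracks minus one), $S_s$ and $A_s$ are dummies for schools that sometimes and always consider prior performance (Never is the reference category), $X_{isc}$ collects the controls, and $u_c$ is a random intercept whose level (country) is assumed here. The last three lines give the combined coefficient for each cell, relative to a one-track system ($T = 0$) with a school that never considers prior performance.

```latex
\begin{align*}
y_{isc} &= \beta_1 T_c + \beta_2 S_s + \beta_3 A_s
          + \beta_4 (T_c \times S_s) + \beta_5 (T_c \times A_s)
          + X_{isc}'\gamma + u_c + \varepsilon_{isc},\\
\text{Never}(T) &= \beta_1 T,\\
\text{Sometimes}(T) &= \beta_1 T + \beta_2 + \beta_4 T,\\
\text{Always}(T) &= \beta_1 T + \beta_3 + \beta_5 T.
\end{align*}
```

Under this sketch, the gap between Always and Never schools changes by $\beta_5$ with each additional track, so a negative $\beta_3$ can be offset in systems with many tracks.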
From Table 4 it can be concluded that students in a system with a high number of tracks do better when their school always considers prior performance, while for students in a system with a low number of tracks (two or three) whether schools consider prior performance does not seem to matter. In a system with four or five tracks, the schools that consider prior performance can place students into the available tracks based on this information, and thus in these schools class homogeneity is higher than in schools that do not obtain information on the prior performance of their students. The data suggest that placing students into only two or three tracks is not beneficial for student performance, regardless of whether schools consider prior performance. A possible explanation is that two or three tracks do not allow for enough differentiation between students of heterogeneous ability. In a one-track system, whether schools consider prior performance still matters for student performance: although students in schools that sometimes or always consider prior performance do not perform differently from each other, schools that never consider prior performance perform (marginally) better.
The results presented above will be biased if the variation in the data is affected by the sorting of students into schools. This may be the case when schools that consider prior performance on accepting students do so to select the most able students rather than to allocate students into tracks. Under full sorting into schools, one would expect schools that consider prior performance always to have better-performing students. However, this is not the case: in one-track systems, whether schools consider prior performance does not seem to matter much (it is only marginally better to be in a school that does not consider prior performance than in one that does), and in systems with only a few tracks (two or three) it does not seem to matter at all. Therefore, schools that consider prior performance when deciding whether to accept a student do not perform better by definition.
Table 3 notes: The table presents coefficients from random effects models (standard errors in parentheses) on the relation between student performance and the number of tracks in a country, controlling for whether or not schools consider prior performance when selecting students. The superscripts * and ** indicate significance at the 10% and 5% levels, respectively. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.
Table 4 notes: The table presents coefficients from random effects models (standard errors in parentheses) on the relation between student performance, whether or not schools consider prior performance when selecting students, and the number of tracks in a country. The superscripts ** and *** indicate significance at the 5% and 1% levels, respectively. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.
That the hypotheses following from full sorting do not seem to be confirmed by the data does not mean that sorting is not a problem in these analyses. Sorting to a lesser extent can still exist and could bias the results. To investigate this, we present results using only between-country variation, as shown for mathematics in Table 5. Appendix B, found in the online Supplemental material, shows the same comparison for reading and science. The first column of Table 5 replicates column (2) of Table 4 for comparison. The third column of Table 5 replaces the school-level dummies on whether schools consider prior performance with variables depicting the national proportion of students in schools that sometimes or always consider prior performance. This model excludes the possible sorting of students into schools, since this micro-level phenomenon cannot interfere with the estimation when whether schools consider prior performance is measured at the country level. The national proportion of students in schools that always consider prior performance has a negative coefficient (−134.23***); however, this is compensated for in countries with more than three tracks by the positive interaction term with the number of tracks (49.89***). 13 For an average country, where 29% of students are in schools that sometimes consider prior performance and 32% are in schools that always consider prior performance, performance in reading is best with one track, and performance in mathematics and science is best with five tracks. For a country with a high percentage of schools that always consider prior performance, students perform best in a five-track system, regardless of the subject. 14 For countries with high percentages of schools that never or sometimes consider prior performance, students perform best with two tracks, also regardless of the subject. 15 These results seem to indicate that, although sorting into schools could be a problem, it is unlikely to drive our results on its own.
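The compensation described for column (3) of Table 5 can be checked with simple arithmetic on the two reported coefficients. The sketch below assumes that the number of tracks is coded 0 to 4 (number of tracks minus one), as stated for Table 4, and that the coefficients refer to moving the national share of students in "always" schools from 0 to 1. The main effect of tracks and the "sometimes" terms are not reported in the text and are left out, so the result is only a partial combined coefficient; the function name is our own.

```python
# Coefficients reported in the text for Table 5, column (3) (mathematics):
# national share of students in schools that always consider prior
# performance, and its interaction with the number of tracks.
ALWAYS_SHARE = -134.23
ALWAYS_SHARE_X_TRACKS = 49.89


def partial_always_effect(n_tracks: int) -> float:
    """Partial combined coefficient on the 'always' share for n_tracks tracks.

    Assumption: tracks are coded as n_tracks - 1 (0-4). The tracks main effect
    and the 'sometimes' terms are not reported in the text and are omitted.
    """
    return ALWAYS_SHARE + ALWAYS_SHARE_X_TRACKS * (n_tracks - 1)


# Coded number of tracks at which the two terms cancel out.
break_even_coded = -ALWAYS_SHARE / ALWAYS_SHARE_X_TRACKS

for n in range(1, 6):
    print(f"{n} track(s): {partial_always_effect(n):+.2f}")
print(f"break-even: {break_even_coded:.2f} coded, i.e. {break_even_coded + 1:.2f} tracks")
```

With these two coefficients alone, the partial effect is negative for one to three tracks and positive for four (about +15.4) and five tracks (about +65.3); the break-even point lies at roughly 3.7 tracks, in line with the statement that the negative coefficient is compensated for in countries with more than three tracks.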
We are also concerned that country heterogeneity could influence our results. For that reason, the second column of Table 5 shows the same model but now including country fixed effects, again with mathematics as the dependent variable. The interactions are somewhat smaller (2.28** versus 2.50** and 5.11** versus 7.85**) but qualitatively similar. For all subjects, students perform best in systems with a high number of tracks when schools always consider prior performance. As also pointed out by Schuetz, Ursprung, and Woessmann (2008), a model with country fixed effects provides unbiased results for cross-country analysis, assuming that the existing country heterogeneity does not influence the interaction between whether schools consider prior performance and the number of tracks. Although this model still does not allow for a strict causal interpretation, the required assumption is considerably weaker than the assumption that no unobserved country heterogeneity exists, even with a sample of similar countries.
Table 5 notes: The table presents coefficients from random effects models (standard errors in parentheses) on the relation between student performance, whether or not schools consider prior performance when selecting students, and the number of tracks in a country, using three specifications. Column (1) shows the main model as depicted in column (2) of Table 4. Column (2) shows the same model with country fixed effects included. Column (3) measures the school variable 'school principal considers prior performance' at the national level, and thus depicts the proportion of schools (between 0 and 1) in the country which say they always or sometimes consider prior performance. The superscripts *, ** and *** indicate significance at the 10%, 5% and 1% levels, respectively. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.
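Why the interaction terms remain identified with country fixed effects can be illustrated with the within transformation, which for a linear model is equivalent to including country dummies. In the toy sketch below (hypothetical data, not PISA), the number of tracks is constant within each country and is therefore wiped out by demeaning, whereas its interaction with a school-level "always considers prior performance" dummy keeps within-country variation. This illustrates the identification logic only; it is not the estimation code behind Table 5, and the helper name is our own.

```python
import numpy as np


def within_country(x, country):
    """Demean x within each country (equivalent to country dummies in OLS)."""
    x = np.asarray(x, dtype=float)
    country = np.asarray(country)
    out = x.copy()
    for c in np.unique(country):
        mask = country == c
        out[mask] -= x[mask].mean()
    return out


# Hypothetical toy data: three countries with four students each.
country = np.repeat([0, 1, 2], 4)
tracks = np.repeat([0.0, 2.0, 4.0], 4)  # coded 0-4, constant within country
always = np.array([0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0], dtype=float)
interaction = tracks * always

print(within_country(tracks, country))       # all zeros: absorbed by country FE
print(within_country(interaction, country))  # varies within countries 1 and 2
```

Once country dummies are included, only the within-country variation in the interaction is used, which is why the fixed-effects estimates rest on the weaker assumption described above.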

Tracking and inequality
The results on whether using performance to select students into tracks affects the relation between tracking and educational opportunities are displayed in Table 6. As expected and consistent with the literature, PB has a positive and substantial relation with test scores. This association is similar across the three PISA test subjects (reading, mathematics and science) and amounts to about a quarter of a standard deviation in the test scores. The interaction between PB and tracking in columns (1)-(3) shows that tracking mitigates the association with PB, and therefore reduces inequality of opportunity: the interaction of the number of tracks and PB is negative and highly significant. The first three columns show that in a system with five tracks the association with PB is lowered by 17.3 points in reading, 16.2 in mathematics and 18.2 in science. Finding that tracking reduces inequality is not fully consistent with the literature, which most often finds that tracking increases inequality or has no effect. To further investigate the drivers of this finding, columns (4)-(6) of Table 6 show the results of the same models, now also including interactions between PB, the number of tracks and whether schools consider prior performance when deciding whether to accept the student. This allows us to check whether the effect of PB is lower in schools that consider prior performance, which are assumed to use this information for track placement. The last three columns show that the interaction between the number of tracks and PB is no longer significant; however, the triple interactions (number of tracks, PB and whether schools consider prior performance) are. The table indicates that it is primarily the schools that always consider prior performance in systems with multiple tracks that mitigate the relation of PB with performance.
Thus it is not tracking itself that diminishes the association of PB and performance but, rather, tracking combined with whether schools consider prior performance. This is consistent with our expectation that when schools consider prior performance, parents have less influence on their child's track choice and subsequent performance.
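If the number of tracks enters linearly and is coded 0 to 4, as stated for Table 4 (an assumption for Table 6), the reductions quoted for a five-track system imply a per-track PB × tracks coefficient of roughly −4 points, with intermediate systems falling proportionally in between. The short sketch below backs out these implied values from the reported reductions only; it uses no coefficient beyond those quoted in the text, and the function names are our own.

```python
# Reported reduction (PISA points) in the PB association in a five-track
# system, columns (1)-(3) of Table 6.
REDUCTION_FIVE_TRACKS = {"reading": 17.3, "mathematics": 16.2, "science": 18.2}

# Assumption: tracks coded 0-4, so a five-track system has coded value 4.
CODED_FIVE_TRACKS = 4


def implied_interaction(reduction: float) -> float:
    """Per-track PB x tracks coefficient implied by a linear 0-4 coding."""
    return -reduction / CODED_FIVE_TRACKS


def reduction_at(n_tracks: int, reduction_five: float) -> float:
    """Implied reduction in the PB association relative to a one-track system."""
    coded = n_tracks - 1
    return -implied_interaction(reduction_five) * coded


for subject, r in REDUCTION_FIVE_TRACKS.items():
    per_track = implied_interaction(r)
    path = [round(reduction_at(n, r), 1) for n in range(1, 6)]
    print(f"{subject}: {per_track:.2f} per track; reductions for 1-5 tracks: {path}")
```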

VI. Robustness
In this section, we consider other factors that could distort our analyses: heterogeneous effects across PB groups, the measure of tracking used, and the sample of countries. We report all robustness checks with the PISA mathematics score; unless otherwise stated, the results for reading and science are robust to the various checks.
16 The results which are not shown are available on request.
Table 6 notes: The table presents coefficients from random effects models (standard errors in parentheses) on the relation between student performance and parental background, whether or not schools consider prior performance when selecting students, and the number of tracks in a country. The superscripts *, ** and *** indicate significance at the 10%, 5% and 1% levels, respectively. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.

Sample split on PB
Table 6 indicates that the differences between students of low and high PB are minimized in a highly differentiated system. To investigate whether the effects of tracking, and of whether schools consider prior performance on accepting students, on equality of opportunity indeed differ for students of different socio-economic background, we estimate models for the subsamples of low- and high-PB students. Table 7 shows the results. For students with high PB (column (2)) the relation between tracks and performance is negative, irrespective of whether schools consider prior performance. For low-PB students (column (1)), the number of tracks alone does not alter educational opportunities, but the triple interactions (PB, number of tracks and whether schools consider prior performance) show that when schools always consider prior performance, a high number of tracks is beneficial for low-PB students, since their disadvantaged background has less of an effect on their performance.

Timing of tracking
Unlike some other papers on this topic, we use the number of tracks available to students at the age of 15 to characterize a country's tracking regime. Since the PISA test is taken when students are between 15 and 16, it is possible that, although multiple tracks are available to students at the age of 15, students have not yet been tracked for a substantial amount of time. If this is the case, the association with tracking in late-selection countries may be too weak to be picked up. To check for this, we redo the analysis including interactions between whether a country selects early, the number of tracks, and whether school principals consider prior performance. Table 8 shows the results with the mathematics score as the dependent variable, using three definitions of early tracking: tracking before the age of 13, before 14 or before 15. Table 8 shows that our results hold only for students in early-tracking countries, since only the triple interaction of early tracking, the number of tracks and prior performance being always considered is significant. However, if we use the reading or science score, the interaction between the number of tracks and prior performance being always considered is also significant (not shown). Our results therefore seem to be strongest for countries that track early, but are not absent for countries that track later.
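The triple interaction in Table 8 can be built as in the sketch below, which constructs the early-tracking dummy under each of the three cutoffs (13, 14 and 15) together with the associated interaction terms. The helper name, the input arrays and the inclusion of every lower-order term are our assumptions (a standard full-factorial setup); the text does not list the exact terms entering the Table 8 models.

```python
import numpy as np

CUTOFFS = (13, 14, 15)  # early tracking = first selection before this age


def early_tracking_terms(first_tracking_age, tracks_coded, always, cutoff):
    """Return the early x tracks x always interaction and its lower-order terms.

    Hypothetical helper: all inputs are per-student arrays of equal length;
    `always` is 1 if the student's school always considers prior performance.
    """
    early = (np.asarray(first_tracking_age) < cutoff).astype(float)
    tracks = np.asarray(tracks_coded, dtype=float)
    always = np.asarray(always, dtype=float)
    return {
        "early": early,
        "tracks": tracks,
        "always": always,
        "early_x_tracks": early * tracks,
        "early_x_always": early * always,
        "tracks_x_always": tracks * always,
        "early_x_tracks_x_always": early * tracks * always,
    }


# Toy example with four hypothetical students (not PISA data).
ages = [10, 12, 14, 15]
tracks = [4, 3, 2, 1]
always = [1, 1, 1, 0]
for cutoff in CUTOFFS:
    terms = early_tracking_terms(ages, tracks, always, cutoff)
    print(cutoff, terms["early"], terms["early_x_tracks_x_always"])
```

Moving the cutoff from 14 to 15 reclassifies the student whose country first tracks at 14 as early, which switches on that student's triple-interaction term; this is the sense in which the three columns of Table 8 differ.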

Excluding countries without tracking
Even though we estimate country fixed effects models alongside our main models in Table 4, there could still be concern about differences among countries with different tracking regimes that could affect the relation between tracking and student performance, and the relation between PB, tracking and student performance. Of particular interest is whether countries that do not track at all are very different from those that do. Therefore, we have also estimated our models without the countries that do not track their students. This leaves us with 21 countries and 126 333 students. The results are shown in Table 9. When we exclude countries that do not track, the coefficients are even larger than in Table 4. Only the coefficients from the between-countries model (column (3)) are very different from before: they are closer to zero and far from significant. These results are very similar across the three test subjects. For the model on PB that includes only the interaction between PB and the number of tracks (column (4)), the coefficient is half its previous size and is no longer significant. When the triple interaction is included (column (5)), however, the results are very similar to Table 6. Thus the PB results do not seem to be very different when non-tracking countries are excluded.
Table 7 notes: The table presents coefficients from random effects models (standard errors in parentheses). Columns (1) and (2) replicate column (5) of Table 6, but column (1) uses only low-PB students, while column (2) uses only high-PB students. The superscript ** indicates significance at the 5% level. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.

Excluding other countries
Using dummies for the number of tracks, as in Appendix A found in the online Supplemental material, reveals that especially countries with only a few tracks perform worse. Among these, the three countries with two tracks (Israel, Greece and Chile) are among the worst-performing countries in PISA reading, mathematics and science. However, excluding these countries does not influence the results much (not shown).
Table 8 notes: The table presents coefficients from random effects models (standard errors in parentheses) on the relation between student performance, whether or not schools consider prior performance when selecting students, and tracking in a country. Column (1) defines a country as early tracking if students are tracked before the age of 13, Column (2) uses before the age of 14, and Column (3) uses before the age of 15. The superscripts ** and *** indicate significance at the 5% and 1% levels, respectively. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.
Table 9 notes: The table presents coefficients from random effects models (standard errors in parentheses) on the relation between student performance and parental background, whether or not schools consider prior performance when selecting students, and the number of tracks in a country. The superscripts *, ** and *** indicate significance at the 10%, 5% and 1% levels, respectively. The control variables are as described in the text. The models include imputation dummies and imputation variable interaction terms.

VII. Conclusion
The variation in tracking in education systems throughout the Western world is quite large: many countries have no tracking in secondary school (although most have some form of non-institutionalized tracking), while some countries distinguish up to five tracks. The manner in which tracking is implemented at the school level also differs widely. Some countries put more emphasis on the use of performance to place students into the available tracks (e.g. the Netherlands), while in other countries parents have more influence on the track their child will go to (e.g. Germany). We argue that the inconsistencies in the empirical results found in the literature could be explained by country differences in track placement. When track placement is based not on an ability measure but mainly on PB or, relatedly, on residential area, the theoretical benefits of tracking might not arise. These theoretical benefits rely heavily on the idea that tracking leads to more homogeneous ability classes, which the teacher is better able to teach and in which students benefit from a curriculum tailored to their needs and abilities. When students are placed into tracks based not on prior performance but on PB, classes will be more heterogeneous and students might not be taught a fitting curriculum. Furthermore, when PB is used to decide on track placement, educational opportunities are affected and inequality is likely to grow.
To study whether using performance to select students into tracks has an effect on the relation between tracking and student performance and educational opportunities, we use data on around 185 000 students in 31 comparable countries from PISA 2009. Prior performance can be thought of as an important observed measure of student ability. Schools that have information on prior performance can therefore use it to allocate students across tracks, allowing for a better match between student ability and track level, which benefits the student both by allowing the student to learn more and by limiting the effect of PB on student performance. We show that tracking in general does not have a direct relation with performance. However, interactions between tracking and whether schools consider prior performance reveal that students in highly differentiated systems perform best when schools always take prior performance into account in deciding on student acceptance. In systems with a low number of tracks, whether schools consider prior performance has less of an impact.
The association between PB, tracking and student performance shows that equality of opportunity is best provided for in a system with a high number of tracks combined with schools that always consider prior performance when deciding whether to accept a student. It turns out that for high-PB students in these systems, tracking weakens the positive relation between PB and performance, whereas for low-PB students the (for them negative) relation between tracking and performance is weakened primarily when they attend schools that always consider prior performance. Thus it seems that high-PB students might be harmed by tracking when schools consider prior performance. The result that tracking does not increase inequality of opportunity if track placement is based on prior achievement has recently been confirmed by Esser and Relikowski (2015). The authors compare student outcomes in two German states, one with strict assignment rules for track placement and one with less strict rules, and since they are able to control for prior achievement directly, they can rule out selection into schools.
We argue that it is not straightforward to determine whether tracking in itself has a positive or negative effect on performance. When education system characteristics are studied, it should be taken into account that schools can have a large influence on the implementation of these characteristics, so that heterogeneous effects across schools can arise. We show that when tracking is combined with schools considering prior performance in accepting students, tracking benefits both student performance and educational opportunities.
When more data become available, these findings could be replicated controlling for individual prior performance (as in Esser and Relikowski 2015) and including more elaborate and specific information on the mechanism by which students are selected into tracks.
the seminar participants of the CPB Netherlands Bureau for Economic Policy Analysis, the Journées Louis-André Gérard-Varet 2012, ESPE 2012, the Maastricht University economics of education lunch and the NWO-PROO group for useful suggestions.

Disclosure statement
No potential conflict of interest was reported by the authors.