Predicting the Kentucky Derby Winner! Sort of

Through a series of explorations, this article will demonstrate how the Kentucky Derby winning times dataset provides various opportunities for introductory and advanced topics, from data processing to model building. Although the final goal may be a prediction interval, the dataset is rich enough for it to appear in several places in an introductory or second course in statistics. After adjusting for the change in track length and track condition, winning speed has an interesting nonlinear trend, with one notable outlier. Student investigations can range from validating the phrase “the most exciting two minutes in sports” to predicting the winning speed of next year’s race using parallel polynomial models.


Introduction and Background
The Kentucky Derby is an annual horse race run at Churchill Downs in Louisville, KY, USA, on the first Saturday in May, timed well for when we are often first discussing regression in my introductory course or prediction intervals in my regression course. In 1875, 10,000 people gathered for the first horse racing spectacle in the US. Now, the Derby is the longest running sports event in America, and brings in a crowd of over 150,000 each year, more than the Super Bowl and the World Series. Although viewership has been in slight decline in recent years (about 16 million viewers tune in to NBC), betting totals are constantly on the rise. Wagering on the race alone brings in $150 million. As the fastest breed of horses, thoroughbreds can maintain a speed of 45 mph for over a mile, and thus the race is known as the "Most Exciting Two Minutes in Sports." Since 1949, every winning time has been within 3 sec of the average of 2:02.34. It seems that thoroughbreds have plateaued, but humans are still improving running speeds; so, is there a limit to how fast an animal can run?
I originally encountered this context in multiple regression homework exercises in The Statistical Sleuth. The third edition of the text (Ramsey and Schafer 2013) uses data from 1896 to 2011, with speed already computed for the students. I have continued to add to the dataset, primarily using data collected from kentuckyderby.com. Currently, it is not feasible to send students directly to the site to access the data: the historical data have become harder to find and often cannot be copied and pasted, and some browsers may block churchilldowns.com. The DoubleCone package (Meyer and Sen 2017) in R has the same variables from 1896 to 2012. The website horseracingdatasets.com includes some of the same variables and some additional variables as a GoogleDoc (Raceday 360 2015).
The contribution of this article is access to an annually updated version of the dataset (all runnings of the race since 1875) and detailed discussion of how it may be used to discuss additional topics such as data processing, polynomial regression, transformations, and centering, in a manner that encourages students to take some ownership of the data.

Classroom Use
Rather than use this dataset as a stand-alone class activity, I tend to sprinkle it into the course at various points to illustrate key ideas and reinforce broader topics. For example, in the regression course with 35 students, each at a computer, I have students get their hands dirty with the data in an interactive class discussion. The first introduction early in the course focuses on preprocessing the data. Later, we return several times to the cleaned dataset when building regression models. Alternatively, the data lend themselves well to a series of homework problems across two or three assignments. The following sections illustrate how the dataset can be applied across several topics, with possible extension questions depending on the exact material being covered in your course. See the Appendix for sample class examples and homework problems.

Data Processing
With a focus on predicting the winning time in a particular year, students first examine a scatterplot of the winning times versus year. They immediately notice a large gap in the times (Figure 1). This is easily explained by the track length changing in 1896 from 12 furlongs (1.5 miles) to 10 furlongs (1.25 miles). Students can focus on the distribution of winning times in more recent years (which have been surprisingly stable, especially if we focus only on the fast track conditions). This can be contrasted with other events (e.g., Olympic sports) where records continue to be set, though often due to changes in venue and equipment.
Students can also be asked to create a new variable that computes the winning horse speeds, adjusting for the change in track length.
TEACHING TIP: See whether students can data sleuth and conjecture the explanation before you tell them about the change in track length (have them explore the data on their own, have them search online about the race). Deriving the formula for speed will be nontrivial for many students. Students can also use Excel to convert the times separately before and after 1896. Students should be able to confirm that the resulting values make sense in context (e.g., around 35 miles per hour). You can also provide students with the formula but ask them to explain why it works.
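The speed conversion described above can be sketched in a few lines. This is a minimal illustration, not the formula as distributed with the dataset: the function and constant names are hypothetical, and it assumes times are recorded in seconds and that 1896 is the first race run at 1.25 miles.

```python
# Race distances in miles; the 1896 change from 1.5 to 1.25 miles is from the text.
LONG_TRACK_MILES = 1.5
SHORT_TRACK_MILES = 1.25
CHANGE_YEAR = 1896  # assumed to be the first year at the shorter distance
SECONDS_PER_HOUR = 3600


def track_length_miles(year):
    """Return the race distance in miles for a given year."""
    return LONG_TRACK_MILES if year < CHANGE_YEAR else SHORT_TRACK_MILES


def winning_speed_mph(year, time_sec):
    """Convert a winning time in seconds to an average speed in miles per hour."""
    hours = time_sec / SECONDS_PER_HOUR
    return track_length_miles(year) / hours


# Secretariat's 1973 record of 1:59 2/5 is 119.4 seconds.
secretariat_speed = winning_speed_mph(1973, 119.4)
# A hypothetical 160-second winning time on the longer pre-1896 track.
early_speed = winning_speed_mph(1890, 160.0)

print(round(secretariat_speed, 2))
print(round(early_speed, 2))
```

Both values land in the mid-30s in miles per hour, which is the kind of sanity check students can be asked to make before moving on.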
You can also ask students to compare the speeds before and after the track length change to determine whether the track length appears to impact speed. Students need to be aware of the confounding between year and track length through the remainder of the analyses. This analysis could point to deleting the years before the track change from the dataset instead.
For 2019, you could debate including the time of the fastest horse (later disqualified) or the time of the winning horse.

Initial Explorations
Students can be asked to subset the data after the change in track length and compare the distribution of winning times to the claim of the "Most Exciting Two Minutes in Sports." They can also be asked to conjecture reasons for the variability in winning times from year to year, even after subsetting to the same track length.
TEACHING TIP: A one-sample t-test (H₀: μ = 120 sec) using the times after 1896 is actually highly significant, as only two horses have finished in under two minutes (Secretariat in 1973 and Monarchos in 2001). The questionable validity of such a test on census data could lead to a discussion of population versus process: does the p-value still measure the strength of evidence against this type of data occurring by chance alone from a random process?
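For students working outside a menu-driven package, the test statistic can be computed by hand. The sketch below uses only the Python standard library and returns the t statistic and degrees of freedom; the p-value would still come from a t table or statistical software. The function name is hypothetical and the five times are made up purely for illustration; they are not Derby results.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """Return the t statistic and degrees of freedom for H0: mean = mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Made-up winning times in seconds, for illustration only.
example_times = [121.0, 122.0, 123.0, 124.0, 125.0]
t_stat, df = one_sample_t(example_times, 120.0)
print(round(t_stat, 3), df)
```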
One variable students have access to is "track condition" as announced by the race organizers at the time of the race. There are currently over seven different track conditions, so students can also be asked to recode these as "fast," "good," and "slow." Simple graphical explorations show that the winning speeds are related to these track conditions in the expected ways, but students might already begin to question the unequal sample sizes of these groups (see Section 7).
TEACHING TIP: Consider giving students the opportunity to form their own opinions on how to recode the data! (I have put "dusty" in the "good" category, but most of the other categories were grouped into "slow.") POTENTIAL PITFALL: If this is a homework assignment and you want students to all produce the same output, then be prescriptive in telling them how to create the categories and be ready to tell students how many races they should have in each category to check their work before they continue their analysis.
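One way to make the recoding prescriptive is to hand students an explicit lookup table and have them report the resulting counts. The sketch below is only an assumption about how the raw labels might look: apart from "fast," "good," "slow," and "dusty," the label names are guesses and should be replaced with whatever appears in the file. The helper names are hypothetical.

```python
from collections import Counter

# Assumed raw labels; the actual file may spell or capitalize them differently.
RECODE = {
    "fast": "fast",
    "good": "good",
    "dusty": "good",   # grouping stated in the text
    "slow": "slow",
    "sloppy": "slow",  # assumed label
    "muddy": "slow",   # assumed label
    "heavy": "slow",   # assumed label
}


def recode_condition(raw):
    """Map a raw track condition label to fast, good, or slow."""
    key = raw.strip().lower()
    if key not in RECODE:
        raise ValueError(f"unrecognized track condition: {raw!r}")
    return RECODE[key]


def category_counts(raw_conditions):
    """Count races in each recoded category so students can check their work."""
    return Counter(recode_condition(c) for c in raw_conditions)


# A made-up handful of labels, for illustration only.
print(category_counts(["Fast", "dusty", "muddy", "fast", "good"]))
```

Raising an error on an unknown label, rather than silently assigning a default, forces students to look at every category that actually occurs in the data.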
Another variable students have access to is the surface of the track. You can ask students to consider whether surface is likely to be a useful variable in predicting speed, but preliminary exploration of the data reveals that the surface at Churchill Downs has always been dirt. Not only will the analysis not run with a constant variable (SD = 0), but students should also be reminded from earlier class discussion that slopes are estimated more precisely (with smaller standard errors) for variables with more variation. They can consider the benefits of gathering data for more surfaces, with the drawback of losing the homogeneity of a single track. This is an extreme example, but it is an important reminder to examine a variable before blindly adding it to the model. I also sometimes ask students the obvious question of whether these data can be used to predict performance on a different type of surface.
The Sleuth data includes number of starters and students can investigate whether the number of horses in the race affects the winning speed. Another variable is pole position. Students may be interested in exploring the distribution of the pole positions of the winners. Another interesting question is how to take into account the number of starters in this comparison.
In the full dataset, even after adjusting for track length, there is a notable outlier, the 1891 race won by Kingman. Students can do some sleuthing, but other than the jockey being the only black jockey that year and possible collusion by the other jockeys to slow down the race, there are not good explanations for this record slow winning time (Churchill Downs Incorporated 2018).
TEACHING TIP: In the regression models discussed below, the 1891 race does stand out as having a large Cook's Distance compared to the other races, but smaller than 0.50, one of the commonly suggested cut-offs. Students can also perform a test of significance using the studentized residual for the outlier, which is highly significant even after a Bonferroni adjustment. Students can be asked to explain why the Cook's D value is not especially large when the winning speed is so unusual. It turns out that the observation does not have enough leverage to be highly influential.
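The interplay between residual size and leverage can be made concrete with a toy example. The sketch below computes Cook's distance for a simple linear regression from the standard formula D = (e² / (p·s²)) · h / (1 − h)², where h is the leverage and p = 2. The data are invented (eleven points on a line with one point pulled down by 4 units) and are not the Derby data; the function name is hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a simple linear regression of y on x."""
    n = len(x)
    p = 2  # intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


x = list(range(11))  # toy predictor values 0, 1, ..., 10

# Same 4-unit drop applied at the middle of the x range and at its edge.
y_middle = [float(v) for v in x]
y_middle[5] -= 4.0
y_edge = [float(v) for v in x]
y_edge[10] -= 4.0

d_middle = cooks_distances(x, y_middle)[5]
d_edge = cooks_distances(x, y_edge)[10]
print(round(d_middle, 2), round(d_edge, 2))
```

In this toy setting the mid-range outlier has Cook's D of about 0.45, while the identical drop at the edge of the x range gives about 2.1. An unusual response with low leverage can therefore stay below a 0.50 cut-off.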

Simple Linear Regression
Once the response variable of speed has been created, students clearly see that the relationship between speed and year is curved and/or reaching limiting behavior and students can provide explanations for that behavior (Figure 2). TEACHING TIP: This is a good use of a smoother to see the form of the association as well.
If a linear model is used, residual plots (Figure 2) easily show the curvature as well, but also some evidence of decreasing variation in the residuals over time. The normal probability plot of the residuals is reasonably linear, though a test of normality will provide a small p-value (in large part due to the outlier). TEACHING TIP: Students can be reminded that although residual plots help identify an issue, they are not always helpful in pointing to a specific choice of remedy.
Students can consider whether a transformation on speed or time is likely to be successful. In particular, a transformation on the response variable could remedy all three violations (curvature, unequal variance, and nonnormality). One observation is that neither variable changes by an order of magnitude (e.g., the largest year is not even 5 times the smallest year), which limits how much a simple transformation can accomplish.
POTENTIAL PITFALL: Another model assumption to consider is the independence of the observations. These are time series data. Ignoring autocorrelation typically leaves the least squares coefficient estimates unbiased but makes the usual standard errors unreliable, so tests and intervals can be misleading. I actually ignore this assumption when I use these data in my introductory and second courses for nonmajors.

Polynomial Regression (With Centering) and Transformations (With Shifting)
A quadratic model actually fits the data reasonably well (Figure 3) and both the linear and quadratic terms are significant. The best feature is asking students to conjecture why the speeds would increase at a decreasing rate, as represented by a positive coefficient on year and a negative coefficient on year². Perhaps changes in technology have improved speeds but at a decreasing rate due to a physical limit of performance.
TEACHING TIP: Ask students to use the equation to make several predictions at different years, and see how the predicted increase in speed is decreasing as the year² term begins to dominate. However, students will be able to point out a weakness in the quadratic model: it assumes that this trend will continue (i.e., it predicts that winning speeds will eventually decrease). Students can also be asked to interpret the intercept of the equation and how it is too much of an extrapolation for these data to be meaningful.
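The shrinking year-to-year gain, and the eventual downturn, follow directly from the algebra. The symbols below are introduced only for this illustration: t is the year (or centered year), and b_0, b_1, b_2 are the fitted coefficients, with the signs described above.

```latex
\begin{align*}
\hat{s}(t) &= b_0 + b_1 t + b_2 t^2, \qquad b_1 > 0,\; b_2 < 0 \\
\hat{s}(t+1) - \hat{s}(t) &= b_1 + b_2\,(2t + 1) \\
t^{*} &= -\frac{b_1}{2 b_2}
\end{align*}
```

Because b_2 is negative, the one-year change in predicted speed falls linearly in t, and predicted speed peaks at t* before declining. Asking students to compute t* from their own fitted coefficients shows how far the downturn lies from the observed years.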
The polynomial model has large variance inflation factors indicating a strong correlation between year and year², so this provides a good opportunity to discuss how centering variables can remove that multicollinearity (and reduce the standard error of the year slope coefficient) as well as provide a more meaningful intercept.
TEACHING TIP: Although students can immediately understand why the multicollinearity is high between year and year², it is more difficult to help them see why it is not between centered_year and centered_year². Have them create the scatterplots of the two pairs of variables and think about how it is linear association that's problematic, not merely association, when fitting the response surface.
HELPFUL HINT: Some software packages (e.g., JMP) will automatically center the quadratic term but not the linear term. This makes interpretation of the intercept difficult. So I usually have the students first center the variable. Make sure students center the variable (or standardize) before squaring, rather than squaring and then centering.
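The effect of centering, and of the order of operations, can be checked numerically without any response data. The sketch below uses the years 1875 through 2017 and a hand-written Pearson correlation; the variance inflation factor shown is the two-predictor special case, 1 / (1 − r²). All function and variable names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1 / (1 - r ** 2)


years = list(range(1875, 2018))  # 1875 through 2017
mean_year = sum(years) / len(years)

# Raw year versus raw year squared.
squares = [yr ** 2 for yr in years]
r_raw = pearson_r(years, squares)

# Center first, then square.
centered = [yr - mean_year for yr in years]
r_centered = pearson_r(centered, [c ** 2 for c in centered])

# Square first, then center: subtracting a constant leaves the correlation unchanged.
mean_square = sum(squares) / len(squares)
r_square_then_center = pearson_r(years, [s - mean_square for s in squares])

print(round(r_raw, 5), round(r_centered, 5), round(r_square_then_center, 5))
print(round(vif_two_predictors(r_raw)))
```

With equally spaced years the centered values are symmetric about zero, so the correlation between centered_year and its square is essentially zero. If the data are subset unevenly, the correlation will be small but not exactly zero.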
In order for a log transformation to work well, the data should first be shifted (to have more change in magnitude in the variable and so the "curve" associated with the log function will match appropriately the location of the curve in the data). Students can subtract 1874 from each year and then compare the performance of the model of speed versus log(year − 1874) to the quadratic model (Figure 4). The graphs will not provide a clear "winner" between the two models but students can be asked to think about the curved relationship vs. limiting function, especially in predicting future observations. TEACHING TIP: Students can compare R² and s between the models as we have transformed the explanatory variable rather than the response variable. (Other fit measures such as AIC, BIC, and PRESS can also be compared.) Have students practice interpreting the slope and the intercept with this shift and log transformation. The residual plots do look much better for the quadratic model.

Categorical Predictors
A scatterplot of the winning speeds versus year, coded by track condition (perhaps after recoding some of the categories), provides a nice visual display that makes sense to students (Figure 5). It is also informative to have students overlay the model for each track condition and to think about the relative distances between the speed predicted in a year for each condition (e.g., good and fast are more similar to each other than slow, but there hasn't been a slow track since 1947).
POTENTIAL PITFALL: Creating a visual like Figure 5 is trivial in some computer packages but will require careful instruction in others.
Students can also be asked to think about how the model is assuming the same relationship for each track condition, even the track conditions with only a few observations. We are able to make predictions for those rare track conditions, as long as we are willing to make that assumption. This is good practice for interpreting regression coefficients from effect or indicator parameterization and comparing the different p-values. This is a good use of a partial F-test for assessing the significance of the track condition variable, after adjusting for the year of the race. Students can compare the performance of this test before and after recoding the track condition to a smaller number of categories (and consider the "cost" of the extra degrees of freedom with more conditions).
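The partial F-test mentioned above compares the error sums of squares of two nested models, for example year and year² alone versus the same model with track condition added. A minimal sketch of the arithmetic follows; the function name is hypothetical and the sums of squares and degrees of freedom are invented numbers, not output from the Derby models.

```python
def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """Partial F statistic for nested models.

    df_reduced and df_full are the residual (error) degrees of freedom.
    Returns the F statistic with its numerator and denominator df.
    """
    extra_terms = df_reduced - df_full
    if extra_terms <= 0:
        raise ValueError("the full model must use more parameters than the reduced model")
    numerator = (sse_reduced - sse_full) / extra_terms
    denominator = sse_full / df_full
    return numerator / denominator, extra_terms, df_full


# Invented values: adding a three-level factor uses 2 extra degrees of freedom.
f_stat, df_num, df_den = partial_f(
    sse_reduced=120.0, df_reduced=50, sse_full=100.0, df_full=48
)
print(round(f_stat, 2), df_num, df_den)
```

Recoding to fewer categories reduces the numerator degrees of freedom, which is the "cost" students can see directly by rerunning the calculation with their own output.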
POTENTIAL PITFALL: Many software packages now allow you to include a categorical variable directly, without needing to create indicator variables. However, students will often struggle for a while in making sense of the different parameterizations (e.g., 0-1 coding with a reference group vs. −1/1 coding reporting k − 1 of the effects).
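Writing out the design columns by hand can help students see what the software is doing. The sketch below builds both codings for an assumed three-level factor. The choice of "fast" as the reference level and "slow" as the omitted level is arbitrary here; packages differ in which level they drop.

```python
LEVELS = ["fast", "good", "slow"]  # assumed three-level recoding


def indicator_coding(level, reference="fast"):
    """0/1 columns for every level except the reference level."""
    return [1 if level == other else 0 for other in LEVELS if other != reference]


def effect_coding(level, omitted="slow"):
    """-1/0/1 columns; the omitted level is coded -1 in every column."""
    if level == omitted:
        return [-1] * (len(LEVELS) - 1)
    return [1 if level == other else 0 for other in LEVELS if other != omitted]


for level in LEVELS:
    print(level, indicator_coding(level), effect_coding(level))
```

With indicator coding each coefficient compares a level to the reference level. With effect coding each reported coefficient compares a level to the unweighted average of the level means, and the omitted level's effect is the negative of the sum of the others.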
Students can also explore the nature of an interaction between track condition and year. The interaction is statistically significant (another good partial F-test example), and students can describe how the "effect" of track condition is narrowing over time, especially for the slow condition.
TEACHING TIP: Here it would be appropriate to compare the R² values (for example) across the models with different codings for track conditions, but also caution your students on overfitting the data.
POTENTIAL PITFALL: I have usually found this context of sufficient interest to students that they don't mind repeated looks at the same data (about 10% of my regression students will list this context as their favorite of all the data examples we look at in the course, third to sports data and the Donner Party), but it is important not to overdo this.

Prediction Intervals
Students can be asked to predict the winning speed for the next year's race, though students should also realize that this prediction is not very helpful in predicting the actual winner! Students can be asked to do this for different track conditions (which won't be known very far in advance) to compare the results. They will also notice that 95% and 99% prediction intervals include all the recent races and falling inside this prediction interval is not particularly impressive. After the race is run, students can compare their predictions to the actual outcome (converted to speed).
TEACHING TIP: Ask students to reconsider the residual plots and model assumptions (e.g., normality) before paying too much attention to these prediction intervals. Students can also compare the performance of the model with and without the race from 1891.
It should be pointed out that there are still some issues with the model assumptions, namely the heterogeneity in the residuals over time. Students can be asked to consider what impact this has on their predictions, for example, perhaps giving wider intervals for the more recent years than necessary. They can also be reminded that, for a confidence interval of the average time, the normality issue is less of a concern when the sample size is large, whereas normality of the population is still a requirement for validity of the prediction interval.
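The contrast between the two intervals is visible in the standard formulas for a least squares model. The notation is introduced here for illustration only: x_0 is the row of predictor values for the new race, X is the design matrix, s is the residual standard error, n − p is the residual degrees of freedom, and t* is the corresponding t critical value.

```latex
\begin{align*}
\text{Confidence interval for the mean response:}\quad
  & \hat{y}_0 \pm t^{*}_{n-p}\, s \sqrt{x_0^{\top} (X^{\top} X)^{-1} x_0} \\
\text{Prediction interval for a single race:}\quad
  & \hat{y}_0 \pm t^{*}_{n-p}\, s \sqrt{1 + x_0^{\top} (X^{\top} X)^{-1} x_0}
\end{align*}
```

The extra 1 under the square root is the variance of an individual race around its mean. It does not shrink as n grows, which is why the shape of the error distribution continues to matter for the prediction interval. A single pooled s also gives similar interval widths in every era, so if the residual variation has decreased over time the recent intervals may be wider than needed.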
Again the issue can be raised that although this model explains a lot of the variation since the first running of the race, there is still a relatively large amount of unexplained variation in the past decade that these models are not sensitive to. Students can also consider which years are most informative in building this prediction interval; should we look at all races, races with a certain number of starters, or races in the most recent years? Students can use the prediction interval to decide whether Secretariat's record speed (1:59 2/5, 1973) is considered an outlier.

Conclusions
Over this series of small explorations students have witnessed, in a genuine context, several important modeling principles, including:
• The need for preliminary examination of the data and data preprocessing steps such as re-expressing variables and eliminating variables with too little variation.
• Awareness of model assumptions (e.g., speed is not related to track length, using the same model across track conditions with fewer, older observations).
• Consideration of multiple strategies for modeling nonlinear data, strengths and weaknesses of each approach, and how to interpret those models in context:
  - The ability of a quadratic model to "turn" with the data, though also consideration of how an ultimately decreasing trend may not make sense in context.
  - The ability for transformations to simplify relationships, especially for variables that change by an order of magnitude, and the ability of a shift in the data to improve the fit of the model.
  - The ability of power transformations for monotonic data to provide a model of observations approaching a limit and when that could make sense in context.
  - A reminder that not all behavior can be easily described by a simple mathematical model.
• Advantages of centering or standardizing variables to improve the fit of a model (reduce multicollinearity in polynomial models) and the interpretability of a model.
• Use of categorical variables, including considering issues associated with choice of number of categories, and assumptions in parallel curves model.
• Consideration of an outlier and its impact on the model.
• Consideration of violations in model assumptions, why they are more critical in some situations than others, and the impact on the conclusions drawn from the model.
• Consideration of subsetting the data to utilize data of most relevance to a particular prediction (sometimes using all the data is not the best strategy).
• Application, interpretation (including with transformations), and critique of prediction intervals in this context. Ability to compare predictions to subsequent real-world outcomes.
Different elements can be selected and perhaps even expanded upon depending on the particular course content being covered. For example, in a data science course, students can design a scraping algorithm to extract the data from the website directly. Or in a time series course, students could consider the implications of using "time" as the independent variable and explore the statistically significant autocorrelation.
(b) Do you notice any unusual features? Can you suggest an explanation? What would you suggest as the next step in modeling these data? (c) In 1896, the race changed from 1.

HW 2 Problem
The KYDerby16.txt file contains information on the 142 Kentucky Derby races held on the first Saturday of May every year since 1875. The race is known as the "Most Exciting Two Minutes in Sports," and is the first leg of racing's Triple Crown.
(a) Produce (and include) a scatterplot of the winning time (in seconds) vs. year. What is the overall pattern in these times over the years? Why does this make sense in this context? (b) The Derby was first run at a distance of 1.5 miles but in 1896 the distance changed to its current 1.25 miles (2 km confidence. (f) There is one unusual observation in the dataset but there is no good explanation for removing this observation. Is this race influential in this model?
Return to the (cleaned) Kentucky Derby data set.
(a) Fit a quadratic model to predict speed from year and year² and track condition. How much of the variation in speeds is explained by the track condition, after adjusting for the year? (Include appropriate calculations.) (b) Is the track condition factor statistically significant after adjusting for year? (Include a detailed statement of the null and alternative hypothesis, test statistic, degrees of freedom, p-value, and conclusion.) Hint: In JMP, see the "Effect Tests" output.
There are difficulties with using so many track conditions, e.g., some only have a few observations, it takes more degrees of freedom, and some of these descriptors aren't really used any more. So let's group some of the categories together. Report the number of observations in each of the three categories.
(c) Refit the model using the simplified track condition variable. Compare the adjusted R² of the two models. Does this simplified model perform about as well as the more complicated model? Be sure to look at the fitted line plot. Interpret the three condition coefficients and why they make sense in this context. (d) Use the model from (c) to predict the winning speed in 2017 on a fast track, with 99% confidence. How does it compare to your interval from problem 1? (Discuss both the midpoint and the width.) (e) Now fit a model that includes interaction terms between Year and condition. Interpret the signs of the two interaction terms. What do they imply in this context? Would it be reasonable to remove both interaction terms from the model, or are we convinced at least one "population" slope is not zero? (State appropriate null and alternative hypotheses, one test statistic, df, p-value, and state your conclusion in context.)
Name: KYDerby17.txt (tab delimited)
Type: Census data on all runnings of the Kentucky Derby
Size: 143 observations, 5 variables
Year = year of race (1875-2017)
Winner = First place winner (Horse name)
Surface = surface of track (always dirt for this venue)
Condition = rating of condition of track at start of race
Starters = number of horses lined up at the start of the race (including last minute scratches in many cases and horses that did not finish)
PolePosition = the winning horse's starting gate
Time = winning time of the race (sec)
Speed = speed of the winning horse (miles per hour)