Predicting body weight of South African Sussex cattle at weaning using multivariate adaptive regression splines and classification and regression tree data mining algorithms

ABSTRACT The use of multivariate adaptive regression splines (MARS) and classification and regression tree (CART) algorithms to estimate the live body weight at weaning age of the Sussex cattle breed remains poorly understood in South Africa. This study was conducted to examine the effect of linear body measurements on body weight at weaning using the MARS and CART algorithms. Body weight (BW) and linear body measurements, including sternum height, withers height, heart girth, hip height, body length, rump length and rump width, were collected from 101 Sussex cattle (female = 57 and male = 44) at weaning. Goodness-of-fit criteria were used to select the best data mining algorithm. The results showed that MARS had higher predictive performance on these criteria than the CART algorithm. The findings of the study suggest that the MARS algorithm can be used to estimate BW at weaning age in the Sussex cattle breed. These findings might help cattle farmers in setting selection criteria for breeding stock at weaning age.


Introduction
The Sussex is a beef cattle breed from the Weald of Sussex, Surrey and Kent in south-eastern England, characterized by a red to dark-red coat with a distinctive white tail switch (Annelie 2014). Sussex cattle are one of the oldest pure breeds of English cattle used for meat production in the world (Bila 2019). At present, the Sussex cattle breed has both polled and horned strains available all over the world, including in South African (SA) herds. Moreover, the Sussex cattle breed was selectively bred in the late eighteenth century to develop a contemporary beef breed, which is currently in use in various nations around the world (Annelie 2014). The first Sussex cattle, a group of 20 cows, were brought to South Africa in 1903 by Mr Alec Holm of the Potchefstroom College of Agriculture (SA Studbook and Animal Improvement Scheme 2004). Today, Sussex cattle stud and commercial herds are found throughout South Africa, Botswana and Namibia; after nearly a century, this testifies to their adaptability to South African climatic conditions (Bila 2019). However, other beef cattle breeds such as Bonsmara, Nguni, Afrikaner and Simmentaler are also found throughout South Africa's provinces and used for beef production (Mapholi et al. 2011; Pienaar et al. 2014; Sanarana et al. 2016; Grobler et al. 2018; Bareki 2019; Hendriks et al. 2022).
The relationship between live body weight and linear body measurements helps to predict body weight, particularly in communal setups and pastoral production systems, where a weighing scale is unavailable to determine the animals' live body weight (Fatih et al. 2021). In many parts of the world, live body weight is the most economically important trait of animals because income is derived directly from the animal's body weight. Lately, animal breeders have paid more attention to clarifying the association between morphological traits and body weight, with the specific objective of increasing meat production (Faraz et al. 2021).
Multivariate adaptive regression splines (MARS) is a non-parametric regression technique that makes no assumptions about the distribution of the variables or about the relationships among the variables entered into the predictive model (Fatih et al. 2021). Classification and regression tree (CART) analysis is an approach that identifies the most important parameters in a specific data set and aids in designing a specific model (Fatih et al. 2021). Furthermore, CART is a technique that is appropriate for different types of data such as ordinal, nominal and continuous variables (Aytekin et al. 2018a; Tyasi et al. 2020). CART estimates can be used by animal keepers in decision-making processes regarding herd management. Faraz et al. (2021) reported that the CART and MARS algorithms overcome multicollinearity problems in estimating body weight, while Celik et al. (2017) revealed that linear body measurements have a positive effect on live body weight and assessed them as indirect selection criteria in sheep mating strategies. Several reports have been published on predicting body weight from linear body measurements in diverse animal species such as cattle, sheep, rabbits, dogs and camels (Celik et al. 2017; Tandoh and Gwaza 2017; Aytekin et al. 2018a; Eyduran et al. 2019; Putra et al. 2020; Tırınk et al. 2021; Ağyar et al. 2022).
The prediction of body weight from linear body measurements is of great importance for herd management, breed standards and national breeding schemes. However, to the best of our knowledge, there is no published literature on predicting body weight at weaning age from linear body measurements in the South African Sussex cattle breed. Hence, the objective of this research was to predict live body weight at weaning age from certain linear body measurements in South African Sussex weaner calves using the MARS and CART data mining algorithms.

Study site
This study was carried out at Huntersvlei farm, also known as the Rhys Evans Group (RE), in the Free State Province, South Africa. The farm is situated in Viljoenskroon, Fezile Dabi municipality; the site, temperatures, latitude, longitude and rainfall of the study area are as described by Bila (2019). Huntersvlei farm is currently one of the oldest and leading Sussex cattle stud herds in South Africa.

Animal management
All the animals used in the study were kept under a traditional grazing management system, which allows animals to graze freely in the camps during the day and afternoon. Fresh clean water was available in the camps at all times for the animals to drink. The animals received routine inspection and dipping for herd health management purposes. The linear body measurements were taken while the animal was in a standing position with its head raised and its weight on all four feet. A functional handling facility with a crowding pen, working crush and head clamp was used to handle the animals and to minimize movement during the measuring process.

Data collection
Linear body measurements and live body weight at weaning age were taken from 101 Sussex weaners (female = 57 and male = 44). The animals used in the study were between six and eight months old at weaning. The live body weight (BW) at weaning was measured using a balance weighing scale, whereas linear body measurements were measured using a measuring tape calibrated in centimetres (cm). The body weight at weaning and the linear body measurements (Figure 1), namely sternum height (SH), withers height (WH), heart girth (HG), hip height (HH), body length (BL), rump length (RL) and rump width (RW), were measured following the guidelines defined by Bila et al. (2021) and Hlokoe et al. (2022). To prevent variation between observers, a single individual took all the body weight and linear body measurements.

Statistical analysis
The Statistical Package for the Social Sciences (2019), version 26.0, was used to analyse the data, applying Student's t-test at a 5% significance level to examine the influence of sex on linear body measurement traits. Decision tree algorithms were used to design the model to estimate BW from linear body measurements of SA Sussex cattle weaners, following Eyduran et al. (2019). Cross-validation was set to 10 folds, as recommended by Celik and Yilmaz (2018). Moreover, goodness-of-fit criteria were used to evaluate the MARS and CART estimates of BW from linear body measurements of SA Sussex cattle weaners (Faraz et al. 2021).
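As an illustration of the sex comparison described above, the pooled-variance (Student) t statistic can be sketched in a few lines of Python. This is a minimal sketch and not the SPSS procedure used in the study; the function name `student_t` and the heart-girth values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def student_t(sample_a, sample_b):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df

# Hypothetical heart-girth values (cm); NOT data from the study.
females = [140.0, 142.0, 144.0, 146.0]
males = [146.0, 148.0, 150.0, 152.0]
t_stat, df = student_t(females, males)
print(round(t_stat, 3), df)
```

The resulting statistic would then be compared with the critical value of the t distribution for `df` degrees of freedom at the 5% level.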

Multivariate adaptive regression spline algorithm
The multivariate adaptive regression splines (MARS) algorithm is a non-parametric machine learning algorithm introduced by Friedman (1991) for handling pattern recognition problems in regression and classification with non-linear data. For this purpose, the MARS model fits a series of linear regression functions to predict the values of the continuous dependent variable (Iqbal et al. 2023). The MARS algorithm was carried out as described by Hlokoe et al. (2022) and Iqbal et al. (2023). The MARS data mining algorithm can be defined as:
$$f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m \prod_{k=1}^{K_m} h_{km}\left(X_{v(k,m)}\right)$$
where $f(x)$ is the estimated value of the dependent variable, $\beta_0$ is the intercept, $\beta_m$ are the coefficients of the $M$ basis functions, $h_{km}(X_{v(k,m)})$ is the basis function, $v(k,m)$ is an index of the predictor for the $m$th component of the $k$th product, and $K_m$ is the parameter regulating the order of interaction. After building the most suitable MARS model, the basis functions that did not contribute much to the model fitting performance were removed in the pruning process based on the following generalized cross-validation (GCV) error (Eyduran et al. 2019; Zaborski et al. 2019; Hlokoe et al. 2022):
$$GCV(\lambda) = \frac{\sum_{i=1}^{n}\left(y_i - y_{ip}\right)^2}{\left(1 - \frac{M(\lambda)}{n}\right)^2}$$
where $n$ is the number of training cases, $y_i$ is the observed value of a response variable, $y_{ip}$ is the estimated value of a response variable and $M(\lambda)$ is a penalty function for the complexity of the model with $\lambda$ terms.
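To make the role of the basis functions and the GCV more concrete, the following Python sketch evaluates an additive MARS-type model built from hinge functions and computes a GCV value. It assumes the common hinge form max(0, x − knot) and a GCV of the form SSE / (1 − M(λ)/n)²; the knots, coefficients and records are hypothetical and are not the fitted model of this study.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) for direction +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(row, intercept, terms):
    """Additive MARS prediction (no interactions): intercept + sum of coef * hinge."""
    return intercept + sum(
        coef * hinge(row[name], knot, direction)
        for name, knot, direction, coef in terms
    )

def gcv(observed, predicted, penalty):
    """GCV assumed as SSE / (1 - M(lambda)/n)^2, with M(lambda) passed as `penalty`."""
    n = len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return sse / (1.0 - penalty / n) ** 2

# Hypothetical terms: (predictor, knot in cm, direction, coefficient).
terms = [("HG", 140.0, 1, 2.0), ("HH", 110.0, 1, 3.0)]
rows = [{"HG": 145.0, "HH": 112.0}, {"HG": 138.0, "HH": 115.0}]
observed = [218.0, 214.0]  # hypothetical body weights (kg)
predicted = [mars_predict(r, 200.0, terms) for r in rows]
gcv_value = gcv(observed, predicted, penalty=1.0)
print(predicted, gcv_value)
```

During pruning, candidate sub-models would be compared on this quantity, and the one with the smallest GCV retained.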

Classification and regression tree algorithm
The CART tree was constructed by repeatedly splitting a node into a pair of child nodes, starting with the root node that contains the whole learning sample, following the suggestions of Breiman et al. (1984) and Kuhn and Johnson (2020).
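The node-splitting step can be illustrated with a short Python sketch. It assumes the usual regression-tree criterion, in which a split on a single predictor is chosen to minimise the summed squared error of the two child nodes; the function names and the data are hypothetical, and a full CART implementation would also search over all predictors and apply recursion and pruning.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the split x < threshold with the smallest SSE."""
    best = (None, sse(y))
    xs = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct predictor values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical heart girth (cm) and body weight (kg); NOT data from the study.
hg = [138.0, 140.0, 142.0, 146.0, 148.0, 150.0]
bw = [200.0, 210.0, 205.0, 260.0, 270.0, 265.0]
threshold, error = best_split(hg, bw)
print(threshold, error)
```

Applying the same search recursively to each child node yields the tree; each terminal node then predicts the mean body weight of the animals it contains.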

Goodness of fit test
The better model between MARS and CART was determined by calculating goodness-of-fit criteria (Celik and Yilmaz 2018; Hlokoe et al. 2022). The following goodness-of-fit criteria were used in the study.

Pearson's correlation coefficient
Pearson's correlation coefficient (r) is a descriptive statistic, meaning that it summarizes the characteristics of a dataset. Specifically, it describes the strength and direction of the linear relationship between two quantitative variables.
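For reference, Pearson's r can be computed directly from its definition, as in the following Python sketch. The function name `pearson_r` and the measurements are hypothetical and only illustrate the calculation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical heart girth (cm) and body weight (kg); NOT data from the study.
hg = [140.0, 142.0, 144.0, 146.0, 148.0]
bw = [220.0, 240.0, 250.0, 240.0, 250.0]
r = pearson_r(hg, bw)
print(round(r, 4))
```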

Adjusted coefficient of determination
Adjusted coefficient of determination (ARsq) is an adjustment of the coefficient of determination that takes into account the number of predictors in the model. It penalizes the addition of predictors that do not improve the fit of the model.
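The adjustment can be sketched in Python as follows. The sketch assumes the standard textbook form, ARsq = 1 − (1 − Rsq)(n − 1)/(n − k − 1), where k is the number of predictors; the software used in the study may apply a slightly different formula, and the values below are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(observed, predicted, n_predictors):
    """Textbook adjustment of R-squared for the number of predictors."""
    n = len(observed)
    rsq = r_squared(observed, predicted)
    return 1.0 - (1.0 - rsq) * (n - 1) / (n - n_predictors - 1)

# Hypothetical observed and predicted body weights (kg); NOT data from the study.
observed = [200.0, 220.0, 240.0, 260.0, 280.0]
predicted = [205.0, 215.0, 240.0, 265.0, 275.0]
rsq = r_squared(observed, predicted)
arsq = adjusted_r_squared(observed, predicted, n_predictors=1)
print(round(rsq, 4), round(arsq, 4))
```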

Root-mean-square error
Root-mean-square error (RMSE) is defined as the square root of the mean squared error and is also known as the standard deviation of the residuals (prediction errors). Smaller values of RMSE are desired. The formula for calculating the RMSE is given as follows:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_{ip}\right)^2}$$
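A direct Python translation of this definition, the square root of the mean squared error, is given below as a minimal sketch; the function name `rmse` and the body weights are hypothetical.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root-mean-square error: square root of the mean squared prediction error."""
    n = len(observed)
    return sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

# Hypothetical observed and predicted body weights (kg); NOT data from the study.
observed = [200.0, 220.0, 240.0, 260.0, 280.0]
predicted = [205.0, 215.0, 240.0, 265.0, 275.0]
rmse_value = rmse(observed, predicted)
print(round(rmse_value, 4))
```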

Standard deviation ratio
Standard deviation ratio (SDR) is another evaluation measure used to assess the performance of fitted models. It is the ratio of the standard deviation of the model's errors to the standard deviation of the observed values.

Akaike information criterion
The Akaike information criterion (AIC) is a measure for evaluating how well a model fits the data. Candidate models are compared by their AIC values, and the model with the smallest AIC is preferred.
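As a hedged illustration, the AIC of a model fitted by least squares is often written as n·ln(SSE/n) + 2k, where k is the number of estimated parameters. The Python sketch below assumes this form, which may differ from the exact expression used by the software in this study; the values are hypothetical.

```python
from math import log

def aic_least_squares(observed, predicted, n_params):
    """AIC under the assumed least-squares form: n * ln(SSE / n) + 2 * k."""
    n = len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return n * log(sse / n) + 2 * n_params

# Hypothetical observed and predicted body weights (kg); NOT data from the study.
observed = [200.0, 220.0, 240.0, 260.0, 280.0]
predicted = [205.0, 215.0, 240.0, 265.0, 275.0]
aic_value = aic_least_squares(observed, predicted, n_params=2)
print(round(aic_value, 4))
```

Under this form, adding a parameter raises the AIC by 2 unless the reduction in SSE compensates for it.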

Mean absolute percentage error
Mean absolute percentage error (MAPE) is a popular measure of prediction error. The MAPE measures the size of the error in percentage terms and is hence easy to interpret and understand. The smaller the MAPE, the better the prediction. The MAPE is defined as follows:
$$MAPE = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - y_{ip}}{y_i}\right|$$
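The percentage interpretation of the MAPE can be seen in the following Python sketch, which averages the absolute errors relative to the observed values and scales the result by 100. The function name `mape` and the body weights are hypothetical.

```python
def mape(observed, predicted):
    """Mean absolute percentage error; observed values must be non-zero."""
    n = len(observed)
    return 100.0 / n * sum(abs((y - p) / y) for y, p in zip(observed, predicted))

# Hypothetical observed and predicted body weights (kg); NOT data from the study.
observed = [200.0, 250.0]
predicted = [190.0, 275.0]
mape_value = mape(observed, predicted)  # errors of 5% and 10% average to 7.5%
print(round(mape_value, 2))
```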

Global relative approximation error
Global relative approximation error (RAE) expresses the overall approximation error of the model relative to the observed values.

Coefficient of variation
Coefficient of variation (CV) is a statistical measure of the relative dispersion of data points in a data series around the mean. In finance, the coefficient of variation allows investors to determine how much volatility, or risk, is assumed in comparison to the amount of return expected from investments.
The statistical evaluation of the MARS and CART data mining algorithms was performed in R using the ehaGoF package (version 0.1.1) introduced by Eyduran (2020).

Results
Descriptive statistics of body weight at weaning age and linear body measurements for female and male calves are presented in Table 1. Body weight and linear body measurements differed significantly between male and female calves (p ≤ 0.05), except for RL and RW. Table 2 presents Pearson's correlation coefficients between body weight at weaning and linear body measurements. In the female calves, body weight at weaning age was highly correlated with HG (r = 0.76, p ≤ 0.01) but not significantly correlated with SH (r = 0.23, p > 0.05). Body weight at weaning age was also highly significantly (p ≤ 0.01) correlated with WH (0.60), HH (0.62), BL (0.44), RL (0.41) and RW (0.60). In the male calves, body weight at weaning age was highly correlated with HH (r = 0.74, p ≤ 0.01) but not significantly correlated with SH (r = 0.24, p > 0.05). Lastly, body weight at weaning age was highly significantly (p ≤ 0.01) correlated with WH (0.51), HG (0.51), BL (0.63) and RW (0.53), and significantly (p ≤ 0.05) correlated with RL (0.33).
Table 3 shows the MARS model results. The first term of the model was the intercept, with a coefficient of 214.54. The second term, HG, had a cut-point of 141 cm and a positive coefficient of 2.65. The third term, HH, had a cut-point of 111 cm and a positive coefficient of 3.30. The last term, RW, had a negative coefficient (−2.49). The basis functions that reduced the performance of the model obtained after the forward and backward pass stages were eliminated on the basis of the GCV in MARS modelling.
Table 4 shows the findings of the CART data mining algorithm based on the cross-validation technique. The algorithm produced a tree structure of nine terminal nodes with the smallest relative error (the cross-validation error, 0.12) and mean of the error (0.43), which indicates that the cross-validation and coefficient of determination were close to each other. The leading predictors of BW at weaning age as the response variable were HG, HH, BL and RW. At the top of the regression tree diagram, the overall body weight at weaning age of the South African Sussex weaner calves was recorded as 250 kg.

Comparison of the MARS and CART algorithms
The goodness-of-fit criteria were computed to measure predictive performance; the test data set was used to make a sound decision between the MARS and CART data mining algorithms, and the results are summarized in Table 5. MARS showed higher predictive performance on the criteria than CART. The MARS data mining algorithm showed smaller RMSE, RRMSE, SDR, CV, MAPE, AIC, PI and MAD test results, whereas the CART data mining algorithm showed smaller r, Rsq and ARsq. In practice, the MARS model produced a simple interpretable equation with the greatest performance using three predictors (HG, HH and RW), compared to the CART data mining algorithm outcome based on the cross-validation method, which produced 10 complexity parameters with nine splits. Figure 2 depicts the candidate model selection for the optimal MARS model. The MARS models were established with the train function in the caret R package (Figure 2). Among the 185 candidate MARS models with 2-38 terms and 1-8 degrees of interaction, the MARS model of seven terms with no interaction effect was selected as the optimal model, having the smallest cross-validated RMSE value (Figure 2).
Figure 3 shows the regression tree diagram created by the CART algorithm in predicting body weight at weaning age from the linear body measurements. At the top of the regression tree diagram, the overall body weight at weaning age of the South African Sussex weaner calves was recorded as 250 kg. At the first depth of the tree, Sussex weaner calves with HG < 144 cm had an average body weight of 226 kg and were lighter by 64 kg. At the second depth of the tree, the average body weight of Sussex weaner calves with BL < 115 cm was 202 kg. Also at the second depth of the tree, the average body weight of Sussex weaner calves with HH < 116 cm was 241 kg. Furthermore, at the third depth of the tree, the average body weight of Sussex weaner calves was 235 kg.

Discussion
Celik and Yilmaz (2018) reported that linear body measurements have a positive influence on live body weight and are evaluated as indirect selection criteria in breeding strategies. The current study first investigated the relationship between linear body measurements and live body weight of Sussex cattle at weaning age. The correlation results indicated that, in female calves, body weight at weaning age was highly correlated with heart girth but not significantly correlated with sternum height. In male calves, body weight at weaning age was highly correlated with hip height but not significantly correlated with sternum height. The findings of the current study are in line with the report of Tyasi et al. (2020). The results of the present study showed that MARS had higher predictive performance on the criteria than the CART algorithm. The findings of the study suggest that the MARS algorithm can be used to predict body weight at weaning age in the South African Sussex cattle breed. The MARS data mining algorithm has been recommended for the prediction of live body weight in Nguni cows (Hlokoe et al. 2022). Furthermore, MARS has been recommended for the prediction of different factors in cattle breeding. For example, Çanga (2022) showed that MARS includes age, breed and live body weight for the prediction of carcass weight in seven cattle breeds. Karadas and Birinci (2019) showed that mastitis control before milking, birth month, lactation length and feed supply were important factors for average milk yield per cow. Aytekin et al. (2018b) showed that MARS includes chest circumference, back rump height, fattening period, body length, withers height, front rump height, back height and breed for the prediction of fattening final live body weight.

Conclusions
This research showed a positive correlation between body weight and certain linear body measurements of the South African Sussex cattle breed at weaning age. MARS indicated that heart girth, hip height and rump width had an effect on the body weight of South African Sussex cattle weaners, while CART indicated that heart girth, hip height, body length and rump width had an effect on the body weight of South African Sussex weaners. Further studies on the MARS and CART techniques are needed with the purpose of improving the BW of cattle breeds, or with larger sample sizes of Sussex cattle.

Mean absolute deviation
Mean absolute deviation (MAD) is a measure of the error where the error is defined as the difference between the fitted values of the model and the actual (observed) values. Since the absolute errors are used, the MAD avoids the problem of negative and positive errors cancelling each other out. A smaller MAD value from a model is an indication of a better fit (Iqbal et al. 2023).
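The cancellation problem that the MAD avoids can be shown with a small Python sketch that contrasts it with the signed mean error. The function names and the body weights are hypothetical.

```python
def mean_error(observed, predicted):
    """Signed mean error: positive and negative errors can cancel out."""
    return sum(y - p for y, p in zip(observed, predicted)) / len(observed)

def mean_absolute_deviation(observed, predicted):
    """Mean of the absolute errors, so errors of opposite sign do not cancel."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)

# Hypothetical observed and predicted body weights (kg); NOT data from the study.
observed = [200.0, 250.0, 300.0]
predicted = [190.0, 260.0, 300.0]
me_value = mean_error(observed, predicted)                # +10 and -10 cancel
mad_value = mean_absolute_deviation(observed, predicted)  # both count as 10
print(me_value, round(mad_value, 4))
```

Here the signed errors cancel, giving a mean error of zero even though two of the three predictions are off by 10 kg, whereas the MAD reports the typical size of the miss.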
Tyasi et al. (2020) revealed that in male Nguni cattle, linear body measurements such as sternum height, heart girth, withers height and rump width had a significant positive correlation with live body weight. However, the correlation coefficient does not indicate the effect of linear body measurement traits on the live body weight of animals; it only indicates the magnitude of the relationship among the measured traits (Tyasi et al. 2020). Regression models have therefore been employed to estimate body weight from linear body measurements of Sussex cattle at weaning. However, regression models cannot overcome the challenge of multicollinearity (Mathapo et al. 2022). Hence, data mining algorithms were used to predict live body weight from linear body measurements of the Sussex cattle weaners. In this study, the MARS and CART data mining algorithms were compared for the estimation of the live body weight of the South African Sussex cattle breed at weaning age.

Figure 2. Model selection for the optimal MARS model.

Mean error
Mean error (ME) is an informal term that usually refers to the average of all the errors in a set. An "error" in this context is an uncertainty in a measurement, or the difference between the measured value and the true/correct value.
$$ME = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_{ip}\right)$$

Performance index
Performance index (PI) is used in technical analysis to compare a stock's price trend to the general trend of a benchmark index.

Table 1. Descriptive statistics of body weight and linear body measurements in South African Sussex weaner calves.

Table 2. Correlation matrix of measured traits, female below diagonal and male above diagonal.

Table 4. CART algorithm outcome based on the cross-validation method.

Table 5. Goodness of fit criteria for MARS and CART algorithms.
MARS was recommended, based on the training and test data sets, by Çanga (2022) for the prediction of hot carcass weight of seven cattle breeds (Aberdeen-Angus, Simmental, Limousine, Holstein-Friesian, Charolais, Zebu and Hereford) using breed, age and live body weight. Karadas and Birinci (2019) used CART, Chi-Square Automatic Interaction Detector (CHAID), Exhaustive Chi-Square Automatic Interaction Detector (Exhaustive CHAID), MARS and Multilayer Perceptron (MLP) to determine the factors affecting dairy cattle and reported that MARS outperformed the other algorithms used. Aytekin et al. (2018a) recommended the use of the MARS data mining algorithm for the prediction of fattening final live weight from some body measurements and fattening period in young bulls of crossbred and exotic breeds such as Holstein, Simmental and Brown Swiss. Our MARS model produced a simple interpretable equation with the greatest performance using three predictors (heart girth, hip height and rump width), compared to the CART algorithm outcome based on the cross-validation method, which produced 10 complexity parameters with nine splits. The cross-validation results obtained by the MARS model produced 2-38 terms. Our findings differ from the findings of Faraz et al. (2021), which produced seven terms without interaction in the model selection graph. Lastly, these results produced only eight degrees of interaction. Hlokoe et al. (2022) showed that MARS produced 10 important variables, namely ear length, withers height, body depth, sternum height, rump length, bicostal diameter, head length, heart girth, body length and rump height, for the prediction of body weight in Nguni cows.