Reading Difficulties Identification: A Comparison of Neural Networks, Linear, and Mixture Models

ABSTRACT

Purpose: We aim to identify the most accurate model for predicting adolescent (Grade 9) reading difficulties (RD) in reading fluency and reading comprehension using 17 kindergarten-age variables. Three models (neural networks, linear, and mixture) were compared based on their accuracy in predicting RD. We also examined whether the same or a different set of kindergarten-age factors emerges as the strongest predictors of reading fluency and comprehension difficulties across the models.

Method: RD were identified in a Finnish sample (N ≈ 2,000) based on Grade 9 difficulties in reading fluency and reading comprehension. The predictors assessed in kindergarten included gender, parental factors (e.g., parental RD, education level), cognitive skills (e.g., phonological awareness, RAN), home literacy environment, and task-avoidant behavior.

Results: The results suggested that the neural networks model is the most accurate method, compared with the linear and mixture models or their combination, for the early prediction of adolescent reading fluency and reading comprehension difficulties. The three models elicited rather similar results regarding the predictors, highlighting the importance of RAN, letter knowledge, vocabulary, reading words, number counting, gender, and maternal education.

Conclusion: The results suggest that neural networks have strong promise in the field of reading research for the early identification of RD.


Predictors of RD
In this study, we focused on the prediction of RD in decoding and in reading comprehension (RC) in adolescence, which remains an understudied age group (Manu et al., 2021; Ricketts et al., 2020). Because we focused on adolescent reading skills and our sample concerns a language with a highly transparent orthography (Finnish), our focal decoding measure is reading fluency (RF) rather than accuracy, which by this age and in this context shows too little variation to have discriminative value (Aro, 2017).
Motivational factors are also important predictors of children's learning (e.g., Schiefele et al., 2012). In the present study, motivation was represented by task-avoidant behavior, which prior studies have shown to be more prevalent among children with poor reading skills and which may further impede their reading progress (Hirvonen et al., 2010). Longitudinal research has indicated that higher task avoidance predicts slower reading development, even after controlling for autoregressors and for cognitive factors such as phonological processing and non-verbal IQ.
Home literacy environment, particularly shared parent-child book reading and instruction of literacy skills at home, has also been found to predict children's language and literacy development (e.g., Khanolainen et al., 2020; Mol & Bus, 2011; Silinskas et al., 2020a). Some evidence suggests, however, that the home literacy environment may simply reflect parental skills: the correlation between home environment and child skills may in fact be due to the correlation between the skills of parents and their children. Puglisi et al. (2017) and Van Bergen et al. (2017) reported that the correlation between measures of children's literacy skills and home literacy environment largely lost its significance after controlling for parental literacy skills. In light of this, research that examines the effects of home literacy environments should also include parental reading measures. Parental RD (or family risk for RD) has been identified as one of the best early predictors of children's RD (Esmaeeli et al., 2019; Psyridou et al., 2021; Snowling & Melby-Lervåg, 2016; Van Bergen et al., 2015). In the present study, we included family risk for RD, at-home learning activities, and parental education level as predictors of children's RD. The inclusion of parental education level is in line with studies documenting strong effects of parental education on children's reading development (e.g., Hamilton et al., 2016). Finally, the child's gender was included as a predictor, as previous studies have suggested increased vulnerability for RD among boys (e.g., Manu et al., 2021; Psyridou et al., 2021; Quinn & Wagner, 2015).

Statistical models for the prediction of RD
There are multiple potential statistical models available for early identification of risk factors for RD. In the present study we focus on comparing two commonly used but differing approaches (linear regression and mixture model) and a newer neural network-based approach. The deep artificial neural

The present study
The present study aims to identify the most accurate model for predicting adolescent (Grade 9) RF and RC difficulties using a large set of kindergarten-age variables. Three models (neural networks, linear, and mixture) were compared based on their accuracy in predicting RD, and we examined whether the same or a different set of kindergarten-age factors emerges as the strongest predictors of RF or RC difficulties in the three models. The neural networks model was selected because it is a newer method that has been broadly used in other fields with very promising results. We sought to examine whether the neural networks model can provide as high, or higher, accuracy in the identification of RD as the more commonly used methods. The linear and mixture models were selected as contrasts to the neural networks model because they are commonly used, take different approaches to prediction while both relying on maximum likelihood estimation, and have robust evidence of their usefulness in reading research.
To the best of our knowledge, no previous studies have compared these three models in the field of reading research. Previous studies in other fields that have used neural networks for prediction (without comparing them with linear and/or mixture models) have demonstrated high accuracy (e.g., Alzheimer's: Lu et al., 2018; ADHD: Kuang & He, 2014; autism: Heinsfeld et al., 2018; biomedicine: Mamoshina et al., 2016; Parkinson's: Choi et al., 2017; psychiatry: Vieira et al., 2017). Given these results, we hypothesized that neural networks would provide high accuracy in predicting RD.

Participants
The present study was part of the Finnish longitudinal First Steps Study that includes data from approximately 2,000 children from kindergarten to Grade 9 (Lerkkanen et al., 2006). Reading skills of the participants were assessed in Grades 1, 2, 3, 4, 6, 7, and 9. In the present study, we used data from the children's assessments in the fall and spring of kindergarten (i.e., age 6) and from their RF and RC assessments in Grade 9. There were 1,880 children at the beginning of the follow-up, but when they entered school, all their classmates were also invited to participate. Over the years the sample size varied, and altogether 2,518 children participated in at least one assessment. Descriptive statistics for the sample used in this study can be found in Table 2 (more information on the number of participants in each assessment and the descriptive statistics for the reading measures used in Grades 1-9 can be found in Psyridou et al., 2021). The sample was drawn from four municipalities - two in central, one in western, and one in eastern Finland - of which one was mainly urban, one mainly rural, and two included both urban and semi-rural environments. In three of the municipalities, the participants comprised the entire age cohort of children, and in the fourth, the participating children comprised about half of the age cohort. Of the parents who were contacted, 78-89% agreed to participate in the study. Ethnically and culturally, the sample was very homogeneous and representative of the Finnish population, and parental education levels were very close to the national distribution of Finland (Statistics Finland, 2007). The university's Ethical Committee approved the First Steps Study, and the study conforms to the Declaration of Helsinki. All participants provided informed written consent, and children gave their assent prior to their inclusion in the study.

Measures
The children were assessed longitudinally, in kindergarten (fall 2006 and spring 2007) and in Grade 9 (RF: spring 2016; RC (PISA): fall 2015). The measures are described in detail in Table 1.

Statistical analysis
Kindergarten-age measures were used to predict RF and RC difficulties in Grade 9 using three models (neural networks, linear, and mixture). RF and RC were analyzed separately, and RF or RC difficulty was defined as scoring in the lowest 10% of the RF or RC distribution, respectively.
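Operationalizing a difficulty as the lowest 10% of a distribution amounts to thresholding at the 10th percentile. A minimal sketch (variable names and data are illustrative, not from the study's materials):

```python
import numpy as np

rng = np.random.default_rng(0)
rf_scores = rng.normal(loc=50, scale=10, size=1000)  # placeholder Grade 9 fluency scores

# RF difficulty = scoring in the lowest 10% of the distribution
cutoff = np.percentile(rf_scores, 10)
rf_difficulty = (rf_scores <= cutoff).astype(int)  # 1 = difficulty, 0 = no difficulty
```

The same thresholding is applied separately to RF and RC, yielding the two dichotomous outcomes analyzed below.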

Neural networks model
For the neural networks model, the Multilayer Perceptron Network (MLP) was used to produce a predictive model to identify RF or RC difficulties based on the kindergarten-age measures. MLP was conducted using SPSS (version 26). We first examined whether the missing-completely-at-random (MCAR) assumption could be established for the data. Little's MCAR test indicated that the data were not MCAR (Little, 1988). The effect sizes (Cohen's d) for the comparison of the participants with and without data ranged from .01 to −.39. However, because the MLP requires complete data, an expectation-maximization algorithm in SPSS was used to impute missing values using only the kindergarten-age variables, so that the Grade 9 reading variables remained unchanged and independent of the kindergarten-age variables.
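The imputation here was done with SPSS's EM algorithm; as a rough open-source analogue (an illustration, not the authors' implementation), scikit-learn's IterativeImputer can fill predictor-side missingness while leaving the outcome variables untouched:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the class below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X_kind = rng.normal(size=(200, 5))                 # placeholder kindergarten predictors
X_kind[rng.random(X_kind.shape) < 0.10] = np.nan   # ~10% missing values

# Impute using only the kindergarten-age variables, so the Grade 9
# reading outcomes stay unchanged and independent of the imputation
imputer = IterativeImputer(max_iter=20, random_state=1)
X_complete = imputer.fit_transform(X_kind)
```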
Table 1. Description of the kindergarten-age measures (excerpt).

Phonological awareness: initial phoneme identification (fall 2006, spring 2007). Children were asked to name aloud a set of objects and then to identify the object with the same initial phoneme as the one spoken aloud by the assessor. All sounds were single phonemes. A score of 1 was given for every correctly selected object. Max 10. Cronbach's alpha = .75 (fall), .71 (spring).

Letter knowledge (ARMI; Lerkkanen et al., 2006) (fall 2006, spring 2007). 29 uppercase letters arranged in random order across three rows. Students were shown the letters on a sheet row by row and asked to name them aloud. Either a phoneme or a letter name was regarded as correct. The test was discontinued after 6 incorrect responses.

Reading words (fall 2006, spring 2007). 6 words at the fall assessment and 10 words at the spring assessment. Students were asked to read the words aloud. At the fall assessment, there were 4 two-syllabic words, 1 three-syllabic word, and 1 five-syllabic word; at the spring assessment, there were 7 two-syllabic words, 2 three-syllabic words, and 1 five-syllabic word. A score of 1 was given for every correctly read word. Max 10. Cronbach's alpha = .84 (fall), .85 (spring).

Listening comprehension (Vauras et al., 1995) (spring 2007). Groups of 6 students were read a story (130 words) aloud twice and were then asked six multiple-choice questions based on the story, one question at a time. Four of the questions had three choices, and two had four choices. Each question was accompanied by 3 or 4 pictures, and students responded by marking the picture that correctly matched the story in their own test booklet. 2 points were given for every correct answer. Max 12. Cronbach's alpha = .30 / Revelle's omega = .42.

Number counting (for similar tasks, see Koponen et al., 2007) (fall 2006, spring 2007). Pre-math skills were assessed through four tasks in which children were asked to count aloud forward (from 1 to 31 and from 6 to 13) and backward (from 12 to 7 and from 23 to 1).

Parental education. Answers were recoded using a 3-point scale: basic education, vocational education, and university education.

Home learning environment (Sénéchal et al., 1998; see Silinskas et al., 2020b). A 3-item questionnaire about at-home learning activities answered by the mothers. It included one item on shared reading, answered on a 5-point Likert scale (1 = less than once a week, 2 = 1-3 times a week, 3 = 4-6 times a week, 4 = once a day, 5 = more than once a day), and two items on in-home teaching of literacy (teaching letters and teaching reading), also answered on a 5-point scale (1 = never at all/rarely to 5 = very often/daily). Each item was examined individually.

Family RD. Parents were asked to indicate on a 3-point scale whether they had clear difficulties, some difficulties, or no difficulties. A child was considered as having family risk for RD if the mother or the father reported having experienced some or clear RD.

The MLP was set to randomly choose approximately 70% of the data for the training model and 30% for the testing model. Furthermore, we allowed the model to choose the number of hidden layers and units automatically. The training sample was used for the estimation of the weights. Once the weights had been estimated, the testing sample was used to check whether the weights generated with the training sample apply to another subset of people and whether the model can be generalized. Because the selection of participants matters for balancing the imputed cases between the training and testing models, we used 20 different seed values to select and allocate participants to the training and testing models, randomly, using the continuous RF and RC variables. Because MLP results vary from analysis to analysis, the MLP was run 20 times using the same seed, and the predicted RF and RC scores from each analysis were saved (i.e., 20 seeds were run 20 times each, separately for RF and RC). Next, the mean of the 20 predicted RF/RC scores was calculated for each seed, and the correlations between the mean of the predicted RF/RC scores and the observed RF/RC scores in the training and testing models were estimated for each seed. Across the 20 seeds, these correlations in the training (70% of data) and testing (30% of data) models varied from .49 to .62 for the RF models (Appendix B) and from .48 to .58 for the RC models (Appendix C). The seed with the closest and highest correlations between the training and testing models was used for the prediction of RF/RC difficulties. Next, using the best-identified seed, the MLP was run 20 times to predict the dichotomized RF variable and 20 times to predict the dichotomized RC variable with the kindergarten variables, and the predicted RF/RC scores from each analysis were saved. As the dependent variables are dichotomous, the predicted scores represent each participant's probability of belonging to the lowest 10% of the RF or RC distributions.
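The seed-and-average procedure can be approximated with scikit-learn's MLPClassifier (a sketch; the hidden-layer size, run counts, and data are illustrative, not the settings SPSS chose automatically):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))                              # placeholder kindergarten predictors
latent = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)
y = (latent <= np.percentile(latent, 10)).astype(int)      # lowest 10% = "difficulty"

# ~70/30 split into training and testing samples, as in the MLP procedure
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# Because network results vary from run to run, train several networks
# and average their predicted probabilities (the study used 20 runs)
probs = []
for run in range(5):
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=run)
    net.fit(X_tr, y_tr)
    probs.append(net.predict_proba(X_te)[:, 1])
p_mean = np.mean(probs, axis=0)  # probability of belonging to the lowest 10%
```

The averaged probabilities play the role of the saved predicted scores that feed the ROC analysis below.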
The average of the 20 predicted scores was used to test the ability of the model to predict RF and RC difficulties using Receiver Operating Characteristic (ROC) curves (Fawcett, 2006). The ROC curve plots the true-positive rate (i.e., sensitivity) on the y-axis against the false-positive rate (i.e., 1-specificity) on the x-axis, comparing sensitivity versus 1-specificity across the full range of cutoffs for predicting a dichotomous outcome. Each point on the ROC curve represents a sensitivity/1-specificity pair corresponding to a particular cutoff. The ROC curve is a useful tool because the curves of different models can be directly compared, either in general or for specific thresholds, and the area under the curve can be used as a summary of the model's skill. The larger the area under the curve, the better the identification of those with and without difficulties.
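The ROC construction described above can be sketched with scikit-learn (synthetic scores, purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
y_true = (rng.random(300) < 0.1).astype(int)        # ~10% have a "difficulty"
scores = 0.4 * y_true + 0.8 * rng.random(300)       # noisy predicted probabilities

# fpr = 1-specificity (x-axis), tpr = sensitivity (y-axis), one pair per cutoff
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)                 # area under the curve
```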
In addition to the ROC curves, we attempted to identify the most important kindergarten-age factors in predicting RF and RC difficulties. We conducted an independent variable importance analysis (provided by the MLP), which computes each kindergarten-age factor's importance in determining the neural network based on the combined training and testing samples. The analysis was conducted 20 times, and the mean of the normalized importance for each kindergarten-age factor was calculated based on the results from the 20 analyses.
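The SPSS importance index is specific to the fitted network; a commonly used, broadly analogous alternative (an assumption here, not the authors' exact computation) is permutation importance, which shuffles one predictor at a time and records the drop in model performance:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.2 * rng.normal(size=400) > 1.0).astype(int)  # only predictor 0 matters

clf = LogisticRegression().fit(X, y)

# Larger mean drops in accuracy under shuffling = more important predictors
result = permutation_importance(clf, X, y, n_repeats=10, random_state=4)
ranking = np.argsort(result.importances_mean)[::-1]
```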

Linear model
For the linear model, a logistic regression analysis with backward deletion (BReg) was used to identify the kindergarten factors that predict RF or RC difficulties in Grade 9. BReg was conducted in Mplus (version 8.7). Logistic regression is useful when the aim is to predict the presence or absence of a characteristic from a set of predictors. It is similar to linear regression but is more suitable when the dependent variable is dichotomous (in this case, RD vs. no RD). Even though stepwise regression (either forward or backward) has received criticism, it was selected because it allows us to prune a list of plausible explanatory variables down to a parsimonious group of the "most important" variables; this criticism is also mitigated by having separate training and testing samples. We used backward selection because starting with the full model has the advantage of considering the effects of all variables simultaneously. This is especially important under collinearity, where backward selection may retain the collinear variables in the model, unlike forward selection, where none of them may be entered. Full information maximum likelihood was used as the estimator, and robust standard errors were calculated (MLR estimator in Mplus); the MLR estimator yields less biased standard errors. The BReg analysis was done in a stepwise manner: all kindergarten-age factors were added to the model, and then the factor with the highest p-value was removed (one factor at a time) until all remaining factors were statistically significant. Using the full sample (N = 1,880) minus the testing sample as defined by the MLP (leaving N = 1,535 for RF and N = 1,576 for RC), we identified the best minimal group of factors that predicted RF or RC difficulties. Predictive values for each individual were saved and used for the ROC curves and the comparison of the models.
ROC curves were produced using the same testing sample as in the MLP (N = 345 for RF; N = 304 for RC) to examine the model's ability to predict RF or RC difficulties.

Mixture model
For the mixture model, a latent profile analysis (LPA) was used to identify homogeneous profiles in the kindergarten-age measures based on the whole data set (N = 1,880). LPA was conducted in Mplus (version 8.7). To ensure each profile's validity, we used 500 starting values, as a large set of random starting values is recommended (Asparouhov & Muthén, 2008). In the best-fitting LPA model, individuals are assigned to a specific profile based on their highest posterior probabilities. This information was saved to an Mplus file and used further in SPSS to compare the results with the other two models. Given that ROC curves could not be produced for the LPA, a cross-tabulation between the LPA profile grouping and the dichotomized RF and RC variables was used.
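LPA on continuous indicators is closely related to a Gaussian mixture model; a rough analogue (scikit-learn rather than Mplus; the data and profile counts are illustrative) compares solutions with different numbers of profiles and assigns individuals by their highest posterior probability:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
# Two synthetic "profiles" of kindergarten scores (300 + 100 children, 4 measures)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 4)),
               rng.normal(3.0, 1.0, size=(100, 4))])

# Compare 1-3 profile solutions, here by BIC (Mplus additionally offers VLMR/LMR tests).
# Multiple random initializations guard against local optima, as with Mplus starting values.
fits = {k: GaussianMixture(n_components=k, n_init=10, random_state=6).fit(X)
        for k in (1, 2, 3)}
best_k = min(fits, key=lambda k: fits[k].bic(X))

# Assign each individual to the profile with the highest posterior probability
profile = fits[best_k].predict_proba(X).argmax(axis=1)
```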

Model comparison
Finally, using the same 30% of the data used for the testing round in the MLP and for evaluating the linear model, the three models' results were compared using logistic regression to examine which method provides predictive information that the others do not. The maximum likelihood function in logistic regression gives a chi-square test of model fit. The dichotomized RF or RC variable was used as the dependent variable and was predicted using the predicted MLP scores (i.e., the predicted probability of the individual belonging to the lowest 10% of the RF or RC distribution based on the MLP analysis), the predictive BReg scores (i.e., the predicted probability of the individual belonging to the lowest 10% of the RF or RC distribution based on the BReg analysis), and the saved latent class information from the LPA (i.e., the profile to which each individual belonged). The prediction scores and latent class information of the three models were used as independent variables. One method's predictive scores were added in the first block, and the predictive scores from another method were added in the second block. The chi-square test indicates whether the predictor in the second block adds information for predicting the dichotomous RF or RC variable beyond the predictor in the first block. For the most informative method, we also examined whether it was more informative than the other two methods combined.
Computational codes for all the analyses can be found in Supplemental Materials.

Descriptive statistics
The descriptive statistics of the kindergarten-age factors and the RF and RC measures in Grade 9 are presented in Table 2. The correlations between all assessed measures are displayed in Appendix A. Both RF (mean score) and RC significantly correlated with all kindergarten-age factors, except for teaching letters at home.

Neural networks model
For the best-identified seed, the correlation between the predicted and observed RF scores was .58 for the training model and .58 for the testing model, while the correlation between the predicted and observed RC scores was .55 for the training model and .55 for the testing model. Using the best-identified seed, the MLP was run 20 times to predict the dichotomized RF variable and 20 times to predict the dichotomized RC variable with the kindergarten variables. The average of the 20 predicted scores for RF and RC represents the predicted probability that the individual belongs to the lowest 10% of the respective distribution. These averages were used to produce the ROC curves for RF and RC (Figures 1a and 2a); the area under the curve was .79 for RF and .83 for RC. Based on the ROC analysis, when the sensitivity was .82, the 1-specificity was .37 and the precision was .24. (See also Tables 3 and 4 for the corresponding sensitivity, specificity, and precision estimates for different cutoff values, and Tables 5 and 6 for specificity and precision values corresponding to specific sensitivity values.) Finally, the independent variable importance analysis indicated that the most important factors for predicting RF difficulties were RAN, letter knowledge (spring), and gender (Figure 3). For the prediction of RC difficulties, vocabulary, RAN, and letter knowledge (spring) emerged as the most important factors (Figure 4).

Linear model
The estimates and p-values for each BReg step are presented in Appendix D for RF and Appendix E for RC. In the final RF model (Table 7), four significant predictors remained: gender, letter knowledge (spring), number counting (spring), and RAN. The model explained 32% of the variance in Grade 9 RF. In the final RC model (Table 8), five significant predictors remained: maternal education level, gender, reading words (spring), RAN, and vocabulary. The model explained 39% of the variance in Grade 9 RC.

Using the regression equation with the estimated coefficients for the significant predictors of RF and RC, we calculated predictive values for each individual, representing the predicted probability that the individual belongs to the lowest 10% of the respective distribution. These predictive values were used first for the ROC curves and then for comparing the models. As with the MLP, ROC curves were produced using 30% of the data to examine the model's ability to predict RF or RC difficulties (Figures 1b and 2b). The ROC curves suggested good accuracy in identifying RF and RC difficulties; the area under the curve was .72 (p < .001, 95% CI .607-.822) for RF and .75 (p < .001, 95% CI .664-.828) for RC. Based on the ROC analysis, for RF, when the sensitivity was .82, the 1-specificity ranged from .62 to .64 and the precision from .10 to .11; for RC, when the sensitivity was .82, the 1-specificity ranged from .41 to .46 and the precision from .20 to .22. (See also Tables 3 and 4 for the corresponding sensitivity, specificity, and precision estimates for different cutoff values, and Tables 5 and 6 for specificity and precision values corresponding to specific sensitivity values.) In the testing data, out of the 345 cases in the RF analysis, 28 true positive cases were identified by both the MLP and the BReg; out of the 304 cases in the RC analysis, 38 true positive cases were identified by both models.

In Tables 7 and 8, the standardized estimates are presented for each factor along with the p-value; asterisks represent factors that were deleted in previous steps. For the identification of the best minimal group of predictors, we used the full sample without imputed data and excluding the testing sample (N = 1,535 for RF; N = 1,576 for RC).

Mixture model
Nine latent profile solutions were tested and compared, each with a different number of profiles (1 through 9; Table 9). The model with eight profiles was considered the best fitting, as the VLMR and LMR tests indicated that models with more than eight profiles were not significantly better. The average latent class probabilities for belonging to a profile were very high, suggesting that the eight profiles were very distinct: .95 for Profile 1; .94 for Profile 2; .92 for Profile 3; .99 for Profile 4; .99 for Profile 5; .98 for Profile 6; .99 for Profile 7; and .99 for Profile 8. As Figure 5 and Table 10 show, the eight identified profiles included two profiles with below-average scores in all kindergarten skills and six profiles with average or above-average scores. Profile 1 had low scores in all cognitive skills (particularly in phonological awareness, letter knowledge, and number counting) and had family risk for RD. Profile 2 also had below-average scores in all kindergarten assessments, but not as low as Profile 1, and did not have family risk for RD. Furthermore, Profile 5 had average cognitive skills but below-average reading skills. Each individual's profile membership was saved.
Cross-tabulation and a chi-square test were conducted between the profile membership variable and the RF difficulty (Table 11) and RC difficulty (Table 12) variables using the testing data (the same 30% as in MLP). The proportions of RF difficulty differed between the latent profiles [χ 2 (7) = 23.51, p < .01]. Similarly, the proportions of RC difficulty differed between the latent profiles [χ 2 (7) = 36.53, p < .001]. The adjusted standardized residuals indicated that both RF difficulties (adj. residual 4.06) and RC difficulties (adj. residual 5.04) were more common in Profile 1 than they would have been by chance.
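The cross-tabulation, chi-square test, and adjusted standardized residuals can be computed as follows (scipy; made-up counts, not the study's Tables 11-12):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = latent profiles, columns = (no difficulty, difficulty)
table = np.array([[20, 12],
                  [60,  5],
                  [80,  4]])

chi2_stat, p, dof, expected = chi2_contingency(table)

# Adjusted standardized residuals; values above ~2 mark cells where
# difficulties are more common than expected by chance
n = table.sum()
row_p = table.sum(axis=1, keepdims=True) / n
col_p = table.sum(axis=0, keepdims=True) / n
adj_resid = (table - expected) / np.sqrt(expected * (1 - row_p) * (1 - col_p))
```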

Comparison of the three models
Next, we compared the three models' ability to predict RF or RC difficulties. The prediction scores and latent class information from the three models were used as independent variables and the dichotomous RF/RC variables as dependent in the logistic regression model. The chi-square test was used to reveal which model provided a prediction of reading difficulty that had additional information in relation to other models when trying to predict RD. The comparisons were performed using the 30% of data used for the testing model in the MLP.
For both RF and RC, the chi-square tests indicated that the MLP increased the prediction over the BReg and the LPA, whereas neither the BReg nor the LPA, nor their combination, added to the prediction over the MLP.

Note to Table 9: The VLMR and LMR tests compare the estimated model with a model with one fewer profile than the estimated model; a p-value of less than .05 shows that the estimated model is better and that the model with one fewer profile should be rejected. Entropy ranges from 0 to 1, and higher values show higher classification utility. In addition, the clarity of the latent profiles was examined via the average posterior probabilities for the most likely latent profile membership, which show how distinct the profiles are.
In addition to the chi-square tests, the ROC curves for the BReg and MLP for RF (Figure 1) and RC (Figure 2) suggest that the MLP had greater power to predict both RF and RC difficulties. For RF, the area under the curve for the MLP was .79, while for the BReg it was .72. For RC, the area under the curve for the MLP was .83, while for the BReg it was .75.

Discussion
The complexity of risk and protective factor patterns in RD challenges the early identification of at-risk children. It is critical to find methods that can handle the complexity of multiple interacting predictors to ensure that children are provided with timely support. The present study aimed to compare three models - linear (BReg), mixture (LPA), and neural networks (MLP) - based on their accuracy in predicting adolescent RD from multiple kindergarten-age variables. Furthermore, we examined whether the three models identified the same or a different set of kindergarten-age predictors of Grade 9 RF and RC difficulties. The linear model was used as a baseline because of its regular use in reading research. The mixture model is another often-utilized method and was included because, unlike the linear model, it can identify patterns or profiles of individuals across multiple variables that may signal heightened risk for RD. Neural networks models, although widely used in other fields, have rarely been used in reading research so far. The results indicated that the MLP was more informative in predicting RF and RC difficulties than the other two models, a finding in line with the notion that neural networks are a powerful tool for predictive modeling.

Notes to Tables 10-12: For the variables family risk and gender, a cross-tabulation and a chi-square test were used; for the rest of the variables, a one-way ANOVA was conducted, with post-hoc comparisons using either Bonferroni or Dunnett T3, depending on the equality of the variances. The LPA was conducted only with the kindergarten-age assessments and without imputed data (N = 1,880). *p < .05, **p < .01, ***p < .001. The cross-tabulations between the LPA profile grouping and the dichotomized RF and RC variables were conducted using the testing sample (the same 30% of the data used for the testing round in the MLP; N = 345 for RF, N = 304 for RC).

Accuracy in predicting RD
All the models significantly predicted Grade 9 RF and RC scores from the kindergarten-age variables. The LPA identified a group with considerably higher risk for RD, and both the BReg and the MLP predicted significant amounts of variance in Grade 9 RD. Utilizing the same data employed for the testing model in the MLP (30% of the data), the area under the curve for the MLP was .79 for RF and .83 for RC, while for the BReg it was .72 for RF and .75 for RC. A value between .70 and .80 is considered acceptable, while a value between .80 and .90 is considered excellent (Mandrekar, 2010). Even though the results from the MLP and the BReg were both acceptable (for RC, the MLP produced excellent results), the MLP classification was somewhat more accurate. Comparing the MLP, LPA, and BReg showed that neither the LPA nor the BReg provided any additional information for predicting either RF or RC difficulties over the MLP. In contrast, the MLP provided additional information over the other two methods, both for RF and RC. In fact, the MLP was even more informative in predicting RF and RC difficulties than the combination of information from the other two models. Although, to the best of our knowledge, there are no previous studies comparing the use of these models for the prediction of RD, the findings were in line with our initial hypotheses expecting the MLP to show high accuracy, based on previous findings in other fields that have used neural networks for prediction (e.g., Alzheimer's: Lu et al., 2018; ADHD: Kuang & He, 2014; autism: Heinsfeld et al., 2018; biomedicine: Mamoshina et al., 2016; Parkinson's: Choi et al., 2017; psychiatry: Vieira et al., 2017). Linear models generally have been the dominant statistical models because their functions can be solved quickly, usually in one step, and their solutions are unique and represent a global optimum (i.e., the best possible solution among all solutions).
Furthermore, the model fit works with relatively small samples, and the testing is well understood (Durstewitz et al., 2019). Compared with linear models, neural networks are more time-consuming and complicated, and it is often impossible to identify the global optimum (Goodfellow et al., 2016). They also require large amounts of data for training the parameters (Durstewitz et al., 2019) and are more difficult to interpret (Urban & Gates, 2021). Regardless of these difficulties, however, they can identify complex patterns in data and make very accurate predictions compared with linear models (Durstewitz et al., 2019). Furthermore, neural networks continue to work when the number of variables is larger than the sample size and in cases in which many mutually correlated weak factors influence the phenomenon (Urban & Gates, 2021). Another great benefit of the neural networks model is that, once trained, it can be generalized and used to perform the same task with new inputs (Cichy & Kaiser, 2019). In RD prediction models such as ours, there are typically many kindergarten-age predictors that are correlated with each other. These correlations cause collinearity problems in the linear models but not in the MLP model, which may have contributed to the better performance of the MLP over the BReg. Another possible reason why the MLP performed better than the BReg in the identification of RD is that the MLP can identify linear, non-linear, and interaction effects (for example, several combinations of the independent variables building cumulative risk), or a combination of these, which the BReg cannot.
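A minimal illustration of the interaction argument: in the synthetic example below, risk arises only from a particular combination of two binary factors (an XOR-like pattern), which a linear classifier cannot represent but a small MLP can. All data, variable names, and settings are illustrative and not taken from the study.

```python
# Illustrative sketch: a purely interactional (XOR-like) risk pattern.
# A linear model has no main effects to exploit here; an MLP can learn it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2)).astype(float)  # two binary "risk factors"
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int)  # risk iff exactly one present

# In-sample accuracy: the linear model stays near chance on XOR data,
# while the MLP can fit the interaction.
lin_acc = LogisticRegression().fit(X, y).score(X, y)
mlp_acc = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000,
                        random_state=0).fit(X, y).score(X, y)
print(f"linear accuracy: {lin_acc:.2f}, MLP accuracy: {mlp_acc:.2f}")
```

A linear model could only capture such a pattern if the analyst specified the interaction term in advance; the MLP discovers it from the data.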

Important kindergarten factors for predicting adolescent RD
Among the three models, some consensus was reached on which factors are the most important for predicting adolescent RF or RC difficulties. Unfortunately, while the MLP model indicates which factors are the most important for identifying RD, it does not indicate whether the effects of these factors are statistically significant. Overall, three factors emerged as the most important for predicting RF difficulties from the MLP: RAN, gender, and letter knowledge (spring). The same three factors, with the addition of number counting (spring), emerged as the most significant from the BReg. For RC, RAN, letter knowledge (spring), and vocabulary emerged as the most important factors from the MLP. From the BReg, RAN, gender, vocabulary, reading words (spring), and maternal education level were the most significant. The LPA indicated that both RF and RC difficulties were more common in Profile 1 than in the other profiles. In Profile 1, children had low scores in all kindergarten cognitive skills and also had family risk for RD. The comparison across the LPA profiles showed that those belonging to Profile 1 were more often male, had family risk for RD, and had significantly lower scores in all cognitive skills than those belonging to the other profiles.
Both the MLP and BReg suggested that RAN was an important factor in predicting RF and RC difficulties. This is in line with previous studies that have reported RAN's significance as a predictor of RF (Georgiou et al., 2016; Landerl et al., 2019), particularly in more transparent orthographies such as Finnish (Kairaluoma et al., 2013). Considering the association between word reading fluency skills and RC (see also Florit & Cain, 2011), and that RAN is among the best predictors of reading fluency (e.g., Eklund et al., 2018; Georgiou et al., 2016; Landerl et al., 2019), it is not surprising that RAN also emerged as an important predictor of RC. Even though decoding's predictive power drops over time (e.g., Florit & Cain, 2011), previous studies have reported that RF remains a significant factor in predicting adolescent RC (García & Cain, 2014; Manu et al., 2021; Stanley et al., 2018; Torppa et al., 2018). Thus, it is likely that RAN's effect on RC found in this study is mediated via RF (e.g., Eklund et al., 2018). In addition to RAN, letter knowledge emerged as an important factor in predicting RF difficulties in both the MLP and BReg, which is in accordance with several previous studies (e.g., Clayton et al., 2020; Manu et al., 2021).
Interestingly, the BReg and LPA suggested that number counting may also be an important factor in predicting RF difficulties, which is noteworthy considering that number-counting skills usually are not included in the prediction of RF difficulties. However, studies that have included number counting in the prediction of RF skills have suggested that it could be a significant additional predictor (Bernabini et al., 2021; Koponen et al., 2016). Finally, vocabulary was found to be an important factor in predicting RC difficulties by both the MLP and BReg. This is also in accordance with previous studies suggesting that RC is heavily reliant on oral language comprehension skills, such as vocabulary (e.g., Petscher et al., 2018; Psyridou et al., 2018). Weaknesses in oral language comprehension can manifest years before a child learns to read, suggesting a possible causal link from oral language skills to later RC difficulties (e.g., Eklund et al., 2018; Petscher et al., 2018; Psyridou et al., 2018). Unlike in previous studies, task-avoidant behavior (Hirvonen et al., 2010), shared parent-child book reading, and the teaching of literacy skills at home (e.g., Khanolainen et al., 2020; Mol & Bus, 2011; Silinskas et al., 2020a) were not identified as important factors. Similarly, unlike previous studies (e.g., Snowling & Melby-Lervåg, 2016; Torppa et al., 2011), neither the MLP nor the BReg found family risk to be a significant predictor of RF difficulties in our data. One possible reason, at least for family risk, may be that it was related to poor cognitive skills, and the inclusion of cognitive skills in the predictive models may hide its effect on RD. As the LPA indicated, those belonging to Profile 1, who had the lowest scores in cognitive skills in kindergarten, also had family risk for RD. Another possible reason could be the measure used to assess family risk in this study: a child was considered as having family risk if either the mother or father reported some or clear RD.
This assessment method is not as sensitive as a more formal assessment, considering that in another Finnish sample, parental RD has been found to predict RF over and above children's cognitive skills (Torppa et al., 2011). Family risk may have emerged as a predictor among those in Profile 1 in the LPA because the LPA focuses only on similarities and differences with respect to the kindergarten-age factors and does not include reading skills. Finally, both the MLP and BReg identified gender as an important factor for RF and RC. Many previous studies have reported that females outperform males in reading and that more males than females are identified with RD (e.g., Psyridou et al., 2021; Quinn & Wagner, 2015).
This study has certain limitations that need to be addressed. First, we used only one measure to assess RC. A more extensive assessment could have yielded a more reliable assessment of RC and, subsequently, better predictions. Second, the assessment of listening comprehension was not optimal, and its reliability estimate was quite low. Such low reliability could be one reason why listening comprehension did not emerge as an important factor for predicting RC difficulties. Thus, the results regarding the strongest predictors of RC should be interpreted with caution, as they likely underestimate listening comprehension's predictive value. Third, our sample had missing values. We balanced the imputed cases between the training and testing sets by running the analysis multiple times, but having no missing values would have been optimal. The handling of missing data did not follow the same approach across the three models: for the neural networks model, we used single imputation (an expectation-maximization algorithm in SPSS), while for the linear and mixture models we used the data without imputing missing values. Imputation was conducted for the neural networks model because the MLP cannot be run with missing values. The linear and mixture models, on the other hand, do not have this limitation and can handle missingness directly, so no imputation was needed; for them we used a full information maximum likelihood estimator (the MLR estimator in Mplus). This method yields accurate standard errors, whereas imputed data produce biased standard errors. Had we used the imputed data for the linear and mixture models, we would have obtained parameter estimates equal to those from the MLR method but downward-biased standard errors, because single imputation decreases variability by inserting the most plausible value into the data.
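The single-imputation step described above can be sketched as follows. Here scikit-learn's IterativeImputer stands in for the SPSS expectation-maximization algorithm, so this is an analogous illustration under that substitution rather than the study's exact procedure; the data are synthetic.

```python
# Illustrative sketch: single imputation of missing predictor values before
# fitting an MLP (which cannot handle NaNs). IterativeImputer approximates
# the EM-style step used in SPSS; it is not the same algorithm.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # synthetic predictor matrix
X[rng.random(X.shape) < 0.1] = np.nan    # ~10% of values missing

X_imputed = IterativeImputer(random_state=0).fit_transform(X)
```

Note the trade-off discussed above: the completed matrix lets the MLP run, but because each missing cell is filled with a single "most plausible" value, downstream variability (and hence standard errors in a linear model) would be understated, which is why FIML was preferred for the linear and mixture models.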
Moreover, we think it is important to use the approach considered best for handling missing data with each individual method, as this is also how the methods are used in empirical papers that do not aim to compare methods against each other. Fourth, the family risk variables were based on self-reports with a single question. This assessment may not have provided an accurate evaluation of parental reading skills and thus may underestimate their predictive power. In addition, RF or RC difficulty was defined as scoring in the lowest 10% of the RF or RC distribution, respectively. The selection of the cutoff matters: cutoffs are always somewhat arbitrary, and study outcomes may be affected by the cutoff chosen. However, although the use of cutoffs is likely to introduce uncertainty into research findings because of measurement error (Branum-Martin et al., 2013; Francis et al., 2005; Psyridou et al., 2020), it is also a practical tool for the identification of children with RD. Internationally, different research groups use different cutoffs, such as the lowest 10%, the lowest 25%, 1 SD below the mean, or 1.5 SD below the mean (e.g., Catts et al., 2012: 1 SD; Esmaeeli et al., 2019; Etmanskie et al., 2016: 25th percentile; Snowling et al., 2021). The large sample of the present study allowed the selection of a rather strict cutoff for the identification of RD. Finally, in this study we compared three models on their accuracy in predicting RD and found that the MLP was more accurate than both the LPA and BReg models. Future research should examine how other models (e.g., random forest, LASSO regression) compare with neural networks models.
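A percentile-based cutoff like the one discussed above can be sketched as follows; the score distribution is synthetic, and the two cutoffs shown are examples drawn from the range used in the literature.

```python
# Illustrative sketch: flagging RD as scoring below a percentile cutoff,
# and how the size of the flagged group shifts with the cutoff chosen.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=2000)  # synthetic Grade 9 reading scores

flagged = {}
for pct in (10, 25):                              # two cutoffs used in the literature
    cutoff = np.percentile(scores, pct)
    flagged[pct] = (scores < cutoff).mean()       # proportion identified as RD
    print(f"lowest {pct}%: cutoff={cutoff:.1f}, flagged={flagged[pct]:.1%}")
```

The sketch makes the arbitrariness concrete: moving from a 10% to a 25% cutoff more than doubles the identified group even though the underlying skill distribution is unchanged.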

Conclusions
Overall, the results of this study suggest that the neural networks model (MLP) was the most accurate method, compared with the linear (BReg) and mixture (LPA) models, for the early prediction of adolescent RF and RC difficulties. The MLP was found to be superior even when compared with the combination of the LPA and BReg. Thus, the results indicate that the use of deep neural networks for early RD identification looks promising in the case of data with multiple interacting predictors. Possible reasons for the better performance of the neural networks model are that it does not run into the collinearity problems that linear models do, which allows the inclusion of a much broader set of predictors, and that it can identify non-linear effects that linear models cannot, enabling the detection of more complex patterns in the data. For example, the neural network approach could handle many more variables than the ones we used here (e.g., more information about the home, preschool, early school, or non-cognitive psychological factors), which holds promise for even better prediction of RD. After careful development of a trained model, this method could ultimately be used in schools to aid in the early identification of children at risk of developing RD.
With respect to the further use of neural networks for predicting RD, some practical and ethical considerations must be addressed. First, when training models with many parameters on small sample sizes, it can be difficult to identify a solution that generalizes to the larger population (Whelan & Garavan, 2014). Training the model on different data sets may be necessary to make it more generalizable (Durstewitz et al., 2019). As Appendices B and C show, the correlation between the training and testing models for RF and RC changes depending on the part of the data used. Second, machine learning entails using a trained model to make predictions for new inputs; thus, training the model is critical because later predictions will be only as good as the data used for training. Biased data, whether intentionally or unintentionally biased, can lead to biased results (Ouchchy et al., 2020; Stahl, 2021). As researchers, we must be aware of the models and data used for training. Third, while this type of modeling opens the door to predicting RD early, indicating potential for early support, this early prediction also increases the risk of students encountering discrimination, labeling, and being left behind due to an algorithm's result. Although the neural networks model seemed to provide a more accurate prediction, it was still far from 100% accurate; therefore, false positive and false negative classifications still occurred. Accuracy could, however, be improved by adding more data. In this study, we predicted RF and RC difficulties at one time point (Grade 9) using kindergarten-age factors. Assessing reading skills at more than one time point, particularly during the early phases of reading development, would likely improve the accuracy of predicting adolescent RD. Furthermore, the inclusion of a broader and more reliable battery of measures might help improve predictions of RD.
However, even if prediction accuracy were to rise sharply, in practice, the neural networks model should serve merely as one tool among many when early screening and support decisions are made concerning individuals. We also must remember that early identification is useful and ethically fully justified only if it is followed by adequate support.
In conclusion, neural networks seem to be a valid method for the early identification of those at risk for RD. The MLP was more accurate than both the LPA and BReg models. In the past, linear and mixture models have been used most often for this purpose, but machine learning models, such as the neural networks model described in this study, appear to be more powerful, exceeding both the linear and mixture models as well as their combination.

Disclosure statement
No potential conflict of interest was reported by the author(s).