Developing evaluation tools for assessing the educational potential of apps for preschool children in the UK

ABSTRACT Selecting high-quality apps can be challenging for caregivers and educators. Here we develop tools for evaluating the educational potential of apps for preschool children. In Study 1, we developed two complementary evaluation tools tailored to different audiences. We grounded them in developmental theory and linked them to research on children’s experience with digital media. In Study 2, we applied these tools to a wide sample of apps in order to illustrate their use and to address the role of cost in the quality of educational apps. There are concerns that social disadvantage may lead to a digital disadvantage, an “app gap”. We thus applied our tools to the most popular free (N = 19) and paid (N = 24) apps targeting preschoolers. We found that the “app gap” associated with cost is related only to some aesthetic features of apps rather than to any observable educational advantage proffered by paid apps. Our study adds a novel contribution to the research on children’s apps by developing tools to be used across a wide range of audiences, providing the first description of the quantity of app design features during app use, and evaluating the educational potential of free and paid apps.


Introduction
Touchscreen devices are increasingly popular among children under the age of 5 (e.g., Chen & Adler, 2019). An estimated 80,000 apps claim to be "educational" (Healthy Children, 2018) within the context of an unregulated market. Yet, there is a consensus among researchers that the majority of children's apps advertised as "educational" lack educational value and any foundation in research (Ólafsson, Livingstone, & Haddon, 2013). This means that informed decisions about which apps are high quality can be challenging for parents and educators (Livingstone, Blum-Ross, Pavlick, & Ólafsson, 2018) who could potentially benefit from an app evaluation tool based on early years learning theory. An app evaluation tool could also benefit app developers who want to ensure that the products they create include high-quality features.
As can be seen in Table 1, there are a number of limitations with the existing tools, some of which were identified by the authors themselves. Specifically, almost all the tools have a long list of criteria (18-70+ items) which makes app evaluation time consuming and not practical. The majority of the tools lack examples from children's apps that could allow an in-depth understanding of the descriptors. The descriptors of the items are often not specific enough; they include ambiguous or unclear terminology. Some of the tools also lack theoretical underpinning; they do not draw clear links to developmental theory. Only two of the tools had the content validity assessed, and none of the content validity assessments involved caregivers as participants.
Importantly, only three out of eight tools were aimed at caregivers. Given that preschool-aged children use touchscreen devices frequently (e.g., according to Ofcom (2019), children aged 3-4 years living in the UK spend 48 minutes per weekday playing games on a touchscreen device), it is crucial to help parents select good quality apps for their children. The majority of the tools have not been applied to a wide range of apps in order to demonstrate their use. Moreover, the tools that were applied to a sample of apps did not allow for quantifying the app features during app use and were applied to math and literacy apps only. Finally, some of the tools include subjective criteria, which are difficult for an adult to measure objectively. Therefore, there is a need for a new, improved tool that addresses those limitations.
The aim of this paper was to create two complementary evaluation tools (adapted to the needs of different audiences) assessing the educational potential of apps for preschoolers: (1) A thorough and user-friendly tool accessible to a wide audience: app developers, researchers, caregivers and educators; (2) A tool for researchers that could be used for a more in-depth evaluation by allowing app features to be quantified during app use.
Based on the previous literature on app evaluation tools, we propose a set of principles that should guide the development of such tools: (a) Be informed by the developmental theory and research on children's learning in the context of digital media; (b) Draw clear links to previously developed tools; (c) Be brief, have a simple set of clearly described criteria and clear directions on the scoring system; (d) Focus solely on the objectively measurable factors; (e) Be applied to a wide variety of apps to demonstrate their use; (f) Be validated by conducting content validity and inter-rater reliability.
In building the content of our tools, we relied in particular on the British (Department for Education, 2017) and American (Early Childhood Learning and Knowledge Centre, 2015) early years frameworks, which state that preschool children's development should be supported in the areas of cognitive, academic, social-emotional and physical skills. In the following section, we identify key areas that an evaluation tool ought to include based on previous literature on app evaluation tools, developmental research and theory, and evidence of children's learning from digital media. We also outline a further set of quantity of app features indicators.
Not all of the previous app evaluation tools included the criteria related to meaningful learning and solving problems. We believe that these features are critical to the learning being deeper, authentic and transferable to real life.

Feedback
Feedback plays a critical role in supporting educational performance (e.g., Mulliner & Tucker, 2017; Schwartz, Tsang, & Blair, 2016). Specific, meaningful, timely and structured feedback drives a child's engagement in the activity (e.g., Hirsh-Pasek et al., 2015; Walker, 2010). Moreover, feedback should reinforce the learning goal and scaffold users' understanding of how to improve (see, e.g., Callaghan & Reich, 2018). All the previous app evaluation tools pointed to the significance of feedback. However, not all of them described explicitly how feedback should be presented by providing relevant examples from the apps.

Social interactions
Social interactions support learning from the very early stages of development (see Hirsh-Pasek et al., 2015, for a summary). Social demonstrations enhanced learning in a touchscreen puzzle task in a group of 2.5- and 3-year-olds (Zimmermann, Moser, Lee, Gerhardstein, & Barr, 2017). Apps can involve "parasocial" interactions with animated characters present onscreen, which offer symbolic experiences that can be beneficial for children's social and cognitive development (e.g., Calvert, 2015).
Only some of the previous app evaluation tools recommended the presence of high-quality parasocial interactions in the apps. In our tool we specify how the parasocial character should be interacting with the child in order to support learning.

Activity structure
Apps that give the opportunity for exploratory use alongside structured activities might increase children's intrinsic motivation and engagement. Child autonomy and the sense of agency when using interactive media are crucial for the learning process (e.g., Kirkorian, 2018; Papadakis & Kalogiannakis, 2017). Pre-schoolers who could select their learning experience in a tablet game outperformed those who had no control over the order of presentation of the material (Partridge, McGovern, Yung, & Kidd, 2015).
Importantly, almost none of the previous evaluation tools allowed assessing whether apps promote exploratory use.

Narrative
Media content that is embedded in an entertaining narrative integrated at the heart of the story can benefit children's learning (e.g., Dingwall & Aldridge, 2006). Content directly linked to a narrative of a television program is recalled better than content which is irrelevant to the storyline (Fisch, 2004).
Although the role of narrative for children's learning has been established by previous research, almost none of the evaluation tools included the presence of narrative in assessment criteria.

Language
Appropriately designed digital media can be a valuable source of language input for young children. The presence of good quality language is crucial for educational potential (Rowe, 2012). Studies using lab-designed apps have shown that children aged 2-4 are able to learn labels for novel objects (Kirkorian, 2018; Russo-Johnson, Troseth, Duncan, & Mesghina, 2017) or for real-world objects (Dore et al., 2019). While two of the previous evaluation tools mentioned language as part of some other criteria, none of them focussed on assessing the quality of language directly. We fill this gap in our tool.

Adjustable content
To ensure effective learning, the difficulty level of an app should be automatically adjusted to users' performance (e.g., Callaghan & Reich, 2018). Specifically, each level of an activity should build on the knowledge gained in earlier levels, and increase hints and feedback if a user makes repeated errors (e.g., Revelle, 2013).
The majority of the previous tools included adjustable content in their evaluation criteria, and following the theoretical motivation outlined above, we also include it in our tool.

App design
As highlighted in the previous evaluation tools (e.g., Lee & Kim, 2015), an app's design should be simple and consistent, the style of letters and pictures should be clear, and the arrangement of operating buttons should be appropriate. Unnecessary advertisements, additional in-app purchases and slowly loading content may impede learning. Apps should also be easy to use and always responsive to touch interactions.
All the previous app evaluation tools included app design in their criteria. We also acknowledged its importance for enhancing children's learning experience.

Quantity of app features indicators
The following section presents the indicators for the quantity of app features. For certain features, it is crucial to estimate how often a given feature occurs during app use, in order to determine whether children's learning environment is age appropriate and not overly complex. None of the previous evaluation tools enabled measuring the proportion or frequency of different app features during app use. Thus, the way we measure app features in our quantitative tool is novel.

Touch gestures
The direct manipulation interaction facilitates pre-schoolers' learning from touchscreen media, yet most educational apps only support tap (99% of apps) and drag (56% of apps; Nacher, Jaen, Navarro, Catala, & González, 2015). Nacher et al. (2015) found that infants aged 2-3 perform one-finger rotation and two-finger scale up and down successfully, but find double tap, long press and two-finger rotation challenging. Russo-Johnson et al. (2017) reported that 2-4-year-old children from low SES families learned more novel object labels when dragging objects versus tapping them, perhaps because tapping is a response that does not require active attention.

Active learning
High-quality apps should provide opportunities for active cognition, e.g. making cognitively challenging decisions, and solving problems (e.g., Hirsh-Pasek et al., 2015). Cognitive activities in contrast to stimulus-reaction activities during app use encourage active cognition, while variability across learning encounters has a potential to facilitate learning (e.g., Thiessen, 2011). Thus, a variety of activity goals might contribute to the app being more cognitively active.

Complexity of the learning environment
Background visuals, background sound and other app interactions available on the screen contribute to the complexity of the learning environment. The Cognitive Theory of Multimedia Learning (Mayer, 2005, 2014) predicts that the child's learning might be unsuccessful if the software includes too much extraneous material. Sound effects and animation interfered with story comprehension and event sequencing in children aged 3-6, when compared with paper books (see Reich, Yau, & Warschauer, 2016, for a review). Additional interactions present on the screen alongside the main task can decrease a child's engagement in the app (Hirsh-Pasek et al., 2015).

Feedback
In addition to looking at feedback qualitatively and evaluating its meaningfulness, we can also look at it quantitatively and assess its occurrence in the app, its delivery method (audio, onscreen) and its content (ostensive feedback vs other feedback). Interactive media may enhance learning if they promote contingent responses or guide visual attention to relevant information on the screen (Kirkorian, 2018).

App design sophistication
Elements on the screen during app use can either be static, move in a static way, be fully animated or be partly static and partly animated. When learning challenging or novel information, pre-schoolers might benefit more from observing noninteractive video demonstrations than from using interactive media (e.g., Aladé et al., 2016). Furthermore, sound effects and animation in ebooks can interfere with story comprehension in children aged 3-6 years, when compared with paper books (see Reich et al., 2016, for a review).

The present studies
This paper presents two studies. Study 1 focuses on designing and validating evaluation tools for apps aimed at pre-schoolers (children aged 2-5 years). In order to illustrate the use of our tools, in Study 2 we apply them to apps distinguished in terms of their cost.

Study 1: designing and validating the evaluation tools
Developing the questionnaire for evaluating the educational potential of apps

First stage: creating a list of items and developing a rating scale
Following the literature reviewed in the introduction we defined 12 concepts (items) to be measured in the questionnaire. We included three indicator descriptors to each item (together with a few examples from the apps to each indicator), such that the app could score between 0 and 2 points for each item. The 12 initially constructed items were: Learning goal, Going beyond rote learning, Solving problems, Feedback, Social interactions, Open-ended, Plotline/narration, Appropriateness of language, Customising, Adjustable content, Suitability of design, Usability.

Second stage: conducting a content validity study with experts
Once the first version of our questionnaire was designed, we conducted a content validity study. The study was approved by the ethical review board at the University of Salford. We followed the procedure outlined by McGartland Rubio, Berg-Weger, Tebb, Lee, and Rauch (2003). We recruited three professional design experts (app developers) and three user experts (early years professionals) who shared their feedback on the items' representativeness, clarity and importance in an online survey. The raters were given the following instruction:
"You will be presented with each of the 12 items included in our coding scheme. Please rate each item as follows:
• Please rate the representativeness on a scale of 0-4, with 4 being the most representative. Representativeness is the extent to which each item measures the educational potential of children's apps. Space is provided for you to comment on the item or to suggest revisions.
• Please indicate the level of clarity for each item (how clearly the item is worded), also on a four-point scale. Again, please make comments in the space provided.
• On a scale of 1-10 please rate the importance of each item for measuring educational potential, with 10 being the most important.
Finally, please evaluate the comprehensiveness of the entire coding scheme by indicating items that should be deleted or added."
We calculated the Content Validity Index (CVI) for each item and for the whole scale (based on its representativeness), following the guidelines described in McGartland Rubio et al. (2003). The CVI for each item was computed by counting the number of experts who rated the item as 3 or 4 and dividing it by the total number of experts. The CVI for the whole questionnaire was obtained by calculating the average CVI across the items. A CVI of at least 0.8 is recommended for new measures. All items in our questionnaire scored either 0.8 or 1, and the CVI for the whole questionnaire was 0.88 (see Table 2).
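The CVI calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in the study: the function names `item_cvi` and `scale_cvi` are hypothetical, and the example ratings are invented for demonstration and are not the experts' actual ratings.

```python
from typing import Dict, List


def item_cvi(ratings: List[int], threshold: int = 3) -> float:
    """Share of raters who scored the item at or above the threshold (3 or 4)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(1 for r in ratings if r >= threshold) / len(ratings)


def scale_cvi(ratings_by_item: Dict[str, List[int]]) -> float:
    """Average of the item-level CVIs across all items in the questionnaire."""
    cvis = [item_cvi(r) for r in ratings_by_item.values()]
    return sum(cvis) / len(cvis)


# Hypothetical representativeness ratings from six raters (not the study's data).
example = {
    "Learning goal": [4, 4, 3, 4, 3, 4],
    "Feedback": [4, 3, 3, 2, 4, 4],
    "Usability": [3, 4, 4, 4, 4, 3],
}

for name, ratings in example.items():
    print(name, round(item_cvi(ratings), 2))
print("Whole scale", round(scale_cvi(example), 2))
```

With six raters, an item on which five of them give a rating of 3 or 4 obtains 5/6, or about 0.83, which is consistent with item values reported as 0.8 when rounded to one decimal place.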
The raters did not suggest removing any items. They also rated all items high with regards to the items' importance. Consequently, based on the experts' suggestions, we made modifications to the questionnaire. We merged two pairs of items, i.e. Customising and Adjustable content became Adjustable content; Suitability of design and Usability became App design (according to the raters, the descriptions of these two pairs of items overlapped in terms of content). We also added additional examples from the apps to improve the clarity of the grade descriptors and we reduced the use of technical language in the questionnaire (including rewording some of the items' names, see Table 2).

Third stage: content validity study with caregivers
After introducing the changes to the questionnaire, we determined whether the tool was comprehensible to caregivers. We recruited six caregivers of children aged 2-5 years to rate the representativeness and clarity of each item and provide further comments. The caregivers were given the same instruction as the experts in the first content validity study. The CVI for the whole tool based on caregivers' ratings was high, 0.75 (see Table 2).
Based on the caregivers' comments, we made further modifications to the questionnaire. Most importantly, the participants from both content validity studies pointed out that while social interactions are important for learning, the development of skills for independent learning is also important and social interactions are not congruent with the reasons caregivers might choose apps (see Broekman, Piotrowski, Beentjes, & Valkenburg, 2016 for a similar argument). To accommodate this, in our tool we focused on the high-quality parasocial interactions in the apps rather than interactions with adults during app use. Our evaluation questionnaire is presented in Table A1 in Supplemental materials.

Developing the coding criteria for quantifying the app features
In addition to the questionnaire that can be easily used by caregivers and educators, we also aimed to develop a tool allowing researchers a more in-depth, quantitative assessment of apps' features. For the coding criteria, following the literature review outlined in the introduction, we grouped the app features into five broader areas. Each of these areas contains between 1 and 3 coding criteria: (1) Touch gestures (2) Active learning (a) Activity goal (b) Activity type

Study 2: applying the evaluation tools to illustrate their use and to measure the app gap
In Study 2, we applied the evaluation tools to a sample of paid and free apps in order to illustrate their use and to assess the role of cost in app quality. Digital media is now embedded in family life (Livingstone et al., 2018) and as a result there are concerns that social disadvantage could extend to a digital disadvantage (Vaala, Ly, & Levine, 2015; Zhang & Livingstone, 2019), the so-called "app gap" (Common Sense Media, 2013). The app gap can be observed, for example, in the availability of devices to go online in the household, caregivers' digital skills and the cost of devices (Zhang & Livingstone, 2019). Furthermore, lower socio-economic status parents might not be able to spend substantial quality time with their children (Department for Education, 2020).
It is important to understand whether there are differences between apps that might justify differences in cost. In the present study we focus on a broad distinction between apps that are free at the point of initial access versus apps for which payment at initial access is required. Parents might not be aware of the variety of factors contributing to the app cost (e.g., business decisions that influence app developers' app pricing strategies, including the size of the market, funding opportunities, app's unique selling point) and they might link the higher cost to higher quality of app.
According to the Department for Education research report (2020), children aged 0-5 years living in lower-income households in the UK use educational apps more often than their affluent peers. However, parents in higher-income households are more likely to pay for an educational app. It is therefore crucial to establish whether children from less affluent families are disadvantaged with respect to the quality of educational apps that they use.
To the best of our knowledge, to date only one study (Callaghan & Reich, 2018) has investigated the differences between free and paid educational math and literacy apps. However, Callaghan and Reich (2018) did not investigate the frequency of app features during app use but limited their analysis solely to identifying whether or not a given feature is present in the app.

App selection
We coded 44 of the most popular apps in the Google, Amazon and Apple app stores. To be included in this study, apps had to target children aged 2-5 years and feature in the top 10 lists for free and paid apps in each app store. Apps were identified on 7 June 2018. Of the 60 apps identified (the top 10 free and top 10 paid apps in each of the three stores), 10 were removed as duplicates and 6 were excluded (5 were video-based and only allowed passive use; 1 was unresponsive after installation). The remaining 44 apps were included in the study.

App use
Each app was downloaded and a screen recording was taken while the first author used the app for 5 minutes with a systematic approach to exploring all the features. The 5-minute sample was motivated by practical constraints in terms of the intensity of encoding of the detailed app features in ELAN (described in the coding section), as well as being more practical for caregivers and educators in appraising an app in an efficient amount of time, based on our evaluation questionnaire.
To maintain parity in approach to data capture across apps, the systematic approach by the first author was to follow all the activities in an order suggested by the app design and to use all the available features on each screen only once.

Questionnaire for evaluating the educational potential
Each app could score between 0 and 20 points on the educational potential index (0-2 points for each of the 10 items; see Table A1 in Supplemental material). The 5-minute app screen recordings were assessed individually by the first and last author using the scheme. Discrepancies were discussed and resolved between the coders. Inter-rater reliability was high (κ = .889, p < .001). Internal consistency of the tool was good (Cronbach's alpha = 0.81), further validating the tool.
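For readers who wish to reproduce this kind of scoring, the following Python sketch shows how an index of summed item scores and Cronbach's alpha could be computed from an apps-by-items score matrix. It uses the standard formula for alpha; the function names are hypothetical, the example scores are invented (with only four items for brevity, where the questionnaire has ten), and nothing here is taken from the study's own analysis scripts.

```python
from statistics import variance
from typing import List


def total_scores(scores: List[List[int]]) -> List[int]:
    """Index per app: the sum of its item scores (each item scored 0-2)."""
    return [sum(app) for app in scores]


def cronbach_alpha(scores: List[List[int]]) -> float:
    """Cronbach's alpha, where scores[i][j] is the score of app i on item j."""
    k = len(scores[0])                      # number of items
    item_columns = list(zip(*scores))       # one tuple of scores per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance(total_scores(scores))
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


# Hypothetical scores for five apps on four items (not the study's data).
example_scores = [
    [2, 2, 1, 2],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [2, 1, 2, 2],
    [1, 0, 0, 1],
]

print(total_scores(example_scores))
print(round(cronbach_alpha(example_scores), 2))
```

Sample variances are used for both the items and the totals, so the ratio inside the formula is the same as it would be with population variances.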

Coding criteria for quantifying app features
To enable coding for quantifying app features, screen recordings of the app use were coded in ELAN 5.2, software that enables adding annotations to audio and/or video streams. The coder (first author) coded each screen during the app use for the 11 coding categories (see Appendix B in Supplemental material for the details on the coding and scoring). Inter-rater reliability was determined by comparing the coding of the primary coder with the coding of a trained double coder who coded data from 5 apps independently. Inter-rater reliability was κ = 0.917, p < .0001. The small number of discrepancies was resolved by the first coder.
Additionally, in order to determine whether the majority of app features could be captured in 5 minutes of app use (regardless of the person using the app and their style of app use), we calculated inter-user reliability. This was determined by comparing coded app use data for 5 apps that were also used by a second independent user. Crucially, the second user did not receive any instruction on using the apps. Overall, inter-user reliability was κ = 0.872, p < .001, which shows that the same app features can be captured during 5 minutes of app use, regardless of the user.
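The reliability statistics reported in this section are κ coefficients. The text does not state which variant was used; assuming Cohen's kappa for two coders assigning one category per coded unit, the computation can be sketched as follows. The function name and the example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(coder_a: Sequence[Hashable], coder_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two coders who each labelled the same units."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty set of units")
    n = len(coder_a)
    # Observed agreement: share of units given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions per label.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical gesture codes for four screens (not the study's data).
first_coder = ["tap", "tap", "drag", "drag"]
second_coder = ["tap", "tap", "drag", "tap"]
print(cohen_kappa(first_coder, second_coder))
```

In this toy example the coders agree on three of four screens (observed agreement 0.75) while chance agreement is 0.5, giving κ = 0.5. The same function could be applied to the inter-user comparison by treating the two users' coded sessions as the two sets of labels.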

Results
To illustrate the use of the tools in practice, we report differences between free and paid apps. This also enables us to determine whether there is an app gap in quality that is reflected in cost, which could contribute to a digital disadvantage. The final sample included 19 free and 24 paid apps (one app was excluded because it was duplicated between two app stores and was listed as free in one store but required payment in the other).
We first report the results from the analysis of the questionnaire for evaluating the educational potential, and then the analyses of coding criteria for quantifying the app features.

Evaluating the educational potential
To test whether there is a difference in educational potential between free and paid apps, a Mann-Whitney U test was performed. The results show that free apps (M = 7.16, SD = 3.70) did not differ from paid apps (M = 6.75, SD = 4.60) on the educational potential index (U = 211, Z = −0.405, p = 0.685, r = −0.06). Figure 1 presents cumulative scores for each of the items in the evaluation questionnaire for the whole app sample (0-2 points for each item, 43 apps in the sample; maximum score was 86). Suitability of design and quality of language received the highest scores (58 and 54, respectively), while adjustable content and social interactions appear among those with the lowest scores (8 and 13, respectively).
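The reported effect size is consistent with the common convention of dividing the standardised test statistic by the square root of the total sample size, r = Z / √N. The paper does not state the formula, so the following check is a sketch under that assumption; the function name is hypothetical.

```python
import math


def effect_size_r(z: float, n_total: int) -> float:
    """Effect size r for a Mann-Whitney U test, assuming r = Z / sqrt(N)."""
    return z / math.sqrt(n_total)


# Z value and group sizes as reported in the text (19 free + 24 paid apps).
r = effect_size_r(-0.405, 19 + 24)
print(round(r, 2))  # -0.06
```

The same arithmetic gives the maximum cumulative score per item: 43 apps scored 0-2 each yields 43 × 2 = 86.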

Quantifying app features: analyses comparing free and paid apps
First, we present the descriptive statistics for app features coded in the study (see Table 3).
The analyses comparing free and paid apps are presented in Table 4. Overall, the free and paid apps differed significantly on only two features: (1) the mean number of screen elements, with paid apps having on average more screen elements than free apps; and (2) object property, with free apps having a higher frequency of animation than paid apps, but with no differences in other object properties between free and paid apps.

Discussion
The primary aim of this paper was to report the design and development of two novel, transparent and comprehensive tools for evaluating the educational potential of apps aimed at 2-5-year-old children: a questionnaire aimed at a wide audience, and coding criteria for measuring the quantity of app features aimed at researchers.
The tools were developed specifically for evaluating apps targeting pre-schoolers; they were guided by the early years foundation frameworks and informed by the developmental theory and research on children's learning from digital media. The development of the tools was preceded by a careful analysis of the previously designed evaluation tools. We identified several limitations in the previous tools, such as a long list of criteria which are not specific enough, no direction to quantify app features, and inclusion of technical language. We designed our tools with the aim to address those limitations. We also demonstrated the use of our tools on a wide range of most popular children's apps. We added a novel contribution to the research on children's apps by evaluating both the educational potential of apps and by providing the first description of the quantity of app design features during app use.
Our tool is the first to have had content validity assessed by caregivers as well as experts. We made further amendments following comments from caregivers to ensure that our tool did not include technical language. The use of examples from existing apps in our tools means that users do not require any existing knowledge of early years education frameworks, which was a common limitation of previous tools (Department for Education, 2019; Hirsh-Pasek et al., 2015; Lee & Sloan Cherner, 2015). The next step in validating the tools will be to determine how preschool children interact with the apps and evaluate, rather than predict, the educational potential of the children's interactions. A further point for future investigation is how the various features in apps interact with one another. This is ongoing work in our lab.
Our tool development resulted in a measurement of apps in terms of an educational potential index, which was shown to be high in content validity, internal consistency and in inter-rater reliability. The comparison between free and paid apps on this index did not reveal any difference between apps. It is worth noting that the mean scores on the educational potential index for both groups were rather low (on average less than 10 out of 20). This suggests that the free and paid apps appeared to be equally low in terms of their educational potential, which is consistent with other studies underlining the disparity between the number of self-proclaimed educational apps in the markets and their poor educational value (Chau, 2014; Goodwin & Highfield, 2012; Hirsh-Pasek et al., 2015; Papadakis et al., 2018; Vaala et al., 2015).
The whole app sample showed strength as far as suitability of design and language were concerned (see Figure 1). High scores on suitability of design suggest that the apps were well prepared from the technical perspective. However, the apps showed weakness in terms of the more educational evaluation criteria, such as meaningful learning, offering users problems to solve or having a learning goal, which suggests that they do not offer a meaningful and cognitively active learning experience (in line with Papadakis et al., 2018). The apps in our sample also scored low on social interactions; they rarely encouraged high-quality interactions with characters onscreen (in line with Papadakis et al., 2018; Vaala et al., 2015).
Additionally, the apps in our sample scored particularly low on adjustable content. This means that they lacked flexibility in changing the settings and did not tailor content to users' performance. Apps should adjust the content to the user's needs if they intend to increase the user's motivation and allow for gradual progress in learning (e.g., Callaghan & Reich, 2018; Papadakis & Kalogiannakis, 2017). This finding is again in line with the previous studies, which found that less than 20% (Callaghan & Reich, 2018; Vaala et al., 2015) or none of the reviewed apps (Papadakis et al., 2018) included adjustable content. Overall, our findings highlight the need for developmental psychologists to work with app developers to advance the educational potential of touchscreen apps.
As a secondary aim, we compared the free and paid apps on the coding criteria for quantifying app features in order to assess the "app gap" associated with app cost. The free and paid apps differed only on two features: (1) the number of screen elements, with paid apps having on average more elements on the screen than free apps; and (2) the frequency of animation, with free apps having more animations than paid apps. Considering that only two differences were observed, it can be concluded that free and paid apps did not differ substantially either in their educational potential or in their features and design. This is partially in line with the content analysis of Callaghan and Reich (2018) who also did not find many differences between free and paid apps with respect to their educational features. Our results suggest that paid apps might not necessarily guarantee a better app quality than free apps, at least based on our app sample.
This study also gives an insight into the educational quality and design features of apps targeting pre-schoolers. Crucially, none of the previous app evaluation reviews quantified the app features during app use within the evaluated app sample. Thus, our descriptive statistics (see Table 3) are the first to present the frequency of various app features during app use, based on a wide sample of apps. In our sample, all apps had a higher frequency of cognitive activities than stimulus-reaction activities. Complex sound (two or more sounds playing simultaneously) was more frequent across all the apps than simple sound, which might add to young children's cognitive processing load while using apps (e.g., Mayer, 2014). The apps had on average 5 screen elements on each screen, and 18 different activity goals during the 5-minute use. On each screen, apart from the target interaction, there were on average 2 additional interactions available. Apps in our sample offered a high proportion of feedback to users' responses (78%), as compared to no feedback during app use (see Callaghan & Reich, 2018, for similar results), and a high proportion of that feedback was ostensive (74%), i.e. referential cues to indicate what is to be learnt. Those characteristics can serve as a reference point for other studies on app features.

Conclusion
In conclusion, we have presented comprehensive evaluation tools based on theories of learning and cognitive development and have shown how they can be implemented in the analyses of apps available to children. We found that the app gap associated with cost was not an issue in terms of the educational potential for the most popular apps currently available. The app gap is related to aesthetic features of apps rather than any observable cognitive advantage proffered by paid apps.

Notes
1. We use the term "evaluation tool" to refer to rubrics, frameworks and schemes for consistency throughout the paper.
2. The cumulative scores were not presented separately for the two groups due to the differences in sample size between the groups. We also did not present mean scores for each item for the two groups because each item was measured only on a scale 0-2.