Automatic Speaking Assessment of Spontaneous L2 Finnish and Swedish

The development of automated systems for evaluating spontaneous speech is desirable for L2 learning, as it can be used as a facilitating tool for self-regulated learning, language proficiency assessment, and teacher training programs. However, languages with fewer learners face challenges due to the scarcity of training data. Recent advancements in machine learning have made it possible to develop systems with a limited amount of target domain data. To this end, we propose automatic speaking assessment systems for spontaneous L2 speech in Finnish and Finland Swedish, comprising six machine learning models each, and report their performance in terms of statistical evaluation criteria.


INTRODUCTION
Technology has opened the door to new possibilities in language testing and assessment. Machine learning provides a means to automatize L2 proficiency testing, but to this day applications have been more common for written than spoken language. Developing automatic systems for assessing spontaneous speech has become highly desirable in the context of L2 learning because they promote and democratize self-regulated learning, as well as serve as a facilitating tool in language proficiency assessments and teacher training programs. Such systems are typically developed for languages with a large number of learners due to the abundance of training data, yet languages with fewer learners such as Finnish and Swedish remain at a disadvantage due to the scarcity of required training data. Nevertheless, recent advancements in the field of AI manifested in self-supervised machine learning methods (Al-Ghezi et al., 2021; Devlin et al., 2019) make it possible to develop automatic speech recognition (ASR) systems with a reasonable amount of training data, which makes it feasible to develop automatic speaking assessment systems for under-resourced languages.
This article describes the development steps of the first prototype of an automatic assessment system for spontaneous L2 Finnish and Finland Swedish speech and reports the initial evaluation of the system. In addition to supporting self-regulated learning purposes, the tool could be used in a national school-leaving exam for scoring.

BACKGROUND
Automatic assessment of non-native speech started with the focus on segmental pronunciation in read speech (Bernstein et al., 1990). Read speech has been favored due to its predictability in eliciting speech from different speakers, leading to greater accuracy in speech recognition and modeling of pronunciation features. The more unconstrained the speaking tasks are, the more unpredictable the elicited speech becomes, which causes problems for automatic scoring. The ability to produce spontaneous speech is, however, essential in human communication, and therefore researchers started to study the automatic assessment of spontaneous speech in the early 2000s, focusing on fluency features (Cucchiarini et al., 2002). Here we use the term spontaneous in comparison to read or imitated speech, although all responses to instructed tasks are still predictable to some extent.
Validation of automated speaking tests should start with identifying the role of automated scoring in the context of use (Xi, 2021). In the current study, we investigated the relationship between machine and human scoring to pave the way for their future combination in the scoring of learners' speaking performances in the ME. For pedagogical training purposes, machine scoring may serve as a useful tool to promote learner autonomy and self-efficacy. Our focus in this study is to establish a validity argument for using automated scoring in conjunction with human scoring in the future. Validation as an argumentation chain has been developed in a number of seminal works (Bachman & Palmer, 2010; Chapelle et al., 2011; Downing & Haladyna, 2006), and more recently by Xi (2021) specifically addressing validation of automated test administrations.
A few key steps are followed in this investigation including defining the construct and language use domain, designing the tasks drawing on the national core curricula, and monitoring the accuracy of the ASR and human-machine score alignment to explain and evaluate the quality of the observed score as an accurate indicator of students' speaking performance. Particular attention is paid to the relevance and accurate measurement of the construct when subjected to automated evaluation, a crucial aspect as voiced by Xi (2021).
Developing a validity argument involves attaining an acceptable balance between competing aspects. Automated tests increase the reliability in terms of the consistent scoring of test takers, which is vital in high-stakes assessments. On the other hand, the current limitations of automated task types would compromise the authenticity and construct representation of the assessment. The constrained task types that elicit read speech or short predictable phrases narrow the construct of oral competence. However, developments in the automated evaluation of more spontaneous speech samples are promising in this regard (Xi et al., 2008). Free speaking tasks, such as speaking on a given topic, picture description, and responding to verbal inputs, require learners to produce open-ended responses. Compared to constrained task types, free-speaking tasks are considered more authentic and better aligned with communication-oriented oral constructs. For such tasks, it is possible to use a task-specific language model with a vocabulary constrained to the task domain to improve the performance of the ASR.
Using tasks that generate both scripted and spontaneous speech enables a more comprehensive evaluation of learners' speaking skills: Mechanical tasks such as read-aloud and sentence repetition can be used for measuring processing speed (Van Moere, 2012) or specific pronunciation features such as stress production (Kallio et al., 2020, 2022), while tasks eliciting unconstrained speech also enable the assessment of lexical, grammatical, and cohesion skills (Winke & Brunfaut, 2020). It is noteworthy, however, that the current automated speaking tests cannot cover all aspects of L2 speaking proficiency: Current tests mainly include monologic tasks, which do not assess some of the higher-order skills such as interactional competence (Xi, 2010).
To date, several automated systems for semi-direct spoken L2 assessment have been introduced. The dominant automatic assessment systems have been developed for English as a second or foreign language (Educational Testing Service, n.d.; Pearson, 2017; Xi et al., 2008; Xu et al., 2021), which often use massive data sets, enabling advanced and elaborate machine learning. Versant has also developed tests for Spanish, Dutch, French, and Arabic, the last of which could be classified as a low-resourced language (Pearson, 2023). The target languages in the present study, Finnish and Finland Swedish as L2, are also considered low-resourced languages. Due to data scarcity, the possibilities to develop automated L2 speech assessments for these languages have been limited compared to the well-known systems for English. Research on features related to oral proficiency in Finnish and Finland Swedish has also been quite rare (Kallio et al., 2017, 2020, 2021, 2022, 2023).
Gu and Davis (2020) described a more comprehensive, automated diagnostic feedback system developed for spontaneous speech. In addition to reporting proficiency level, the system aims at providing both analytic and holistic feedback to the learners (von Zansen & Heijala, 2023; von Zansen & Huhta, 2022). Similarly, the Versant tests for Arabic and English (Pearson, 2018, 2020) report an overall score and four diagnostic scores, while the Linguaskill Speaking Test (Xu et al., 2020) provides a proficiency level as feedback to the learner. Automated feedback is beyond the scope of this paper as we will focus on automated assessment of vocabulary, grammar, pronunciation, and fluency in developing a comprehensive spoken assessment system.
A combination of human and machine scoring seems to be the most appropriate alternative to human scoring to date (Evanini & Zechner, 2020), since some speech features, such as pronunciation and fluency (particularly in short phrases), are measured more accurately by a machine, whereas more extended spontaneous speech, let alone complex verbal interactions, are most appropriately left for humans to judge. Furthermore, strong correlations between machine-scored and human-scored tests have been found (Bernstein et al., 2010), which is promising for using machine scoring in large-scale assessments. Bernstein et al. (2010), however, compared scorings from two distinct constructs: Automated scores were derived from test-takers' constrained speech, while human scores were from interactive speech tasks.

L2 automatic speech recognition
In the context of automated spoken language assessment, the development of high-performance ASR systems is crucial, and for this purpose, large amounts of transcribed speech data should be used for training. Unlike for languages with more learners such as English and Spanish, collecting adequate training data for languages with fewer learners such as Swedish and Finnish may not always be feasible.
Due to data scarcity, low-resourced L2-ASR systems are often developed using a pipeline ASR paradigm, where custom-engineered solutions are applied at each stage of the pipeline to improve performance. For instance, in acoustic modeling, each word has multiple pronunciations to accommodate the mispronunciation of words by L2 speakers. In some cases, custom solutions for acoustic and language modeling require specialized language expertise and are not cost-effective. Therefore, developing end-to-end L2-ASR systems that do not require separate pronunciation or external language modeling is highly desirable in the context of L2 ASR for low-resourced languages. However, developing end-to-end ASR systems requires a large amount of labeled (transcribed) data, which is not always attainable.
Self-supervised learning (SSL) has emerged as an effective technique to bridge this gap. The key idea is to learn general representations in settings where large amounts of unlabeled (untranscribed) source data are available, thereby leveraging them to improve the performance of downstream target tasks with limited amounts of labeled data. This is especially interesting for tasks that require considerable effort to obtain labeled data, such as speech recognition. In this work, an SSL acoustic model called Wav2Vec2 (Baevski et al., 2020) was incorporated in an end-to-end ASR pipeline for both Finnish and Finland Swedish.

Scoring features in automated speaking assessment systems
In this section, we provide some background for designing an automatic speaking assessment system for spontaneous L2 speech of Finnish and Finland Swedish. In many L2 tests and studies, the main aspects of speaking proficiency relate to speech fluency, pronunciation, vocabulary, and grammar (Baker-Smemoe et al., 2014; Educational Testing Service, n.d.; Gretter et al., 2019; Kallio et al., 2022, 2023; Kang & Johnson, 2018; Pearson, 2017, 2018, 2020). Machine-derived measures also cover these dimensions of speaking proficiency, and although L2 proficiency is a broad concept, these measures have proved to be good predictors of overall oral proficiency (Baker-Smemoe et al., 2014; Cox & Davies, 2012; Kang & Johnson, 2018). Further explanations are provided below on the dimensions of speaking proficiency regarding measures that can be integrated into automated systems assessing spontaneous L2 speech.
First, vocabulary and grammar are two crucial dimensions in assessing L2 speaking. Lexical diversity or range refers to the learners' sophistication in vocabulary use (Lu, 2012). Range has been commonly measured using features like "number of different words/tokens" and "type-token ratio" (Zechner & Evanini, 2019). Other variant features include the OVIX lexical diversity measure (Hultman, 1994) for automated assessment of L2 Swedish (Östling et al., 2013). Grammatical accuracy has been quantitatively measured using a set of syntactic features derived from automated text annotation tools such as part-of-speech tagging and constituency and dependency parse trees (Östling et al., 2013; Zechner & Evanini, 2019).
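As an illustration of the range features mentioned above, the following minimal sketch computes the number of types, the type-token ratio, and OVIX from a tokenized transcript. It is not the feature extractor used in this study: the function name `lexical_diversity`, the lowercasing step, and the example sentence are assumptions made for illustration, and OVIX is written in its commonly cited form, log(N) / log(2 - log(V) / log(N)), with N tokens and V types.

```python
import math

def lexical_diversity(tokens):
    """Return simple lexical range features for a tokenized transcript."""
    tokens = [t.lower() for t in tokens]  # assumption: case-insensitive types
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ttr = n_types / n_tokens if n_tokens else 0.0
    # OVIX is undefined for fewer than two tokens or when every token is unique
    # (the denominator log(2 - 1) would be zero), so NaN is returned there.
    if n_tokens < 2 or n_types == n_tokens:
        ovix = float("nan")
    else:
        ovix = math.log(n_tokens) / math.log(
            2 - math.log(n_types) / math.log(n_tokens)
        )
    return {"tokens": n_tokens, "types": n_types, "ttr": ttr, "ovix": ovix}

# Hypothetical learner response (not taken from the corpus).
features = lexical_diversity(
    "jag tycker om att läsa och jag tycker om att skriva".split()
)
```

Because all of these measures are computed on ASR output, recognition errors that create spurious word forms inflate the type count, which is one reason such features are sensitive to transcript quality.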
Research on automatic assessment of spontaneous L2 speech initially focused on fluency features measured by temporal features, some of which are associated with strong fluency and others weak fluency (Cucchiarini et al., 2002). A ternary division by Tavakoli and Skehan (2005) introduced three components related to speech fluency: (1) speed fluency, generally measured as speech rate, articulation rate, or mean length of syllables; (2) breakdown fluency, generally measured as the frequency, length, and/or relative amount of silent and filled pauses in an utterance; and (3) repair fluency, referring to the partial or full repetition of words, syllables, or entire phrases, false starts, or reformulations. These three dimensions have guided research on L2 fluency, and measures of speed and breakdown fluency, in particular, have been found to correlate with assessments of fluency (Bosker et al., 2013; Cucchiarini et al., 2002; Derwing et al., 2004; Kormos & Dénes, 2004; Lennon, 1990; Préfontaine et al., 2016) and oral proficiency (Iwashita et al., 2008; Kallio et al., 2017, 2022; Kang & Johnson, 2018). Recently, researchers have also called for accounting for the locations of pauses with respect to syntactic constituents in modeling fluency (De Jong, 2018; Hsieh et al., 2020; Kallio et al., 2022).
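To make the speed and breakdown measures concrete, the sketch below derives speech rate, articulation rate, and silent-pause statistics from word-level time stamps. It is an illustrative assumption rather than the study's implementation: the input format (token, start, end, syllable count), the 0.25 s silent-pause threshold, and the function name `fluency_features` are all hypothetical, and filled pauses and repairs are not handled.

```python
def fluency_features(words, min_pause=0.25):
    """Compute basic temporal fluency measures.

    words: time-ordered list of (token, start_sec, end_sec, n_syllables).
    min_pause: gaps at least this long (seconds) count as silent pauses.
    """
    total_time = words[-1][2] - words[0][1]
    syllables = sum(w[3] for w in words)
    # Gaps between the end of one word and the start of the next.
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    pause_time = sum(pauses)
    return {
        # Speed fluency: syllables per second including / excluding pauses.
        "speech_rate": syllables / total_time,
        "articulation_rate": syllables / (total_time - pause_time),
        # Breakdown fluency: how often and how long the speaker pauses.
        "pauses_per_minute": 60 * len(pauses) / total_time,
        "mean_pause_length": pause_time / len(pauses) if pauses else 0.0,
    }

# Hypothetical alignment for a short Finnish utterance.
example = [
    ("minä", 0.0, 0.4, 2),
    ("olen", 0.5, 0.9, 2),
    ("opiskelija", 1.5, 2.5, 5),
]
fluency = fluency_features(example)
```

In practice the time stamps would come from a forced alignment or from the ASR output, so the reliability of these measures depends on the alignment quality.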
As for pronunciation, automated assessment is generally based on acoustic model scores or phone likelihood measures (Hsieh et al., 2020; Loukina et al., 2017). A phone from the output of a speech recognizer is compared to the corresponding phone in a pronunciation model trained on a large corpus of (generally) native speech. The more the L2 pronunciation differs from the model, the lower the assumed proficiency. Automatic measures of prosodic features aim to capture the relevant rhythmic and tonal properties of speech, such as realizations of prominence and intonation (Kang & Johnson, 2018).

DATA COLLECTION AND PROCESSING
L2 speech corpora for training the automatic speaking assessment systems were human labeled. First, diverse sets of Swedish and Finnish speaking tasks were designed. Second, grading rubrics were developed to evaluate L2 learners' proficiency on multiple dimensions. Next, learners' speech samples were collected, manually transcribed, and scored by human raters using the rubrics. Finally, data were selected for our machine-learning experiments.

Test format and task design
Four speaking tests were used: one for L2 Finland Swedish, developed in an earlier project (Karhila et al., 2016), and three for L2 Finnish, developed in the current project (von Zansen, 2022a, 2022b, 2022c). All tests were computer delivered and included both read-aloud and freeform tasks. This study focused on the freeform tasks including semi-structured and open-ended ones (Luoma, 2004). The original descriptions of the freeform tasks are listed in Tables A1 and A2 of Appendix A for Finland Swedish and Finnish, respectively.
The Swedish speaking test included a total of six tasks with subtasks designed to cover various dimensions of speaking proficiency, differing in formality and complexity. In addition to the read-aloud tasks, the test included three other task types: situational reacting tasks that involve reacting to different situations based on written prompts (10 seconds time limit/response) or a picture with text clue (30 seconds time limit/response); a simulated video phone call with pre-recorded questions and replies from a native speaker of the target language (10 seconds time limit/answer and 20 seconds time limit/question); and a live dialogue task with a peer.
The three Finnish speaking tests were delivered using Moodle's Quiz module. Due to the COVID-19 pandemic, the data were collected remotely by giving instructions via Zoom and Teams (see von Zansen et al., 2022a; von Zansen & Hilden, 2022).
First, two test versions targeting B1 (von Zansen, 2022c) and B2 (von Zansen, 2022d) levels were designed following the goals, contents, and target-level descriptions of the National Core Curriculum (NCC) for upper secondary education (Finnish National Agency for Education, 2015, 2019). Both tests included read-aloud and production tasks (semi-structured and open-ended) with a 40-minute time limit. In the semi-structured tasks, the learner was, for example, asked to briefly reply to a comment in a webinar or answer a question during a simulated phone call (time limit 15 seconds/response). The open-ended tasks included talking about a given topic for 1 minute and describing and comparing pictures. The open-ended topics ranged from the B1 version's everyday life situations to genetically manipulated foods in B2 (see also von Zansen et al., 2022b).
Second, the A-level speaking tasks (von Zansen, 2022a) were developed in cooperation with the university teachers teaching beginner-level Finnish courses to match the course learning objectives. However, the B1 and B2 speaking tasks were used as an important starting point for the A-level task design. This approach was justified since the usefulness of the tasks targeting B-level speakers had been investigated from the perspectives of human rating (von Zansen, Kallio et al., 2022), language learners' perceptions (von Zansen et al., 2022b), and functioning of the rating scales (von Zansen & Huhta, 2022). As a result, the A-level tasks included read-aloud, semi-structured, and open-ended tasks (see also von Zansen & Hilden, 2022).

Rating scales
Since the current NCCs (Finnish National Agency for Education, 2015, 2019) highlight communication and interaction skills, which cannot be properly measured automatically, the level descriptors from the previous NCC scale (Finnish National Agency for Education, 2003), which describe speaking skills in more detail, were applied (for scale validation, see Hildén & Takala, 2007). The Finnish NCC scales are local applications of the CEFR (Council of Europe, 2001) and divide the CEFR levels into two or even three sublevels. For example, level A2 is split into A2.1 and A2.2. However, when piloting the scales, the need to simplify the rating of learners' overall speaking proficiency was noted. As a result, only the main levels A1-C1 were used with no further division into sub-classes. In addition, C2 was added to the scale using the CEFR descriptors, since the NCC only goes up to C1. Finally, an extra level below A1 was included, resulting in a 7-level rating scale. The rating scales can be found in von Zansen (2022a).
For the rating of specific dimensions of speaking, five analytic scales were designed by selecting key descriptors for each dimension from the overall NCC scale (Finnish National Agency for Education, 2003). To simplify the raters' task, the analytic scales were shorter (three levels/points) and simpler (few descriptors per level; see Table B1 in Appendix B). As an initial step in validating the scales, feedback was gathered from the raters in an earlier phase of the project, which led to the addition of a fourth point to some of the scales to allow finer distinctions to be made. The final analytic dimensions were task completion (3 points), fluency (4 points), pronunciation (4 points), range (3 points), and accuracy (4 points).

Speech data and human ratings
In this study, recordings from non-native Finland Swedish (n = 181) and Finnish (n = 325) speakers were used. The Swedish data were collected from upper secondary school students in 2015 (Karhila et al., 2016), while the Finnish data were collected in 2021 from upper secondary school (von Zansen et al., 2022b) and university students (von Zansen & Hilden, 2022). Table 1 presents the detailed characteristics of the two datasets.
The participants took the test either in classroom environments using school headsets or at home using their own microphones. As a result, the recordings varied in terms of audio quality and the amount of background noise. Some of them were rejected by human transcribers before starting the rating process.
Human raters were recruited and trained to assess the collected speech samples by using one holistic and five analytic rating scales. The analytic scores were given independently after giving the holistic score. The Swedish recordings were assessed by 18 raters in 2020 and the Finnish recordings in 2021-2022 by 26 raters (for details concerning human ratings, see von Zansen, Kallio et al., 2022). The raters participated in thorough online training, after which they rated the samples between December 2020 (Swedish) and April 2022 (Finnish). The raters proceeded by scoring one sample at a time: that is, rating all the dimensions for one sample before moving to the next sample.
An overlapping rating design was used, with most performances rated by at least two raters, which allowed the ratings to be analyzed by Facets (Linacre, 2020) to check for the quality of the ratings and the rating scales (von Zansen & Huhta, 2022). Fair average values for each sample were used to represent human ratings instead of raw means as they were adjusted for rater severity.
The analytic scales were further validated in our other study, which showed that the raters and the rating scales functioned as expected (von Zansen & Huhta, 2022). Furthermore, the Facets analyses of the ratings reported in the study demonstrated the scales functioned adequately when there were enough samples per scale point.

Data preparation
In this work, only the samples with a human score for every rating criterion (i.e., no criterion was marked as "non-gradable" by any of the raters) were used to eliminate potentially problematic samples. In addition, only tasks that generated freeform speech were analyzed. After filtering, the size of the Finland Swedish and the Finnish datasets was reduced to 1,542 recordings (5.5 h) and 2,112 recordings (14.1 h), respectively.
Most samples in the remaining recordings were rated by at least two human raters. In the Finland Swedish subset, 1,360 out of 1,560 speech samples were rated by two raters, 42 by three raters, 39 by five raters, and the final 101 recordings by one rater. In the Finnish subset, 1,785 out of 2,112 recordings were rated by two raters and 288 by one rater. In addition, there was a control set of L2 Finnish recordings which was distributed to all raters. As a result, three recordings were rated by 24 raters, and 17 samples were rated by 25 raters.
In this research project, a partial overlapping rating design was used to save resources and ensure the quality of ratings (more details are provided above in Speech data and human ratings). To combine the ratings from several raters into a single score per rating criterion, the scores were averaged between raters and then rounded.
For evaluation, several factors should be considered. First, there was a limited amount of data. In addition, our data were heavily imbalanced in terms of human ratings (see Figure 1). For example, the ratings of most of the samples centered around level 2 for both languages. Moreover, the diversity of tasks and corresponding experiments should be taken into account: For each language, systems for ASR, as well as for speech rating on the holistic scale and classification on the analytic dimensions are needed. As a result, it was not possible to design a universal test set that would showcase real model performance in all our experiments. Therefore, cross-validation (CV) was used with the data split by speaker into four folds with no overlap between folds. One fold was used for testing in each training iteration of the 4-fold CV and the predictions on the test folds of the four models were aggregated when running the evaluation on the entire dataset.
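The speaker-disjoint split can be sketched as follows. This is only one possible way to obtain such folds, not the procedure used in the study: the greedy balancing strategy, the function name `speaker_folds`, and the toy speaker labels are assumptions made for illustration.

```python
from collections import defaultdict

def speaker_folds(speaker_ids, n_folds=4):
    """Assign each sample to a fold so that no speaker spans two folds."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    fold_sizes = [0] * n_folds
    fold_of = [None] * len(speaker_ids)
    # Greedy balancing: place the speakers with the most samples first,
    # each into the fold that currently holds the fewest samples.
    for spk in sorted(by_speaker, key=lambda s: (-len(by_speaker[s]), s)):
        k = fold_sizes.index(min(fold_sizes))
        for idx in by_speaker[spk]:
            fold_of[idx] = k
        fold_sizes[k] += len(by_speaker[spk])
    return fold_of

# Hypothetical speaker label for each recording.
speakers = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5"]
folds = speaker_folds(speakers)

# Each fold serves once as the test set; collecting the test-fold
# predictions of the four models covers the entire dataset exactly once.
splits = []
for k in range(4):
    test_idx = [i for i, f in enumerate(folds) if f == k]
    train_idx = [i for i, f in enumerate(folds) if f != k]
    splits.append((train_idx, test_idx))
```

Splitting by speaker rather than by recording prevents a model from being tested on a voice it has already seen during training, which would otherwise give optimistic results.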

SYSTEM DESIGN
Our main goal was to develop a system to help teachers assess spontaneous L2 Swedish and Finnish short utterances in an automatic or semi-automatic fashion. The automatic assessment system, shown in Figure 2, includes a Wav2vec2-based ASR model (Al-Ghezi et al., 2021) and five main scoring models that worked concurrently to produce a score for each analytic dimension (task completion, lexico-grammatical competence, pronunciation, and fluency) and predict the overall spoken language proficiency level. Each scoring model predicted individual scores using a set of textual and acoustic features (more details in Table 5). In addition to human-designed measures, deep acoustic embeddings from the ASR (Hidden Representations in Figure 2) were extracted. The task completion scoring model served two purposes: to filter out responses that do not pertain to a task, and to evaluate the content of a response.

Research question 1: ASR
For ASR experiments, publicly available pre-trained Wav2Vec2-Large (317 M parameters) models were used and fine-tuned using the L2 data. For Finland Swedish, the pre-trained monolingual Wav2Vec2 model fine-tuned by the data lab at the National Library of Sweden was used. The 11.5K hours of pre-training data include unlabeled speech from Swedish local radio broadcasts and audiobooks, while the labeled fine-tuning data is composed of Swedish Common Voice (Ardila et al., 2020), Nordisk Språkteknologi (NST), and Swedish local radio recordings.
For Finnish, a multilingual Wav2Vec2 model was used that had been pre-trained on the Uralic (Finnish, Estonian, and Hungarian) part of the VoxPopuli corpus (Wang et al., 2021). The full corpus consists of 400K hours of unlabeled speech from European Parliament plenary session recordings. The Uralic subset includes in total 42.5K hours of recordings, out of which 14.2K hours are Finnish speech. Before adapting the model to our target data, it was fine-tuned on a transcribed 100-hour subset of the Lahjoita puhetta (Donate Speech) corpus (Moisio et al., 2023), which consists of colloquial Finnish speech.
The models were fine-tuned by following the 4-fold CV setup described in the Data preparation section, resulting in four sub-models for each language. Each of them was trained for 20 epochs with a learning rate of 1e-4 and an effective batch size of 4. Table 2 summarizes the results of the ASR experiments. After aggregating the test results of the sub-models, 17.71%/9.08% WER/CER and 21.89%/7.06% WER/CER were obtained on the entire data for L2 Finland Swedish and L2 Finnish, respectively. Model predictions for ASR error analysis were made, and the findings are reported in the section Error analysis of ASR. In this study, character error rate (CER) was used in addition to WER, because in long words the CER reflects better the number of completely wrong words compared to small errors or mispronunciations. This is particularly relevant for agglutinative languages such as Finnish where the words are often quite long.
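The difference between the two error rates can be illustrated with a minimal sketch based on the Levenshtein edit distance. The helper names are hypothetical, the example sentence is invented around a misrecognition of the kind discussed in the error analysis ("tutustua" recognized as "tutustuaa"), and counting spaces as characters in the CER is an assumption; this is not the scoring tool used in the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate of one utterance."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate of one utterance (spaces counted as characters)."""
    return edit_distance(reference, hypothesis) / len(reference)

# One extra character makes a whole word wrong for WER,
# but only one character in fifteen wrong for CER.
utterance_wer = wer("haluan tutustua", "haluan tutustuaa")
utterance_cer = cer("haluan tutustua", "haluan tutustuaa")
```

For corpus-level figures such as those in Table 2, the usual convention is to sum the edit distances over all utterances and divide by the total reference length, rather than averaging per-utterance rates.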

Research questions 2 and 3: results of the scoring models
As discussed in the Data preparation section and shown in Figure 1, some rating dimensions included empty or heavily underrepresented categories, and modeling them for an automatic scoring system is impossible and impractical. Thus, the training and the evaluation were limited to sufficiently represented categories by cutting thin tails from the distributions. For example, for the Swedish holistic classifier, ratings from 2 to 5 were retained and the rest removed, since only 36 out of 1,542 recordings were in these removed categories. For the same reason, accuracy and range were aggregated into one analytic dimension, lexico-grammatical competence. The updated ranges for each dimension are reported in Tables 3 and 4. It should be noted that the values of the evaluation metrics are not comparable across dimensions or across datasets, which may have different scales.
Each scoring model (see Figure 2), except for task completion, was a neural classifier with six hidden layers of 300 hidden units, optimized with the Adam optimizer at a learning rate of 1e-3. The models were trained for 600 epochs with a batch size of 100.
Table note: Columns represent the developed models, the amount of fine-tuning data used, as well as the corresponding WER and CER.
The task completion model served two purposes: filtering and content evaluation. In the first step, the model checked if the transcript belonged to the predefined task. It used the cosine similarity metric to compare the embedding of the transcript to the centroid of each task and returned a binary output indicating whether the transcript belonged to the closest centroid. In the second step, the model compared a transcript to other responses of the same task and assigned it the score of its closest neighbor. Many tasks in our dataset had very imbalanced score distributions. For example, if the task scores were put into three bins, one Swedish task would contain 92 responses in the highest score bin and only two responses in the lowest score bin. Choosing more than one neighbor would leave our system no chance of giving out the underrepresented score interval.
For our procedure to be successful, the vector spaces needed to have the following properties. The task classification space should keep responses to the same tasks close to each other and far away from other responses. The space for content scoring of tasks should keep vectors of responses with similar score ranges close to each other and far away from responses in other score ranges. To get such vector spaces, monolingual BERT (Devlin et al., 2019) models were first fine-tuned in a Siamese manner (Reimers & Gurevych, 2019) to cluster responses from the same task together, provided with positive and negative examples of responses to the same task. For each response, one positive pair and five negative pairs were formed. The negative pairs were chosen from responses to tasks other than the task of the response that was closest to the current response in the vector space.
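The two-step task completion procedure described above (centroid-based filtering followed by nearest-neighbor scoring) can be sketched with plain vectors. This is an interpretation rather than the authors' code: the two-dimensional vectors stand in for mean-pooled BERT embeddings, the function names are hypothetical, and the filter is assumed to accept a response when its nearest task centroid is the task it was recorded for.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def belongs_to_task(embedding, expected_task, task_centroids):
    """Step 1 (filtering): is the nearest centroid the expected task?"""
    nearest = max(
        task_centroids, key=lambda t: cosine(embedding, task_centroids[t])
    )
    return nearest == expected_task

def nearest_neighbor_score(embedding, scored_responses):
    """Step 2 (content): copy the score of the single closest response."""
    _, score = max(scored_responses, key=lambda item: cosine(embedding, item[0]))
    return score

# Toy embeddings of previously rated responses: (vector, human score).
task_a = [([1.0, 0.1], 3), ([0.9, 0.2], 2)]
task_b = [([0.1, 1.0], 1), ([0.2, 0.8], 3)]
centroids = {
    "A": centroid([vec for vec, _ in task_a]),
    "B": centroid([vec for vec, _ in task_b]),
}

new_response = [0.8, 0.3]
on_task = belongs_to_task(new_response, "A", centroids)
content_score = nearest_neighbor_score(new_response, task_a) if on_task else None
```

Using a single neighbor mirrors the design choice explained above: with heavily imbalanced score bins, a majority vote over several neighbors would almost never return the rare scores.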
The task-clustering BERT models were trained for five epochs using Contrastive Loss (Chopra et al., 2005) with a margin of 0.5 and cosine similarity as the similarity metric, and representations were obtained with mean-pooling. Then, the resulting models were fine-tuned to place responses with similar scores together. To achieve this, for a response in our dataset, one positive and five negative examples were sampled from the same task as this response. Positive examples were responses that received scores from the same score bin. Negative examples were responses from other score ranges. The Swedish model was trained for five epochs, and the Finnish model was trained for two epochs using Contrastive Loss, a margin of 0.5, and cosine similarity as the similarity metric.
Tables 3 and 4 compare the results of human-human and machine-human evaluations of the five main scoring models for L2 Finland Swedish and L2 Finnish, respectively, using quadratic weighted kappa, Spearman's correlation, and mean absolute error (MAE) in addition to Precision (P), Recall (R), and F1 scores. It should be noted that the machine-human and human-human measurements were not exactly comparable, because human-human comparison was only possible on a smaller random subset that had more than one human score.
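For readers unfamiliar with the agreement statistics, the following sketch shows how quadratic weighted kappa and MAE can be computed for two sets of integer ratings. The function names and the toy ratings are illustrative assumptions and do not reproduce any value from Tables 3 and 4; the kappa is undefined (division by zero) in the degenerate case where both raters give every sample the same single rating.

```python
def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Cohen's kappa with quadratic disagreement weights for integer ratings."""
    n = max_rating - min_rating + 1
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    total = len(a)
    hist_a = [sum(row) for row in observed]        # marginal counts, rater a
    hist_b = [sum(col) for col in zip(*observed)]  # marginal counts, rater b
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2   # larger gaps cost more
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / total
    return 1.0 - weighted_observed / weighted_expected

def mean_absolute_error(a, b):
    """Average absolute difference between paired ratings."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical ratings on a 1-4 scale.
human = [1, 2, 2, 3, 3, 4]
machine = [1, 2, 3, 3, 3, 3]
kappa = quadratic_weighted_kappa(human, machine, 1, 4)
mae = mean_absolute_error(human, machine)
```

The quadratic weights penalize a two-level disagreement four times as heavily as a one-level disagreement, which suits ordinal proficiency scales where near misses are less serious than distant ones.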
Table 5 shows the top-performing expert-designed features. While some measures were important for one language only, others proved to be beneficial for both languages. It should be noted that expert-designed features were combined with deep neural acoustic representations for predicting the holistic score. These embeddings were not intuitive from a pedagogical standpoint, but were useful in practice (Al-Ghezi et al., 2023; Bannò & Matassoni, 2023).

ASR performance
The ASR model for L2 Finland Swedish achieved a WER/CER of 17.71%/9.08%, and the model for L2 Finnish achieved 21.89%/7.06%. For the ASR error analysis, the utterances from the test sets with the highest word and character error rates were analyzed. Table 6 shows some examples of ASR outputs with the corresponding reference transcriptions, as well as WERs and CERs.
As can be seen from the L2 Swedish examples, some hesitations or complete words were missing from the reference transcriptions, which resulted in relatively high error rates. For instance, the human transcription for example #1 missed the word "ledigheten" and its repetitions. In addition, the ASR model sometimes merged separate words (see example #4), possibly due to the lack of a language modeling (LM) component. Other possible reasons for that error are background noise in the recordings or a speaker's speech rate being too rapid.
Like the L2 Swedish ASR, the L2 Finnish ASR models often had high word and character error rates in sentences where words such as proper names were missing from the reference transcriptions. Also, single-character errors in words were quite common in L2 Finnish ASR outputs. For example, the words "tutustua" and "väsyttää" were recognized as "tutustuaa" and "väsyyttää" in sentence #5, and the words "psykologiaa" and "projektin" were recognized as "sykologiaa" and "projekkin" in utterance #7. In addition, the Finnish L2 ASR system recognized some words as completely different words. Even though no external language model was used in this work, these substituted words were grammatically correct but had different meanings. For instance, the word "levännyt" ("have had a rest") was recognized as "hävennyt" ("have been ashamed") in sentence #6.
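To make the relation between single-character errors and the two error rates concrete, the sketch below computes WER and CER with a standard Levenshtein edit distance. It is an illustrative implementation, not the scoring tool used in this work. Counting whitespace as a character in CER is an assumption, and the two-word strings contain only the words quoted above from sentence #5, not the full utterance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# One inserted character in an 8-character word.
print(cer("tutustua", "tutustuaa"))  # 0.125

# Two words, each with one extra character: every word counts as wrong.
reference = "tutustua väsyttää"
hypothesis = "tutustuaa väsyyttää"
print(wer(reference, hypothesis))  # 1.0
print(round(cer(reference, hypothesis), 3))
```

A single extra character makes a whole word count as an error, so utterances with many small spelling deviations can show a high WER while the CER stays comparatively low.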

Scoring models
Tables 3 and 4 show that the machine-human agreement was higher than the human-human agreement in nearly all analytical dimensions for both languages, the exception being the lexico-grammatical dimension (for both languages). One possible reason for the low performance of lexico-grammatical features was the absence of an external language model in the ASR, which not only led to minor, character-level spelling mistakes, but also led to inaccuracies in lexico-grammatical feature calculations. Lexical features such as TTR or types rely on existing words in external corpora, and any slight mismatch would affect the calculations. Similarly, features like depth root distance rely on dependency parsers which are trained on well-edited text. In addition, it should be noted that, unlike the other scoring models, the lexico-grammatical ones did not use any deep embeddings, which suggests that further experiments could improve model performance by incorporating neural textual representations, either alone or combined with the acoustic ones.
Higher machine-human agreement than human-human agreement was expected, since the scoring models were trained with the human average scores using a range of fluency, pronunciation, lexico-grammatical, and task achievement features. Using average scores may reduce human-related assessment bias, and modeling human rater behavior, such as analyzing rater severity/leniency, can further improve assessment reliability. Re-training our scoring models with fair average scores that are based on many-facet Rasch models (Linacre, 1989) would be a useful extension of the present study.
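The sensitivity of lexical features to character-level ASR errors, noted above, can be illustrated with a small sketch. The token lists and the lexicon are hypothetical toy data built from the misrecognitions quoted in the ASR error analysis, and the two functions are simplified stand-ins rather than the feature extractors used in this work.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def known_types(tokens, lexicon):
    """Distinct word forms that are attested in a reference lexicon."""
    return len(set(tokens) & lexicon)


# Hypothetical lexicon standing in for an external corpus vocabulary.
lexicon = {"projektin", "psykologiaa"}

# The same four spoken tokens, as transcribed by a human and by the ASR.
human_transcript = ["projektin", "projektin", "psykologiaa", "psykologiaa"]
asr_transcript = ["projektin", "projekkin", "psykologiaa", "sykologiaa"]

print(type_token_ratio(human_transcript))  # 0.5
print(type_token_ratio(asr_transcript))    # 1.0
print(known_types(human_transcript, lexicon), "of", len(set(human_transcript)))
print(known_types(asr_transcript, lexicon), "of", len(set(asr_transcript)))
```

In this toy case two one-character errors double the TTR and leave half of the recognized types unmatched in the lexicon, although the speaker's vocabulary did not change.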

Limitations of the system
The automatic speaking assessment system has several limitations that could be further addressed to improve its accuracy, reliability, and generalizability. As discussed previously, the ASR did not use an external language model, which may have resulted in some errors in the ASR outputs and negatively impacted the calculations of lexico-grammatical features. Another limitation was the system's lack of robustness against very noisy or otherwise low-quality recordings, which is a common issue for all automatic systems. Additionally, the scoring models were more accurate at distinguishing between intermediate levels than between extreme ones, which were not adequately represented in the collected data. Furthermore, the deep acoustic representations, while having the potential to complement interpretable human-designed features, may seem unusable from a pedagogical standpoint since they were not interpretable. Finally, due to the incorporation of multiple deep neural models, the system was computationally complex, which could affect real-time interaction with users. Therefore, future engineering work is required to compress the models or to use techniques such as knowledge distillation to reduce latency and improve computational efficiency.

CONCLUSIONS
This work focused on developing automatic speaking assessment systems for spontaneous L2 Finnish and L2 Finland Swedish. The steps involved in designing the assessment tasks, collecting and processing the training data, training and evaluating the ASR systems, as well as the scoring models for pronunciation, fluency, lexico-grammatical competence, and task completion, were discussed. Self-supervised deep acoustic models, including Wav2vec2, were utilized to develop ASR systems with a relatively small amount of training data, which is particularly useful in the context of low-resourced languages. The scoring models for analytic and holistic dimensions exhibited a high degree of human-machine agreement for the targeted skill levels, indicating their potential for automated speaking assessment. In addition, the high-performing expert-designed features were identified, and an additional type of feature, namely deep acoustic embeddings, was integrated.
Our work contributes to the development of automated speaking assessment systems, especially for low-resourced languages, which could provide benefits to language learners and educators. Future research could explore providing diagnostic feedback from automated speaking assessment systems to learners for the purposes of formative assessment.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The work was supported by the Academy of Finland [322619, 322625, 322965].

Tasks 2-8 are part of the B1 or B2 test for high school students. Tasks 10-13 are part of the A-level test for university students. The A-level tasks have been translated to English for students.

Figure 1. Distribution of speech samples between ratings. The horizontal axis represents averaged scores and the vertical axis represents the normalized number of samples.

Figure 2. Schematic diagram illustrating the components of the rating system.

Table 1. Characteristics of the rated Swedish and Finnish data.

Table 2. ASR experiments on L2 Finland Swedish and L2 Finnish.

Table 3. Comparison of human-human and machine-human evaluation metrics for Swedish scoring models.

Table 4. Comparison of human-human and machine-human evaluation metrics for Finnish scoring models.

Table 6. Example ASR outputs with corresponding human transcriptions, WERs and CERs.