Task-based language assessment: a compatible approach to assess the efficacy of task-based language teaching vs. present, practice, produce

Abstract Task-Based Language Teaching has been developed in response to the teacher-dominated, focus-on-forms methods such as Present, Practice, Produce (PPP). The body of literature is replete with studies examining the learning efficacy of the PPP approach versus TBLT; however, these studies did not use assessment tasks in comparing these two methods. To this end, the present study used an Assessment Task, a Grammaticality Judgment Test (GJT), and an Elicited Imitation Test (EIT) to compare the efficacy of PPP versus TBLT. Thirty-four lower-intermediate English language learners in Iran were randomly assigned to TBLT, PPP, and Control groups. The study results revealed that the performance of TBLT and PPP on the GJT and EIT significantly improved from pre-assessment to post-assessment, while the Control group did not show any significant improvements on any of the tests. Results indicated that only the TBLT group made substantial improvements in TBLA in the post-assessment, while the PPP and Control groups’ performance did not significantly improve.

learners" (p. 5). The core tenet of TBLT is that the prominent element in designing language curricula, lesson plans, and even assessments must be a task (Bygate, 2016; Ellis, 2009; Samuda & Bygate, 2008). More specifically, tasks have been defined as "the real-world activities people think of when planning, conducting, or recalling their day," such as responding to email messages, making a sales call, attending a lecture, or a business meeting (M. Long, 2015). Task-Based Language Teaching takes language learners' needs analysis as its starting point to determine the target tasks that language learners ultimately need to perform using the target language (Ellis, 2017). In TBLT, a task syllabus incorporates a number of pedagogical tasks that are designed to create a context for language learners similar to that of a real-life situation. Therefore, tasks play a pivotal role in designing, implementing, and evaluating a TBLT educational framework (Ellis, 2017; M. H. Long, 2016; M. Long, 2015).
The PPP approach follows three main steps: first, the target language items are presented to language learners through examples; then, language learners practice the target language in a strict, controlled manner through the use of exercises and drills. Finally, language learners are asked to produce the target item more spontaneously. M. Long (2015) holds that the PPP lesson structure includes the presentation of dialogues and reading comprehension passages which are geared towards the intended grammar of the lesson, then mechanical drills and written exercises are intensively practiced, and ultimately students are given a chance to practice more freely through what Long calls "pseudo-communicative language use" as opposed to authentic language use (p. 20). Thus, the main focus of PPP is to elicit accurate target language production from day one (Shintani, 2013).
Over time and with the advancement of the field of second language acquisition, the PPP approach and its supporting arguments have been criticized. First, the outcome of the PPP approach produced language learners who had poor skills in communication, and only a certain group of gifted students reached high levels of proficiency through this approach (Skehan, 1996). Thus, the PPP approach failed to meet the high levels of achievement in all four skills. Second, the underlying theory of the PPP approach has been repudiated. The idea that habit formation can help learners master a language has been devalued by shifting towards more cognitively oriented approaches. Furthermore, the premise that language learners can master a language broken down into bits has been widely questioned in the field of second language acquisition. Skehan (1996) holds that simply presenting the language learners with the language does not guarantee the acquisition as learners' process of internalizing the language is more complex than that. Additionally, the PPP approach seems to ignore the role of interlanguage in the learning process. Too much emphasis on accuracy at the cost of losing fluency is another major drawback of the PPP approach.

Task-Based Language Assessment (TBLA)
In the early 1990s, the field of education called for a new paradigm of assessment, widely known as alternative assessment, which measured and emphasized the ability to use knowledge; attempts such as developing performance assessments were made to meet this goal (Norris, 2016; Norris & East, 2021). Along the same line, in the field of language education and with the ever-increasing popularity of TBLT, there was a need for an alternative assessment that measured the ability of language learners to apply language in the performance of authentic tasks rather than simply measuring rote memorization and discrete facts (Norris, 2016). To this end, TBLA, also known as task-centered assessment, task-based language performance assessment, and task-based assessment, was developed (Bachman, 2002; Mislevy et al., 2002; Norris, 2002; Norris et al., 2002). Task-Based Language Assessment is a methodology of assessment that measures the ability of language learners to express and understand meaning to achieve a certain goal or outcome within an authentic and communicative context (Norris, 2016; Norris & East, 2021).
TBLA is concerned with the language learners' ability to apply their language knowledge to accomplish the goal and purpose of the task, similar to how they use their first language to get things done in daily life. The tasks can range from simple daily activities, e.g., ordering pizza in a restaurant and responding to emails, to more professional activities, such as listening to an academic lecture in class and taking notes. Thus, TBLA was developed to fill the gap of an appropriate assessment that measures what language learners can do in the target language. In other words, what distinguishes TBLA from other modes of assessment is that performance is part of the assessment construct in TBLA (Long & Norris, 2000). Brindley (2013) puts forth several advantages of TBLA, with particular attention given to classroom-based assessment. He states that TBLA directs teachers' and students' attention to using language as a tool for communication rather than focusing on language knowledge as an end, which is the case with most traditional language testing methods. Additionally, "TBLA integrates learning process and assessment through the use of attainment targets directly linked to course content and objectives" (Brindley, 2013, p. 2). TBLA lays the groundwork for language learners to receive diagnostic feedback that compares their task performance with clear performance criteria. Furthermore, TBLA reports assessment outcomes in terms of performance that is comprehensible to non-specialists. This fosters communication between the people who want to use performance information and the educational institutions (Brindley, 2013). Norris (2016) argues that TBLA provides the opportunity to examine multiple aspects of language ability and development, such as accuracy, fluency, complexity, procedural knowledge, and pragmatic proficiency through a single performance.
He argues that TBLA has positive washback effects in that it prompts educators and teachers to reconsider how teaching and learning happen. As an illustration, after TOEFL iBT was introduced using academic tasks, the emphasis of instructional approaches in Europe shifted towards teaching language skills and strategy. Despite these advantages, Norris (2018) highlights several challenges in using TBLA, such as task selection, replication of authentic context, assessment of task performance, and generalization of the task performance to real-life situations. Norris and East (2021) contend that TBLA as an assessment approach must, for validity reasons, reflect the tenets of its teaching approach, i.e., TBLT. They further argue that there should be a "constructive alignment between learning and outcomes" by using tasks to create a reciprocal relationship between teaching and assessment approaches (Norris & East, 2021, p. 517). This premise motivated the present study, which holds that the results of studies using any other means of assessment to evaluate the effectiveness of TBLT should be interpreted with caution. Many studies examining the effectiveness of TBLT vs. PPP have not used TBLA to evaluate the efficiency of TBLT. For instance, Li et al.'s (2016) study only used tests typical of the PPP methodology to assess the effectiveness of TBLT, which is bound to yield results that show only part of the whole picture, if not a distorted one. In essence, the GJT and EIT are not designed to assess the learners' communicative competence; therefore, they are not a suitable means to assess the effectiveness of Task-Based Language Teaching. In other words, PPP and TBLT should each be assessed with their appropriate tests to obtain a more comprehensive and realistic comparison of the two language teaching methodologies.
To this end, the present study added Assessment Tasks to its assessment package alongside the GJT and EIT to obtain more realistic and complete results. The study evaluated the TBLT and PPP methods in terms of their instructional effectiveness using their respective assessments; that is to say, TBLT was assessed by TBLA and PPP by the GJT and EIT. Previous research did not use compatible assessments in the evaluation of TBLT vs. PPP (e.g., González-Lloret & Nielson, 2015; De La Fuente, 2006; Lai et al., 2011; Li et al., 2016; De Ridder et al., 2007; Shintani, 2011, 2013). More specifically, Li et al. used the GJT and EIT, which are typical assessments for the PPP approach, to examine the effectiveness of TBLT, although these tests are not suitable for evaluating the effectiveness of TBLT. To address this problem in the literature, the researcher compared the differential effects of these two language teaching methodologies by using their respective assessments.
In the present study, the GJT and EIT were supposed to measure students' explicit and implicit knowledge, respectively. Assessment Tasks were used to investigate the students' ability to use the language in the context of language use. The GJT was used to assess how well the students were aware of the grammatical rules of the target feature. The EIT was used to assess how fast the students could recognize and use the target structure; this required that the students have unconscious and automatic knowledge of the target structure. Last but not least, the study used Assessment Tasks to examine the students' application of the target linguistic feature in the context of language use. Using these different types of tests yields a more comprehensive picture of the effectiveness of these methodologies.

Review of the literature
Several studies have compared the effects of TBLT instruction with traditional PPP instruction (e.g., González-Lloret & Nielson, 2015; De La Fuente, 2006; Lai et al., 2011; Li et al., 2016; De Ridder et al., 2007; Shintani, 2011, 2013). De La Fuente's (2006) classroom-based study examined the differential effects of TBLT and PPP on the acquisition of meanings and forms of vocabulary. The study focused on the effects of two L2 vocabulary task-based lessons, one of which utilized an explicit, teacher-generated focus-on-forms component (TB-EF) and the other one without a teacher-generated focus-on-forms component (TB-NEF), on the acquisition of meanings and morphological aspects of L2 words. More specifically, the study attempted to determine whether TBLT lessons are more effective than PPP lessons in enhancing the learning of L2 vocabulary and morphological aspects. The research presented 30 students with the treatment after they had finished 43 hours of communicative L2 instruction in Spanish. In the PPP approach, students were presented with a dialogue, then read the dialogue out loud; finally, they were asked to perform a role play. The TB-NEF lesson involved a pre-task, task cycle, and task repetition phase. The results indicated that the task-based lessons with a built-in, planned focus on form were more beneficial than PPP lessons since they provided students with more opportunities for negotiation of meaning, output production, and online retrieval of target words.
De Ridder et al.'s (2007) study examined the effects of a task-based approach on improving L2 learners' automaticity. The study sample included 68 intermediate-level students of Spanish as a foreign language for Business and Economics at the University of Antwerp. The participants were randomly assigned to two groups: a control group of 35 students and an experimental group of 33 students. Both the control and experimental groups had to complete four stages of the course. The first three stages were the same, where both groups were presented with systematic or focus-on-form components: presentation, explanation, and exercises. The fourth stage was different for the experimental group, who attended 10 hours of task-based instruction called prácticas comunicativas. The study concluded that the task-based approach stimulated the process of automatization better than a purely communicative course with a robust systematic component. Shintani's (2013) study investigated the differential effects of input-based focus on form (FonF) and production-based focus on forms (FonFs) on learners' vocabulary acquisition. In the study, FonFs was operationalized through the present-practice-produce (PPP) approach, while FonF was implemented via a task-based method. The participants were 45 six-year-old L2 learners of English from Japan with no prior experience of English language instruction. Five oral activities were carried out in the FonFs condition following a PPP approach. For instance, in one of the activities, the students were required to repeat individual words, and in another activity the students were asked to repeat words on flash cards both individually and in groups. Three tasks were used in the FonF condition, which could only be completed by understanding the input. The tasks involved the learners listening to the teacher's orders and responding by choosing the correct card and putting it in the right holder.
To assess students' performance, Shintani used two tests: a discrete-item word production test and a "Same or Different" task test. The discrete-item test required that students name the target vocabulary item on 24 flash cards. The "Same or Different" task involved a pictured sheet where students were asked to answer the researcher's questions, such as "What color is it?" or "My soap is pink. Is your soap pink?" so they could figure out if their pictures were the same or different. Shintani found that the FonFs group significantly outperformed the FonF group in both the discrete-item test and the "Same or Different" task on both the immediate and delayed post-tests.
González-Lloret and Nielson's (2015) study examined the effectiveness of a task-based Spanish program implemented at the United States (U.S.) Border Patrol Academy (BPA). The researchers hypothesized that the TBLT group would perform better than grammar-based students on fluency, lexical complexity, and syntactic complexity measures, whereas the grammar-based students would outperform TBLT students on grammatical accuracy. The participants included 20 students from the TBLT course and 19 from the grammar-based group. The results of the study showed that the TBLT students performed significantly better than grammar-based students on measures of fluency and structural complexity. However, in terms of lexical complexity, no significant differences were found. Likewise, there were no significant group differences in grammatical accuracy. Li et al.'s (2016) study compared the differential effects of task-based and task-supported language teaching (TSLT) instruction on the acquisition of the English passive structure. The researchers examined the effects of different task implementation procedures, operationalized as Focus on Meaning, TBLT, and TSLT, on students' explicit and implicit knowledge of the passive structure. The participants of the study were chosen from five eighth-grade classes, with 55 to 60 students each. Thirty students were then randomly assigned to five groups: one control group and four experimental groups. The experimental groups were presented with a two-hour treatment during which they had to do two dictogloss tasks in which the passive structure was used. The experimental groups had four different instructional conditions: (a) Focus on Meaning (FoM), (b) TSLT, (c) Focus on Form (or pure TBLT), and (d) the "stronger" version of TSLT. The control group took only the pre-test and post-test. The researchers used a GJT and an EIT to measure learning.
Results of the study indicated that there were limited effects for the FoM condition on students' learning of the passive structure. Following Sheen (2003) and Swan's (2005) criticism of the lack of enough empirical evidence to support the higher efficiency of TBLT over the traditional focus-on-forms approaches of language teaching, such as PPP, there was a necessity to conduct more studies to delve more into this issue. To this end, the present study aimed to build upon the previous literature on this issue (e.g., González-Lloret & Nielson, 2015; De La Fuente, 2006; Lai et al., 2011; Li et al., 2016; De Ridder et al., 2007; Shintani, 2013). A summary of these studies is presented below in Table 1.

Research questions
The research questions of the study focus on three concepts: (a) comparison of the two language teaching methodologies, i.e., PPP and TBLT; (b) the type of assessment used in the evaluation of PPP vs. TBLT, i.e., the GJT, the EIT, and the Assessment Task; and (c) the type of knowledge that language learners mastered, i.e., declarative and automated knowledge. The research questions of the study are presented in detail below.
(1) Do PPP and TBLT treatments have different effects on learners' performance of the GJT (explicit/declarative knowledge) compared to the control group?
(2) Do the PPP and TBLT treatments have different effects on learners' performance of the EIT (implicit/automated knowledge) compared to the control group?
(3) Do the PPP and TBLT treatments have different effects on learners' performance of Assessment Task compared to the control group?

Participants
The participants of the study were chosen from an English Language Institute entitled Parsian Language School in Mazandaran province, located in northern Iran. Parsian Language School is a privately-owned language institute that teaches the English and French languages. The study included 18 female and 16 male Iranian English as a Foreign Language (EFL) learners. The English proficiency level of all students had already been assessed by the Parsian Language School upon their entrance into the school, as they were obliged to take the placement test prior to the commencement of their English courses. The participants of the study had been studying in the Parsian School for at least three semesters and were at the intermediate level of proficiency at the time of research. The learners' proficiency was measured by the end-of-the-semester oral and written achievement exams measured by quick and progressive tests adopted from the American English File series published by Oxford University Press. The instructor in the language school had the leeway to alter the test, i.e., adjust, add to, or take away from it, to better meet the needs of their class. The participants ranged from 15 to 32 years of age. The participants were sampled because of their availability and convenience of participation. The students of Parsian School had not been taught the target feature of the study prior to the commencement of this research. The participants' grouping was based on the class list of the school; that is, the students were sampled conveniently based on their school's class lists. To ensure the students' voluntary participation in the study, the researcher required them to read a synopsis of what the study was about and then sign a written consent form to participate in the study.
Additionally, two instructors at the Parsian institute cooperated in the study. Both instructors were in the midst of their doctoral studies in the field of applied linguistics. They had both taken courses in language teaching methodology and Task-Based Language Teaching as part of their doctoral studies; therefore, both were familiar with the TBLT and PPP methods. The instructors were male and in their mid-thirties. They had over ten years of experience in teaching English as a foreign language in the Iranian context. The teachers were given instructions on how to administer the tests and how to carry out the treatments, especially the task implementation, according to Willis and Willis's (2007) model. One instructor undertook the TBLT treatment, and the other did the PPP treatment.

Design
The study used a quasi-experimental design with a pre-assessment and an immediate post-assessment. The study was conducted in two phases: (a) the students were administered the pre-assessment, and (b) the students were given the instructional treatment and then were immediately given the post-assessment. The assessments included a GJT and an EIT, which measured students' declarative and automated knowledge, respectively, and an Assessment Task that aimed to assess students' application of the target language in the context of authentic language use. The GJT and EIT were the same across the pre-assessment and post-assessment except that the order of items was changed to avoid a practice effect on the tests.

Treatment
The treatment for the TBLT group included two focused tasks. The first one was a picture task, including nine items. Each item in this task included two pictures indicating their change over time.
The task story was about a hypothetical character named Rebecca and her brother, who had recently moved away; a lot had changed in their house since then. Rebecca was going to write her brother a letter explaining what changes she had made since he moved. The second treatment task required the students to read a note left by the mother of the story's character, Cindy, asking her to do some chores while she (Cindy's mom) was away for some days. The students were asked to work in pairs and help Cindy write a text message to her mom, reporting that she had done every single task she had been asked to do.
The TBLT instructor used Willis and Willis's (2007) model to implement the tasks. This model includes the following phases: a pre-task, task cycle, and language focus. In the pre-task phase, the instructor activated students' background knowledge through a warm-up to performing the task, which was done by asking questions relevant to the topic of the task. In this phase, the instructor ensured that the students understood the task's instructions by explaining them thoroughly. In the task cycle, the students first performed the task, and then they prepared to report to the class, either orally or in a written mode, about how they planned to undertake the task. Afterwards, in the third phase of the task cycle, i.e., the report stage, some pairs or groups were selected to report how they planned the task to the whole class. In the language focus phase of Willis and Willis's model, the instructor followed two stages: analysis and practice. In the analysis stage, the instructor examined and talked about the target features of the task, namely the past passive voice. Finally, in the practice stage, he dwelled on the linguistic forms by reviewing the words, past passive voice, and patterns in the task, trying to direct students' attention to the intended linguistic features of the task (Willis & Willis, 2007).
The students in the PPP group went through 30 minutes of instruction on the past passive voice, and then they had to practice the structure through the following activities: (a) a discrete-item activity including 10 items, where the students were required to convert the verb in parentheses to the passive voice to complete each item; (b) a cloze passage including seven blanks, where students had to read the passage and complete each blank by choosing the correct form from the two options provided in parentheses; and (c) a transformation activity including 10 complete sentences in the active voice, which the students had to change into the passive voice. As for the production phase of the PPP treatment, activities such as repetition, substitution, restoration, and backward build-up drills were used.
The treatment for the control group consisted of five reading passages taken from a reading book entitled Intermediate Steps to Understanding, authored by Hill (1980). The passages were at the intermediate level of proficiency according to the book. At the end of each of the five passages, there were comprehension questions that students needed to answer. The students were asked to read the passage and then answer comprehension questions. The instructor followed up by reading the passage to students, explaining difficult words, and making sure the students comprehended the passage.

Data collection
The researcher chose a pool of English Language Learners at the English Language Institute (ELI) at Florida International University for the pilot study. The classes were held on weekdays except for Friday. The researcher used three classes on Monday, Tuesday, and Thursday mornings. The instructors of the classes were kind enough to offer their whole class time to the data collection. The students in the ELI classes took all three tests of the study in the language laboratory of the institute. They first took the GJT, then they were recorded doing the EIT, and finally, they took the Assessment Tasks. The pilot study revealed some minor issues, such as vague instructions. In addition, the pilot study confirmed that an eight-second interval between EIT items would be enough time to respond to and repeat each item. Having fixed these issues, the researcher administered the pre-assessment tests to the three groups of PPP, TBLT, and Control.
The pre-assessment session was held in the English Language Institute. The three classes for the pre-assessment included students studying the English language book entitled "Interchange" at the intermediate level in this institute. The three classes were randomly assigned to three groups of Control, PPP, and TBLT by pulling their names out of a hat. The students of the TBLT group were the first to receive the pre-assessment. Pre-assessment and post-assessment involved a GJT, EIT, and Assessment Task administered to students. The GJT and EIT were borrowed from Li et al.'s (2016) study, as the present study is a quasi-replication of it, but the researcher designed the Assessment Task to specifically distinguish it from other studies in the body of literature.
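The hat-draw assignment of intact classes to conditions described above can be mimicked in a few lines. This is only an illustrative sketch, not the procedure used in the study: the class labels, the function name `assign_classes`, and the seed are all hypothetical.

```python
import random


def assign_classes(class_names, conditions, seed=None):
    """Randomly pair each intact class with exactly one condition."""
    if len(class_names) != len(conditions):
        raise ValueError("need exactly one condition per class")
    rng = random.Random(seed)      # seeded only so the draw can be reproduced
    shuffled = list(conditions)
    rng.shuffle(shuffled)          # equivalent to drawing names out of a hat
    return dict(zip(class_names, shuffled))


# Hypothetical class labels; the source does not name the three classes.
assignment = assign_classes(
    ["class_1", "class_2", "class_3"],
    ["TBLT", "PPP", "Control"],
    seed=2024,
)
print(assignment)
```

Because whole classes rather than individual learners are drawn, each condition receives exactly one class, which is consistent with the quasi-experimental design reported in the study.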
The instructor spent the Monday class administering the assessments to the students of the TBLT group. He explained to the students that they had to judge the grammaticality of 40 items. He then distributed the GJT among the 13 students of the class. The students were asked to start their second test, i.e., the EIT, upon completing the GJT. The PPP instructor followed the same procedure for the PPP and Control groups. Both instructors ensured that the students knew what they were required to do by the tasks during the pre-assessment and post-assessment phases.
The GJT was an untimed test including 40 items whose grammaticality was to be judged by students. This test was supposed to assess the students' explicit knowledge of the target feature, i.e., the knowledge of the rules and regulations of the past passive voice in English (See Appendix A). Out of those 40 items, 30 were about the target structure, i.e., the past passive structure in English, and 10 were distractors; that is, the sentences using structures other than the past passive voice so that the students could not realize the structure which was assessed. The students were asked to put a "C" in the blank in front of the sentences they believed were correct and an "I" in front of the incorrect ones. Additionally, they were asked to write the correct English version of those sentences they believed were incorrect. Students received one point for answering each item correctly. The distractors did not count in the scoring of the GJT.
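The GJT scoring rule described above (one point per correctly judged target item, distractors ignored) can be sketched as follows. The names `GJTItem` and `score_gjt` are hypothetical, and the sketch assumes that the point depends on the C/I judgment alone, since the source does not state whether the written correction was also required for credit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GJTItem:
    item_id: int
    answer_key: str       # "C" (correct sentence) or "I" (incorrect sentence)
    is_distractor: bool   # distractors are administered but never scored


def score_gjt(items, responses):
    """Return the number of target items judged correctly.

    `responses` maps item_id -> "C" or "I"; unanswered items earn no point.
    Assumption: only the judgment is scored, not the written correction.
    """
    score = 0
    for item in items:
        if item.is_distractor:
            continue
        given = responses.get(item.item_id, "").strip().upper()
        if given == item.answer_key:
            score += 1
    return score


# Toy form with 3 target items and 1 distractor (the real test has 30 + 10).
items = [
    GJTItem(1, "C", False),
    GJTItem(2, "I", False),
    GJTItem(3, "I", False),
    GJTItem(4, "C", True),
]
responses = {1: "C", 2: "c", 3: "I", 4: "I"}
print(score_gjt(items, responses))  # 2
```

With the full 40-item form, the same function would yield a score between 0 and 30, because the 10 distractors never contribute.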
The EIT was designed to assess the learners' automated knowledge of the target structure (See Appendix B). The time taken for students to react to the test item indicates their automated knowledge of the form. The students were required to listen to the recording of a native speaker of English reading 35 items. There was an eight-second time interval between each item, which had been determined through a pilot study, after which students' reaction time to answer each item was calculated. Five of the 35 items were distractors; that is, they did not assess the target structure and were not counted in the analysis. The EIT required students to listen to each item played by the digital voice recorder, determine whether each item was true of their life, and then repeat the item in correct English within eight seconds. As an example, for the item "My grandfather was treated at the hospital last week," the students had eight seconds first to indicate if this had happened to their grandfather by saying yes or no and then to repeat that sentence in correct English. The students received one point for answering each item correctly; thus, the test was scored on a scale of 0 to 30.
The Assessment Task was a narrative task that required the students to read a hypothetical short story about a robbery in Miami, USA (See Appendix C). The story used "active voice" to explain the robbery scene and what transpired during the theft. The narrative used the simple past tense to recount the robbery. The Assessment Task asked the students to imagine that they were reporters for a school magazine, and they were to replicate a story about a robbery that they read in the morning in their local newspaper. The task required the students to change the story's style using the past passive voice to avoid plagiarizing or copying the original story. The total number of sentences to be changed into the passive voice was 17, and students received one point for accurately changing each active voice structure.
The post-assessment tests were the same as the pre-assessment tests; the only difference was that the order of the items for the tests of the GJT and EIT was changed for the post-assessment to prevent the practice effect. In the post-assessment session, the instructor administered the GJT first, where the students were required to judge the grammaticality of 40 items. If they found errors, they had to write the correct form underneath each item. Having administered the GJT, the instructor asked the students of each group to take the EIT. The EIT test took about five minutes as the students had eight seconds to answer and repeat 35 items. Afterwards, the narrative task was administered to each group. The instructor ensured that students understood the task's instructions well by explaining the tasks and asking warm-up questions.
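Two details of the post-assessment can be checked with a short sketch: producing a re-ordered form that contains exactly the same items, and the arithmetic behind the roughly five-minute EIT. The function names and the seed are hypothetical, and the duration counts only the eight-second response windows, not the playback time of the recorded items.

```python
import random


def reorder_items(item_ids, seed):
    """Return the same items in a new order for the post-assessment form."""
    rng = random.Random(seed)
    reordered = list(item_ids)
    rng.shuffle(reordered)
    # Re-shuffle in the unlikely event that the order comes back unchanged.
    while reordered == list(item_ids) and len(set(reordered)) > 1:
        rng.shuffle(reordered)
    return reordered


def eit_duration_seconds(n_items=35, window_seconds=8):
    """Total response time: one fixed window per item."""
    return n_items * window_seconds


pre_form = list(range(1, 36))               # item numbers 1..35
post_form = reorder_items(pre_form, seed=7)
print(sorted(post_form) == pre_form)        # True: same items
print(post_form != pre_form)                # True: different order
print(eit_duration_seconds() / 60)          # 280 s, a little under 5 minutes
```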

Test of normality
To analyze the data, a Shapiro-Wilk test was first run to assess the normality of the data. The results of the Shapiro-Wilk test revealed that the data for the GJT in the pre-assessment, the EIT in the post-assessment, and the Assessment Task in the post-assessment were not normally distributed (p < .05). As for the pre-assessment analysis, the Kruskal-Wallis test indicated that there was no significant difference among the three groups of TBLT, PPP, and Control on any of the tests, i.e., the GJT, the EIT, and the Assessment Task. The results of the Kruskal-Wallis test are shown below in Table 2.
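For readers unfamiliar with the Kruskal-Wallis test, the following is a minimal standard-library sketch of the H statistic for three independent groups, with the usual tie correction. It is not the software used in the study, the scores are hypothetical, and the function names are illustrative. The p-value relies on the chi-square approximation, which for three groups (2 degrees of freedom) reduces to exp(−H/2) and is only approximate for small samples.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_three_groups(groups):
    """Return (H, p) for exactly three independent groups."""
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups")
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    p = math.exp(-h / 2)   # chi-square survival function with df = 2
    return h, p


# Hypothetical pre-assessment scores for three small groups.
tblt = [14, 17, 15, 18, 16]
ppp = [15, 16, 14, 19, 17]
control = [16, 15, 18, 14, 17]
h, p = kruskal_wallis_three_groups([tblt, ppp, control])
print(round(h, 3), round(p, 3))
```

A non-significant result at the pre-assessment, as reported in the study, would correspond to a p-value above .05 here.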
Since the data were not normally distributed, a Wilcoxon Signed Rank Test was run to compare each group's pre-assessment and post-assessment performance on the GJT, the EIT, and the Assessment Task (Hollander & Wolfe, 1999).
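For readers who wish to see how a within-group comparison of this kind is computed, the following is a minimal sketch of the Wilcoxon Signed Rank Test using the normal approximation with a tie correction. It is not the analysis script used in the study. The function names and the sample scores are hypothetical, and the sketch assumes the common convention of reporting z from the smaller of the two rank sums, so that z is never positive.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_z(pre, post):
    """Return (W, z), where W is the smaller signed-rank sum.

    Uses the normal approximation with a tie correction and no
    continuity correction. Zero differences are dropped.
    """
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    # Variance under H0, reduced by sum(t^3 - t) / 48 over groups of tied ranks.
    tie_counts = {}
    for r in ranks:
        tie_counts[r] = tie_counts.get(r, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values()) / 48
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term
    return w, (w - mean_w) / math.sqrt(var_w)


# Hypothetical pre-test and post-test scores for one small group.
pre_scores = [10, 12, 9, 14, 11, 13]
post_scores = [12, 15, 10, 13, 16, 17]
w_stat, z_stat = wilcoxon_signed_rank_z(pre_scores, post_scores)
```

With samples as small as those in the present study, an exact p value is often preferred to the normal approximation; the sketch is meant only to show where the z statistic comes from.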

The grammaticality judgment test
The means and standard deviations of all groups on the GJT pre-test and post-test are displayed in Table 3. The mean scores in Table 3 indicate that the TBLT group had the largest increase in mean score, followed by the PPP group.
The results of the Wilcoxon Signed Rank Test revealed a significant difference in the performance of the TBLT and PPP groups from the pre-test to the post-test (z = −2.52 and z = −2.15, respectively, p < 0.05). The effect size value suggests a moderate practical significance for both TBLT and PPP on the GJT, d = .50. As for the Control group, the Wilcoxon Signed Rank Test failed to show any significant difference in performance between the pre-test and the post-test (z = −0.73, p > 0.05). The effect size value on the GJT for the Control group suggests little practical significance, d = .15. The results of the Wilcoxon Signed Rank Test for all of these groups are shown below in Table 4. The post hoc analyses revealed that the statistical power for this study was relatively small at 0.50. Additionally, the statistical power for the PPP group, with an effect size of .51, and the Control group, with an effect size of .17, was .47 and .05, respectively, as shown below in Table 4.
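The text does not state how the effect sizes were derived from the test statistics. One widely used convention for the Wilcoxon Signed Rank Test is r = |z| / √N, where N is the total number of observations across both measurement occasions. The sketch below illustrates that convention only; the function name is hypothetical, and N = 24 is an assumed value chosen for illustration, not a figure reported in the study.

```python
import math


def wilcoxon_effect_size_r(z, n_observations):
    """Effect size r = |z| / sqrt(N) for a Wilcoxon Signed Rank Test.

    n_observations is the total number of observations, i.e. twice the
    number of participants in a pre-test/post-test design.
    """
    if n_observations <= 0:
        raise ValueError("n_observations must be positive")
    return abs(z) / math.sqrt(n_observations)


# z is taken from the GJT result reported for the TBLT group;
# N = 24 is a hypothetical value (12 participants measured twice).
r_tblt = wilcoxon_effect_size_r(-2.52, 24)  # about 0.51
```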

The elicited imitation test
The mean of each group's performance increased on the EIT, from a range of 16.75 to 17.30 on the pre-assessment to a range of 17.33 to 18.61 on the post-assessment. Additionally, the standard deviation showed more variation for the PPP and Control groups, ranging from 2.87 to 4.01 on the pre-assessment and from 3.44 to 4.16 on the post-assessment. Table 5 below provides the descriptive statistics for the EIT.
The Wilcoxon Signed Rank Test was run for the EIT assessment, and the results showed a significant effect for the TBLT and PPP groups but not for the Control group. The test confirmed a significant effect for the TBLT group on students' performance on the EIT, z = −2.58, p < 0.05, with a moderate practical significance, d = .54. The PPP group also improved significantly in the use of the target structure, z = −2.20, p < 0.05, with an effect size of moderate significance, d = .45. Nonetheless, the results of the Wilcoxon Signed Rank Test did not yield any significance for the Control group, z = −.660, p > 0.05. The results of the Wilcoxon Signed Rank Test are shown in Table 6. The post hoc power analysis revealed that the statistical power for the TBLT, PPP, and Control groups was .55, .41, and .05, respectively.

Assessment task
The descriptive statistics for the Assessment Task indicated an improvement in the mean of all groups, from a range of 12.87 to 13.61 on the pre-test to a range of 13.58 to 15.20 on the post-test. Additionally, the standard deviation of the groups showed less variation over time for all groups, ranging from 1.83 to 2.16 on the pre-test and from 1.60 to 1.99 on the post-test. The descriptive statistics of each group's performance on the Assessment Task are shown below in Table 7. The Wilcoxon Signed Rank Test indicated significant results only for the TBLT group on the Assessment Task, z = −2.76, p < 0.05, while there were no significant results for the PPP and Control groups. The effect size value on the Assessment Task for the TBLT group suggested a moderate practical significance, d = .54. The Wilcoxon Signed Rank Test for the PPP group revealed that their performance narrowly missed the level of significance, z = −1.93, p > 0.05. The effect size value on the Assessment Task for the PPP group indicated a small to moderate practical significance, d = .39. As for the Control group, the Wilcoxon Signed Rank Test showed no significant difference between the students' performance on the pre-test and the post-test (M = 12.87, SD = 2.16), z = −1.34, p > 0.05. The results of the Wilcoxon Signed Rank Test are shown in Table 8. The post hoc power analysis revealed that the statistical power for the TBLT, PPP, and Control groups was .55, .34, and .22, respectively.

Post-assessment analysis
Finally, the researcher conducted a Kruskal-Wallis test once more to see whether the treatment was effective enough to have produced any difference among the three groups. In other words, whereas the Wilcoxon Signed Rank Test assessed within-group differences, the Kruskal-Wallis test aimed at finding between-group differences. The Kruskal-Wallis test indicated no significant difference among the TBLT, PPP, and Control groups on any of the three tests (the GJT, the EIT, and the Assessment Task) (p > .05). The results of the Kruskal-Wallis test are shown below in Table 9. The Wilcoxon Signed Rank Test revealed that the treatment was effective in helping the TBLT group on all the tests and the PPP group on only two tests, the GJT and the EIT. Therefore, based on the data analysis, we can conclude that the treatment was effective, but only to the extent that there was a within-group improvement. The treatment was not strong enough, however, to produce any between-group difference.
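The between-group comparison described above can be sketched as follows. This is a minimal illustration of the Kruskal-Wallis H statistic with a tie correction, not the procedure the researcher actually ran. The function names and the scores are hypothetical, and the p value helper applies only to a three-group design, where the chi-square reference distribution has two degrees of freedom and its upper tail reduces to exp(−H/2).

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted_sum = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_sum += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * weighted_sum - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_three_groups(h):
    """Chi-square upper tail for df = 2 (three groups only)."""
    return math.exp(-h / 2)


# Hypothetical post-assessment scores for three small groups.
tblt = [15, 16, 14, 17]
ppp = [14, 13, 15, 16]
control = [13, 14, 12, 15]
h_stat = kruskal_wallis_h(tblt, ppp, control)
p_val = p_value_three_groups(h_stat)
```

The chi-square approximation is rough for groups this small, which is one reason a non-significant between-group result should be read alongside the within-group tests rather than in place of them.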

Discussion
The results of the study indicated that, overall, the TBLT group showed better performance than the PPP and Control groups, in that order. The results could be explained from different perspectives. First, the TBLT condition seems to have been more effective than the other conditions in fostering attention both to the linguistic features and to the meaning to be conveyed. An examination of the results of the study on all the tests confirmed that TBLT was effective in improving students' grammatical accuracy by having them focus on form, which occurred incidentally when students' focal attention was on meaning conveyance.
In essence, the results indicate that the performance of both the TBLT and PPP groups was significantly better in developing explicit knowledge than that of the Control group. As far as explicit knowledge is concerned, the results of the study suggest that the TBLT group showed a more significant result than the PPP and Control groups. One rationale for this result could be that the language focus part of the TBLT treatment prepared the students of the TBLT group to outperform the other groups. Besides, the students in the TBLT group benefited from the teacher's feedback, which was, to a large extent, linguistically directed. More importantly, due to the use of focused tasks in the TBLT treatment, the students were frequently directed to the linguistic feature in a context where their focal attention was on meaning conveyance. This constant encounter seems to have helped them pay attention to the linguistic feature and, therefore, master it. Nassaji and Fotos (2011) contend that when learners become conscious of linguistic features, they notice them in subsequent communicative input. Such noticing helps them restructure their implicit knowledge of the linguistic feature.
As for automated knowledge, the TBLT group performed better than the other groups. As mentioned above, automated knowledge leads to a quick response, which is typically unconscious and automatic. The explanation for the outperformance of the TBLT group on this could be that this group had already been exposed to and experienced using the target linguistic features in an unconscious and incidental manner in a context where their focal attention was on the meaning conveyance. As such, the TBLT students were more successful in forming relatively automated knowledge. While the mechanical drills that the PPP group practiced did not seem to have developed the automaticity needed for the EIT, the control group similarly had no preparation to face this type of test.
Additionally, another explanation for the better performance of the TBLT group as compared to the PPP and Control is that they had already had the opportunity to use their explicit knowledge in a communicative context. This argument is in line with DeKeyser's (1998, 2007) skill-learning theoretical model, which holds that explicit knowledge could be transformed into implicit knowledge if it is continuously used and in an incidental manner in a communicative context. This is precisely what the TBLT condition offered its students, i.e., a context to practice and use their already acquired explicit knowledge. Gaining more realistic results on automated knowledge necessitates more longitudinal studies since automated knowledge requires a longer time to develop than explicit knowledge.
As for the third part of the research question, the study's findings revealed that the PPP condition failed to prepare students to perform a task using the same linguistic features they had learned. In effect, this is a significant outcome of the study, indicating that the mere explicit command of a particular linguistic feature does not necessarily guarantee the successful use of that linguistic feature in communication. This finding might not have been obtained had the study not used TBLA. Therefore, it can also be concluded that tasks can set the grounds for the learners to learn the contextual discourse needed to use the target feature. In addition to the fact that TBLT fostered both explicit and implicit knowledge, it provided the language learners with the necessary discourse to master the target feature in the context of language use. In a nutshell, this contextual discourse could be a task-based counterpart of explicit and implicit knowledge that shows students how to use a structure in conjunction with a situation and the discourse thereof.
More importantly, the results of the study highlight the significance of using assessment tasks in evaluating the effectiveness of TBLT. In effect, if the study had chosen to use any assessment method other than TBLA, as was the case in Li et al.'s (2016) study, the results might have been completely different. The ultimate goal of every instruction is to enable the students to apply what they have acquired to real life; the declarative and automated knowledge should ultimately help learners perform with the language. The presence of these types of knowledge is meaningful only if they prove to be significantly useful in the context of language use. This is in line with the tenets of performance-based testing, where students' successful performance is evaluated through how they have completed the task rather than how they used the language to do the task. In performance testing, the evaluation criterion is not separate from the tasks. The study had two important findings: a) TBLT better prepares students to perform with the language they have learned, and b) TBLA provides a more holistic measure of students' ability to use the language and to perform a real task; without TBLA, evaluation of TBLT might yield distorted results.

Limitations
There are a couple of limitations to this study. First, the final number of students in this study was limited. The study began with 62 students in the first phase; however, in the post-test phase, the number of students dropped to 34. Additionally, the one-hour instruction used as the treatment was too short to yield a robust finding regarding implicit knowledge and task performance. Explicit knowledge might well be developed through a short one-hour treatment, but this is not the case with implicit knowledge, which, due to its automatic and incidental nature, requires more rehearsal and practice to develop than a one-hour instruction can provide.

Conclusion
In summary, the main purpose of the present research was to fill the gap in previous literature by using tasks as a means of assessment when comparing TBLT with other modes of instruction. The study's results empirically support Norris and East's (2021) argument that there must be a constructive alignment between teaching and assessment. They posit that there must be a reciprocal relationship between the tasks that students perform in the class and those they do in the assessment setting. This statement per se highlights the importance of using tasks to evaluate TBLT. The study has shown a slight advantage for the TBLT group in learning the target structure, which was assessed through task performance. Therefore, the study concludes that the best way to assess the instructional outcome of a TBLT class is by using its compatible assessment method, i.e., TBLA.
Another critical aspect of the present study was using the classroom's actual teacher. Most of the previous studies on this subject used the researcher as the instructor (e.g., De La Fuente, 2006; Lai et al., 2011; De Ridder et al., 2007; Shintani, 2011, 2013), which could bias their results. Other studies, such as Lai et al. (2011) and Li et al. (2016), used instructors with no prior training on TBLT, thereby making the findings of their studies questionable. In this research, however, more than one class was used, each taught by its own teacher. Using the classroom's actual teacher in the study had some advantages. For instance, it added to the ecological validity of the study as the classroom retained its naturalness. In simple terms, the class was not altered to cater to the purpose of the study.
Future research needs to delve into this issue by using longitudinal treatments to account for the effects of these language teaching approaches over longer periods of time. Additionally, studies using language learners at different levels of proficiency could shed more light on the effectiveness of these methods across proficiency levels.