Test case generation and history data analysis during optimization in regression testing: An NLP study

Abstract The generation of test cases to verify and validate the behavior of software or an application, as per the customers' requirements, is an indispensable activity in software industries. A tester could construct test cases to satisfy various objectives, which could be random or task-oriented at times. Most of the time, test cases are generated based on clients' specifications or requirements. These requirements are stated in natural language, and the manual derivation of test cases from such client-stated requirements could be a cumbersome and time-consuming activity for testers. In recent years, many practitioners have proposed natural language processing (NLP)-oriented solutions to automate or semi-automate the manual process of generating test cases from requirements; nevertheless, such studies imposed restrictions on how the clients should document or represent their requirements. This study, on the contrary, proposes an NLP solution that accepts free-format user requirements and applies text pre-processing, a combination of a dependency parser and the RAKE process, a statistical similarity measure, and template-based natural language generation (NLG) to translate them into detailed test cases. Apart from test case generation, with the aid of NLP tactics, this study also proposes a solution for encoding the historical data of test cases into numerical values. Such numerical scores serve as valuable data and create the proper insight for testers during test case optimization.


Introduction
After or during development, software undergoes a variety of periodic modifications to meet the needs of end users. However, these alterations could result in the failure of essential software functions. To ensure the quality of the updated software and the compatibility of the new code with the old features, regression testing is practiced (Gupta & Mahapatra, 2022; Lafi et al., 2021). During this testing, new test cases are defined, developed, and executed in order to verify and validate the upgraded software; nevertheless, studies indicate that one of the most challenging stages of the software test lifecycle is test case generation (TCG), requiring between 40 and 70 percent of the entire effort (Fischbach et al., 2023). Typically, a team of testers manually evaluates the specification records (system requirements, functional requirements, or test case specifications) in order to generate the test cases. These specification-based documents are generally the initial level of software documentation that describes the functionalities to be tested. Since manual generation is time-consuming, error-prone, and labor-intensive, researchers are rapidly adopting novel methods for automating the TCG procedure (Lafi et al., 2021).
Specifications/requirements are one of the primary sources of information for producing test cases. Because requirements must be straightforward to express, use, and comprehend, they are usually expressed in natural language (NL), namely English. Thus, NLP approaches are a suitable tool for generating test cases. Using NLP approaches to extract desired information from massive amounts of raw data has produced encouraging results. This has also boosted the interest of researchers in NLP techniques, particularly in the domain of automating software development tasks, including TCG (Ahsan et al., 2017). Several prior research works have addressed the semi-automated or automated generation of test cases from software requirements or a blend of user stories and acceptance criteria, but these specifications assumed a structured or semi-structured form. For example, a user story has a pre-specified framework provided by behavior-driven development, i.e., As a [role], I want [functionality], so that [business value]. Each user story's context and consequence are referred to as acceptance criteria and are expressed in a Given/When/Then format (Rane, 2017). In the actual world, users' feedback regarding any software's functionality or failure is seldom structured and may be prone to ambiguity and inaccuracies.
In IT industries, the process of updating software is a never-ending operation, which leads to an increase in the software's overall size and level of complexity. With this, the amount of TCG and execution also amplifies. As software testing experiences time and cost constraints, the execution of sizeable sets of test cases is almost unachievable. Therefore, test optimization has also seen an upsurge in popularity during the past few years. Multiple tactics, approaches, and methodologies have been suggested in an effort to optimize the number of test cases (Lou et al., 2019; Rizwan et al., 2022). Since regression testing is not a one-time operation, it should always be associated with some form of memory function. Given this, this study also concentrated on research works that took test execution history into account during optimization.
Test execution history provides a summary of the total number of tests that were carried out over the course of a particular time span (Rizwan et al., 2022). Besides keeping track of the number of test cases that passed, failed, or were abandoned, it also holds the description and severity rating of faults that were discovered by failing test cases. It has been observed that the studies associated with test execution history for optimization lack a lucid description of how the information from the historical data they evaluated is retrieved, what kind of features in the history they analyzed, and whether the extraction from enormous historical datasets is manual or automated. Some studies have simplified historical value as time-ordered observations from earlier runs to provide a selection likelihood to test cases (Wang & Zeng, 2016), but what data those previous runs hold is absent and should be discussed in depth.
To address such gaps, this study has considered all the above-mentioned reasons and propounded an NLP solution for TCG from loosely structured, free-form specifications and for handling historical data during the optimization of test cases. This study has undertaken user feedback documents and converted them into the desired tabular structure through syntactic analysis and a phrase identification algorithm. These user feedback documents are a kind of questionnaire that possesses free-form paragraphs written by users. Each paragraph in these documents is evaluated according to a specific priority. Through a similarity detection algorithm, the data in the tabular structure is fine-tuned. Later, this improved version is utilized to update the existing software requirement specification (SRS) document with the help of NLG. The updated SRS and the maximum probability of the existence of faults in a specific test code's module would aid in the generation of test cases.
Moreover, along with syntactic analysis, this study has also exploited NLP approaches such as named entity recognition (NER) to map the test execution history reports to a score in the range [0-2]. NER is likely the initial step in information extraction, which aims to discover and categorize named entities in texts into pre-defined sections, such as expressions of times, quantities, names of organizations, persons, locations, etc. The obtained scores would further explain the criticality of executing a test case in terms of its past execution information while performing optimization.
The remainder of this study is organized as follows: Section 2 clarifies the fundamental concepts underlying each mentioned terminology and method. Section 3 covers the overview of the works that have addressed NLP for TCG and have taken test cases' history into account during optimization. The overall structure of the proposed methodologies and a brief explanation are stated in Section 4, with the supporting analysis in Sections 5 and 6. The paper is concluded in Section 7, which also addresses several aspects regarding the study's potential future steps.

Preliminaries
This component of the study provides background information on the NLP approaches applied during TCG tasks and historical test execution report mining. Initially, the fundamentals of NLP and NLG, together with pertinent theoretical parts, are outlined. Afterward, the approaches utilized in this study for TCG are offered with a detailed explanation of why they are being adopted. In essence, NL corresponds to a language used by humans in daily conversation, such as English, Spanish, or Swedish. Such languages may be confusing, feature a wide, diversified vocabulary, and have words with many meanings; they may also be spoken with various accents. Despite this, humans can comprehend NL without much difficulty. NLP is a field that combines computer science, artificial intelligence (AI), and linguistics to investigate how computers could be used to interpret and arrange text or speech in NL in order to create useful applications (Ahsan et al., 2017; Garousi et al., 2020). Although NLP is witnessing significant growth, implementing NLP applications is still a difficult undertaking since contact with computers requires a precise and clear vocabulary, which NL lacks. The field of NLP combines natural language understanding (NLU) and NLG. With NLP, the text in a particular document could be analyzed and understood; however, NLU enables NL conversations (dialog) with computers. Furthermore, NLU is classified as an AI-hard problem, which means that a solution requires tackling the problem of developing general artificial intelligence: producing a machine that is as intelligent as humans (Garousi et al., 2020).
On the contrary, NLG differs from traditional computer-generated text as it allows computers to automatically generate NL content that matches how humans naturally communicate. The work in this study has considered NLG for automatically updating an SRS document with the requirements fetched from the user feedback documents. A standard NLP system is usually made of numerous processing stages, such as tokenization, morphological analysis, and syntax analysis (Figure 1) (Garousi et al., 2020). The objective of the task determines the levels that should be taken into account and implemented in a system. In this study, TCG relies heavily on tokenization, dependency parsing, and a keyword-extraction algorithm known as rapid automatic keyword extraction (RAKE).

Figure 1. A pipeline depicting typical NLP tasks.
Tokenization transforms the raw text into tokens or individual words based on some preset rules, such as punctuation, whitespace, or a dictionary (Ahsan et al., 2017; Garousi et al., 2020). Dependency parsing examines a sentence's grammatical pattern, i.e., it emphasizes the links between the words. RAKE attempts to identify significant phrases in the body of the text by assessing word frequency and co-occurrence with other terms in the text. For the mining of historical data, an NLP approach known as NER is utilized. Since tokenization and NER are well known within the software engineering industry, they have been employed by numerous NLP-integrated systems (Garousi et al., 2020). However, none of the extant software testing methodologies depends on RAKE or a hybrid of RAKE and a dependency parser for TCG. Section 2.1 describes dependency parsing in detail, whereas Section 2.2 covers the fundamentals of RAKE.

Dependency parsing
Dependency-based syntactic parsing is capable of separating sentences into several constituents and operates on the premise that there are direct connections between each linguistic unit (token) in a sentence (Garousi et al., 2020). For instance, for a particular user feedback, "I do not need or require any kind of stuff or anything," the dependency parser may discover what token, when handled as a headword, has which related children and what kind of relationship (dependency) exists between them. In a typed dependency structure, directed arcs represent such relations between the headwords and their children. The result from the dependency parser is illustrated in Figure 2. Since a dependency parser may be able to identify up to 37 syntactic relations while processing a phrase, this study has, for the sake of simplicity, emphasized some of the most used dependency tags in Table 1.
The manner in which the knowledge from a dependency parser is used varies between studies. This includes identifying the traditional subject-verb-object relationships in sentences as well as determining how and which words are used to describe or alter the subject. However, in this study, the fundamental function of dependency parsing during TCG is connected to word associativity. To comprehend the user's new software requirements, the negation-based relations from the feedback paragraph are first recognized and then extracted. On the contrary, while interpreting test history reports, a greater emphasis is placed on numerical linkages. Other NLP techniques, such as POS tagging, would not be able to capture such information, which is why a subsequent step (dependency parsing) has been considered. A minimal sketch of how such relations can be surfaced is given below.
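The following minimal sketch (not the authors' exact code, and assuming spaCy's small English model en_core_web_sm is installed) surfaces both kinds of links for the sentence from Figure 2:

```python
import spacy

# assumes the en_core_web_sm model has been downloaded beforehand
nlp = spacy.load("en_core_web_sm")

doc = nlp("I do not need or require any kind of stuff or anything")

for token in doc:
    # 'neg' tags a negation modifier attached to its head word
    if token.dep_ == "neg":
        print(f"negation '{token.text}' modifies head '{token.head.text}'")
    # 'nummod' tags numeric modifiers, the kind of link emphasized
    # later when parsing test history reports
    if token.dep_ == "nummod":
        print(f"number '{token.text}' modifies '{token.head.text}'")
```

For the sample feedback, the parser reports that "not" negates the head "need", which is exactly the signal used to decide whether a user is requesting a modification at all.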

RAKE
For the extraction of key phrases/words, an unsupervised, language- and domain-neutral statistical approach, i.e., RAKE, is considered, which Stuart Rose et al. suggested back in 2010 (Rose et al., 2010). The creators of this algorithm largely adhere to the notion that punctuation (phrase delimiters) and stop words, as well as other types of words with minimal lexical meaning, are scarce in keywords. This indirectly suggests that RAKE generates important phrases through the use of word collocation and co-occurrence. In particular, RAKE uses a list of stop words (a, an, the, of) and phrase delimiters to partition the text into content/candidate phrases and then applies co-occurrence-based scoring to rank them.

Apart from RAKE, there are a few other unsupervised extractive algorithms, such as TextRank and Yake (Rose et al., 2010). These algorithms differ regarding the pre-processing approach they adopt and the characteristics they consider to capture the lexical units. If the performance analysis between RAKE and TextRank is to be believed, the latter generates more unigrams (single words) as extracted information, whereas the results of RAKE are more bigrams and trigrams (phrases) (Figure 3). In addition, TextRank's incapacity in detecting and extracting relevant terms from the text could be seen in Figure 3. Yake, on the contrary, employs a set of five statistical features that are heuristically integrated to identify and allot a score to essential phrases. These phrases/words appear in ascending order of their scores, the lower-scored ones being considered the most significant. However, for the current TCG task, this could lead to deceptive information, as the most crucial phrases/words occasionally could sit at lower or medium positions despite having the highest or medium scores (Figure 3). Regardless, when less duplication among the information retrieved by RAKE, TextRank, and Yake is considered, the former provides more competent findings than the latter two (Figure 3).
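As a small, hedged illustration (the paper does not name the RAKE implementation it used), the community rake-nltk package exposes this split-and-score flow directly:

```python
from rake_nltk import Rake  # pip install rake-nltk; needs NLTK stopwords

# RAKE partitions the text on stop words and phrase delimiters, then
# scores the surviving candidate phrases by word co-occurrence
rake = Rake()  # defaults to NLTK's English stop word list
rake.extract_keywords_from_text(
    "As I mentioned, I need a total tax breakdown or additional "
    "gst feature in the application."
)

# highest-scoring phrases first; RAKE tends toward bigrams/trigrams
for score, phrase in rake.get_ranked_phrases_with_scores():
    print(score, phrase)
```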

Related study
In this component, the relevant work is discussed from two perspectives, beginning with a non-exhaustive summary of the research works that have processed requirement-oriented texts with NLP for TCG. Following that, the studies that have opted for the execution history factor during test case optimization are provided.

On the use of requirement-specific documents for TCG
Any stated software requirements bridge the gap between the developer and the client regarding what is expected to be developed and how the developer will address those expectations. Dwarakanath and Sengupta (2012) employed written requirements, notably functional requirements (FR), for the TCG task. In their methodology, each reported FR is treated as a separate sentence and is fed to their proposed instrument, entitled "Litmus." The authors integrated an NL syntactic parser and a variety of pattern-matching rules in Litmus to examine each sentence's testability and to recognize the test intents, and then finally generated the test cases (both positive and negative) out of each sentence. Ansari et al. (2017) also worked on FR; however, the NLP technique they utilized was not specified, and the test cases were merely the result of if-then-else-based information extraction from the FR document.
According to Carvalho et al. (2013), the standard of requirements is a critical factor for any productive testing process, and requirements should thus be stated using some (semi-)formal expressions such as software cost reduction (SCR) or Lustre. However, for some, such a formalism's syntax and semantics would be a baffling theory. Therefore, the authors adopted a different flow for TCG in which they first analyzed the requirements written in controlled natural language (CNL) through syntactic and semantic analysis and generated the case frames. These case frames are translated into specifications and then finally to test cases via the inclusion of SCR notations and the T-VEC tool, respectively. The work described in Carvalho et al. (2014) is a continuation of Carvalho et al. (2013), in which the authors analyzed their suggested technique on many domains, including control and reactive systems. Aoyama et al. (2021) also contemplated the semi-formal representation of NL-based software requirements as an intermediate phase during TCG. A rule-based code infused with the POS of words and dependency relation data was applied to the specifications to express them in a format written in accordance with the extended Backus-Naur form (EBNF). The authors conducted continuous reviews while structuring the semi-formal descriptions and translating them into a propositional network in order to check for the presence of defects and correct the same in the specifications. The test cases were then depicted in the form of a decision table that was drawn out from the derived semi-formal description with the aid of PyEDA. Verma and Beg (2013) approached basic NLP processes, that is, POS tagging and parsing, along with knowledge representation for the construction of test cases. Masuda et al. (2016), through a tree kernel methodology, computed the syntactic similarity among the requirement sentences to select the appropriate ones. The authors then discovered the conditions and actions from the sentences via a case and dependency analysis, which they later termed test cases. Salman (2020) suggested the usage of machine learning (ML) to map semi-structured test case specifications (TCSs) directly to executable C# test scripts. The author applied text pre-processing (tokenization, POS tagging, stop word removal) over the specifications and transformed them into feature vectors through the bag-of-words (BOW) technique. Each feature vector was linked to a label vector (names of C# scripts). Later, the stated data was passed to Linear SVC and KNN classifiers for the mapping of TCSs to suitable scripts.
Apart from the direct translation of NL requirements to test cases, there exists a category where the requirements are first converted into models, and then the test cases or suites are generated from them. Sarmiento et al. (2014) viewed the NL requirements in the form of scenarios and the application's pertinent phrases or words as lexicon icons. This entire set of information, along with scenarios, was converted into an activity diagram (behavioral model). In general, the activity diagram's paths were formed by the bonding between scenarios' episodes, constraints, and exceptions. The authors then employed Depth-First and Breadth-First Search procedures to retrieve the components of a test case from the activity diagram, which included the test scenario, variables, and their constraints. Allala et al. (2019) developed a requirement meta-model typically composed of unified modeling language (UML) class diagrams and object constraint language (OCL) norms. The authors took pre-existing user specifications, i.e., "use case + user story," in a structured NL manner and transformed them into a version that is coherent with the indicated requirement meta-model framework (source model). Afterward, the authors leveraged this source model to generate partial test cases (target model) using a series of NLP tag-based retrieval methods integrated with a transformation engine. Wang et al. (2022) adopted use case specifications (UCS) and a manually constructed domain model (class diagram) from UCS as major requirements. To avoid using behavioral models, the authors utilized a relatively easily interpretable form of UCS known as restricted use case modelling (RUCM) and employed NLP to capture domain entities. They collected the OCL criteria from the revised domain model and utilized them in conjunction with use cases to build the use-case-directed graph (test model) for each specification. The test model, together with OCL data, resulted in the creation of test inputs and scenarios, which aided in TCG. In order to facilitate the conversion of user stories into test cases, Fischbach et al. (2020) implemented an intermediary layer. Using dependency parsing, the authors transferred each retrieved acceptance criterion from the user stories into a dependency tree. Later, the authors scanned the dependency tree to yield cause-effect graphs (CEG) that functioned as their study's test models. Similar to the above-mentioned studies, there exists a bunch of research works that have opted for model-based TCG (Gröpler et al., n.d.; Nogueira et al., 2019). However, it has to be noted that whether the TCG task was specification-based, as in Dwarakanath and Sengupta (2012), Ansari et al. (2017), Carvalho et al. (2013, 2014), Aoyama et al. (2021), Verma and Beg (2013), Masuda et al. (2016), and Salman (2020), or model-based, as in Sarmiento et al. (2014), Allala et al. (2019), Wang et al. (2022), Fischbach et al. (2020), Nogueira et al. (2019), and Gröpler et al. (Table 2), the requirements that were utilized were written in either a semi-formal or a formal format. Cases where even the semi-formal versions of requirements are converted into a language-specific model to alleviate the ambiguity issues of NL requirements could increase the testing resources, since such formal models need expert knowledge to interpret. Apart from this, in the natural environment, the requirements portrayed by users are free from bounded expressiveness, and no such study has dealt with such free-form user requirements for TCG. To address such issues, the current work focuses majorly on devising an NLP-based strategy that could handle more unstructured and less bounded user requirements for TCG.

On the use of history factor during test case optimization
Research pertaining to test case optimization could be single- or multi-objective. Until now, distinct factors, including requirement coverage, statement/branch coverage, test case execution time, and numerous others, have been examined to achieve the optimization goal. However, it has been observed that, among such criteria, the execution history of test cases has been taken into account by a wide range of studies. A test case's execution history or trace record indicates its behavioral pattern in the past, more particularly in terms of detecting faults and the rate and severity of identified faults. Kim et al. (2017) computed a weighted score for each test case by statistically evaluating the failure verdicts of test cases from earlier sessions and the flipped outcomes of method test cases. The authors then re-ordered the test cases in an increasing manner depending on these scores. In the event of dynamic failure, the authors ordered their reordered test case sequence leveraging the correlation data.
Studies comparable to Kim et al. also exist in which historical knowledge was employed in either constant or variable format in equations to determine a specific sequence for test cases (Khalilian et al., 2012; Fazlalizadeh et al., 2009; Gupta et al., 2015; Busjaeger & Xie, 2016). Rahman et al. (2018) retrieved the total count of faults reported by each clustered test case in earlier test sessions as past knowledge. The authors adopted clustering to categorize test cases depending on their function-call similarities across multiple versions of test suites. Thereafter, the past data, as well as the level of connectivity between test cases, were used as the foundation for internal prioritization. Vedpal and Chauhan (2018) presented a list of parameters, such as the seriousness of faults, the test case's fault detection capacity, and many more, for priority purposes. As historical information, here also, the authors employed the count of faults discovered by the test cases, along with the execution time of test cases from earlier sessions, to provide a legitimate score to such factors.
On a significant scale, it has been seen in the existing literature that when studies suggest solutions for test case selection, reduction, or re-ordering while relying on the history of test cases, such history was exclusively failure-based: the frequency with which a test case failed in the past, the relationship between two co-failed test cases, the execution cost and time involved with a test case (Anderson et al., 2014; Najafi et al., 2019), and the counts of faults detected by a test case in terms of severity (Padmnav et al., 2019; Noguchi et al., 2015; Huang et al., 2012). Along with failure history for test case re-ordering, Prakash and Gomathi (2014) utilized diverse forms of coverage particulars (statement, branch, function, path) covered by the test cases as historical knowledge. Lima and Vergilio (2022) and Spieker et al. (2017) slightly modified the manner in which the history knowledge was consumed during the optimization of test cases in a continuous integration environment. The authors maintained the current test execution data in the form of reward and feedback, which then served as history for the next execution cycle.
From the studies mentioned above, it is clearly understood that historical data significantly influences test case optimization. Yet, the mapping or encoding of such records directly into some integer or Boolean values is something that is omitted in existing studies. It has been noticed that in every existing work, the history information is framed in a tabular structure that consists of either the fail and pass verdicts of test cases or values from some defined range for factors such as the fault detection capacity of a test case, the severity of faults detected, the execution time of a test case, and many more. In testing businesses, however, the history reports regarding a test case execution are actually an elongated description in NL. The depiction stated in the existing studies about the historical data could mismatch with the actual historical records consumed by testers in the natural testing environment. To address the aforementioned deficiencies, the current study focuses on generating an NLP-based solution that could realize the encoding of long, comprehensive test case history reports into a valid value or score that could further highlight the importance of executing a test case in further runs.

Proposed system models
The semi-automated procedure in Figure 4 offers a summary of the overall proposed TCG procedure. The entire flow commences with the elicitation of requirements from the feedback documents. Users' requirements in any format could be extensive or concise. Therefore, understanding which requirement is of utmost importance and should be processed first is crucial. The text of the requirement is filtered using basic NLP processes that include tokenization, punctuation removal, lowercasing, and spelling correction (Step 1). Determining the extent to which the text should be filtered or pre-processed so that it does not lose its real information, the sequence of the stated tasks, and which specific combination of them should be practiced was a challenge. The pre-processed requirement comes out in token format and is translated back into paragraph structure via a de-tokenizer. Tokenization and then de-tokenization of the exact requirement were necessitated because of the working nature of the spell corrector used during the pre-processing stage and of the parser and extractor, respectively.
The acquired requirement paragraph from the de-tokenizer is analyzed first through the dependency parser (Step 2) and then through RAKE (Step 3). The first analysis unveils the existence of dependent words that possess a negation-based link with the headwords, whereas the latter provides the significant phrases out of the requirement paragraph that directly emphasize where the user's specification leads in terms of the flow of the application. The retrieved particulars from Steps 2 and 3 are encapsulated into a tabular structure through the utilization of an application-specific corpus as a knowledge database (Step 4). A similarity measure known as Jaccard similarity is applied to this tabular structure to gauge the frequency with which the users have stated the modifications or the requirement of new features in their specifications (Step 5). Jaccard similarity is basically a statistic that contrasts text documents or sample sets on the scale of similarities and differences. The highest frequency of required modifications is investigated, and in a case where more than one modification possesses the same frequency, a list of parameters is computed until a clear picture of which modification has to be processed appears (Step 6).
The final feature requirement or modification is then retrieved from the table (Step 7) and is passed to the NLG framework, where the user-defined modification data is framed into more technical and non-ambiguous statements. The SRS document is updated with these generated statements (Step 8) and is referred to for further updates in the application code (Step 9). Moreover, the updated SRS document and the application's source code are taken as information packets to document the in-depth details of each class or module of the application (Step 10). The original user requirements stated in the feedback documents are once again explored to understand the points of faults or gaps in the application, and thus the count of faults found is mapped to the count of test cases to be formed. Eventually, the documented details from Step 10 and the count of test cases would aid in constructing full-fledged test cases.
Another proposed model, portrayed in Figure 5, illustrates the flow opted for translating the past execution data of test cases into integer-driven values. The data from each test case's historical records are first read and then extracted through a combination of NLP and non-NLP processes. An NLP process like NER is applied to those sections of the historical records that include data related to test team particulars, whereas the dependency parser works on fetching the details that are more linked with the failure or the coverage of the test case. Aside from this, the non-NLP approach entails directly identifying specific labels/fields within the historical records, followed by extracting the corresponding data. The entire set of retrieved information is mapped with the information maintained in the test-oriented database, and then preference values are estimated, which, when aggregated and linked to a specified range, generate the scores for test cases.

Case study: model 1
To legitimize the working of the proposed approach, this study preferred a price-calculating system as a sample application that has been regressively tested and is in the user acceptance stage. A series of "10" feedback documents are examined to assess acceptability and capture the users' thoughts, experiences, and requirements regarding the mentioned application. These documents are composed of a variety of queries, such as new upgrades or features that the user may require, complex and challenging locations for the user inside the application, the application's user-friendliness, and the amount of time the user has been working with the application. These queries are straightforward and are processed on a preference scale so that the tester can precisely capture the users' expectations (Table 3).

Preparatory stage (text purification)
The priority scale shown in Table 3 depicts the processing chronology of the queries. Since it is essential to know whether the features incorporated in the developed application are sufficiently accurate or whether there exist some other requirements that the user needs, the response regarding query "4" is extracted first. The text of the response might consist of words or expressions with absolutely no sense or context, such as punctuation marks, emoticons, misspelled words, and stop words, and thus removing or correcting them in the text comes under the basic pre-processing requirements. However, stop words such as "not" have been observed to contribute to the understanding that the user might not need any new modification in the application or that the modification is not associated majorly with the prime features of the application. Aside from this, it could be pre-assumed that the clients using the application have a moderate knowledge and understanding level, and thus the likelihood that the text contains emoticons is almost negligible.
To conduct the text purification tasks, this study has relied on conventional Python libraries such as re, NLTK, pyspellchecker, and jamspell. The text is initially checked for the existence of punctuation marks through the re (regular expression) library (Figure 6). This library uses a set of special characters that are held in a specialized syntax or pattern to search, match, or replace strings or sets of strings.
When writing, humans often capitalize words to draw emphasis to a particular word or a set of words; however, processing a combination of capital and lowercase words in a sentence might cause subsequent NLP tasks, such as the parser, to treat the same words differently. Therefore, the words in the punctuation-free text are thoroughly translated into lowercase (Figure 6). Before approaching spelling correction, the lowercased text is converted into token format, as the spellchecker module of the pyspellchecker library consumes its input as a list of individual words. The WhitespaceTokenizer class of the NLTK (natural language toolkit) library is employed for tokenization, which splits the words in a sentence on spaces, tabs, or newlines. NLTK constitutes a range of algorithms and packages for text-handling tasks such as semantic inference, POS tagging, tokenization, parsing, and numerous others.
In any case, the detection and restoration of misspelled words could be achieved through the use of either library, i.e., pyspellchecker or jamspell; yet, the current work depends on an amalgamation of the two. The rationale is that jamspell considers the word's surroundings and provides the most appropriate and diverse candidates/corrections for diverse contexts compared to pyspellchecker; howbeit, it requires the position of the misspelled word to generate accurate corrections. This particular data cannot be supplied to jamspell directly, as assuming which particular word in the text could be misspelled, or whether misspelled words even exist in the text, is harder. Therefore, pyspellchecker is utilized first, primarily yielding the information regarding the misspelled words, and then jamspell provides a list of the best possible corrections. Despite structuring and exploring a framework for spelling correction, this particular step needs human intervention. Cases such as the one portrayed in Figure 7 need the engineer's intelligence and approval in suggesting the accurate selection. A condensed sketch of this purification flow appears below.
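The sketch relies on the re, NLTK, and pyspellchecker packages named above; the exact rules, their ordering, and the jamspell hand-off may differ from the authors' implementation.

```python
import re

from nltk.tokenize import WhitespaceTokenizer
from spellchecker import SpellChecker  # pyspellchecker package

text = "I do not need; or require any kind of stuf or anything!!"

# 1. strip punctuation marks with a regular expression
text = re.sub(r"[^\w\s]", "", text)

# 2. lowercase so the parser later sees a single casing style
text = text.lower()

# 3. whitespace tokenization, since pyspellchecker expects a word list
tokens = WhitespaceTokenizer().tokenize(text)

# 4. pyspellchecker flags the misspelled tokens; in the paper's flow,
#    jamspell (given each flagged word's position) would then propose
#    context-aware candidates for an engineer to approve
spell = SpellChecker()
for word in spell.unknown(tokens):
    print(word, "->", spell.correction(word))  # e.g. stuf -> stuff
```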

Working with parser and extractor
To comprehend the syntactic dependencies, the text could be parsed with several Python libraries, such as NLTK with Stanford CoreNLP, Stanza, and spaCy. Among these, spaCy is considered, as it possesses a quick and efficient syntactic dependency parser with an option to visualize the dependencies through the dependency tree. After the noise removal, the text is parsed to identify the negation modifier dependency tags, since such links could help in knowing the users' exact specifications regarding the application. However, to what extent such links would exist in a text is unpredictable, and the extraction and investigation of every such link could make the procedure unwieldy; the parsed links are therefore screened against a list of application-centric words. Since, in a sentence, a verb is the sole concept that determines an action or an event, and there exist instances where a verb is occasionally treated as a noun or vice versa, the stated words in the list belong to the application-centric noun, verb, and adjective categories. The flow of how the user responses associated with query "4" are parsed and then processed through RAKE is demonstrated in Figures 8(a,b) and 9, respectively. The extracted data shown in Figure 8(a) would only be displayed and utilized if the text contained any negation-linked dependency; otherwise, in the case of non-negated responses, the generated parsed data would be worthwhile in filling up the omitted entries while framing the final acquired modification data. In Figure 9, the numeric values against which the key phrases appear in descending order illustrate the scores generated by RAKE. The detailed procedure behind how these scores are generated and why a particular set of words is displayed as significant key phrases by RAKE is portrayed in Figure 10(a,b).
In Figure 10(b), deg(w) denotes the total number of times a word co-occurs with other words in the text, while freq(w) denotes the occurrences of the word in the text. The value of deg(w) is computed from the data illustrated in Figure 10(a). Once each response regarding user query "4" is parsed and processed by RAKE, the acquired information is mapped with the corpus and stored in a tabular framework (Table 4). The stated corpus actually holds the catchphrases and terms associated with labels, where the labels point toward the information linked with the internal structurization of the price-calculating application. Basically, users could state their requirements without following an exact grammatical structure, and thus understanding what exactly the users want in an application could be a bit tricky for machines, especially when machines are reading and processing those requirements without human intervention. Hence, framing the acquired data as shown in Table 4 is necessitated for a better apprehension of users' requirements.
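The standard RAKE word score is deg(w)/freq(w), and a candidate phrase's score is the sum of its member words' scores. A from-scratch sketch follows; the candidate phrases are hard-coded here for brevity and approximate the Figure 10 response.

```python
from collections import defaultdict

# candidate phrases left after removing stop words and delimiters from
# "as I mentioned, I need a total tax breakdown or additional gst
# feature in the application."
candidate_phrases = [
    ["mentioned"],
    ["total", "tax", "breakdown"],
    ["additional", "gst", "feature"],
    ["application"],
]

freq = defaultdict(int)  # freq(w): occurrences of w across all phrases
deg = defaultdict(int)   # deg(w): co-occurrence degree of w

for phrase in candidate_phrases:
    for word in phrase:
        freq[word] += 1
        deg[word] += len(phrase)  # w co-occurs with every word in the phrase

# word score = deg(w) / freq(w); phrase score = sum of word scores
phrase_scores = {
    " ".join(p): sum(deg[w] / freq[w] for w in p) for p in candidate_phrases
}
for phrase, score in sorted(phrase_scores.items(), key=lambda x: -x[1]):
    print(score, phrase)  # the two three-word phrases each score 9.0
```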

Investigating similarities and breaking ties
To the feature_idn column of the data portrayed in Table 4, the Jaccard similarity/index/coefficient is applied. The reason behind applying the similarity measure to this specific data column is to address the most asked-for or needed requirement immediately. Other data columns might be able to provide insight into the requirement, but that would not necessarily be focused on what major clients or users want. This similarity measure primarily works on the intersection and union rules and applies them to the tokens to generate a numeric value of similarity. For instance, if Jaccard similarity is applied to the first two cells of the feature_idn column shown in Table 4, the generated similarity value would be 0.33333. To get the exact matches and the count of those matches, this similarity process is tracked for the numeric value "1" (Table 5).
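A minimal token-set version of the measure is shown below; the two cell values are placeholders rather than the actual Table 4 entries, chosen so that the result reproduces the 0.33333 figure.

```python
def jaccard(a: str, b: str) -> float:
    """Size of the token-set intersection over the size of the union."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not (set_a | set_b):
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# two cells sharing one of three distinct tokens yield 1/3
print(round(jaccard("gst requirement", "gst feature"), 5))  # 0.33333
```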
As depicted in Table 5, a tie situation was encountered while capturing those cells that possess the similarity score of "1" with the highest count. Since it could be argued that there exist only two major modifications stated by users, the developer or the engineer could effortlessly integrate these into the code, and the tester could test them in no time. However, with sizeable code, the extracted data illustrated in Table 5 might become massive, and handling such data or requirements in an environment where time and cost are already substantial constraints could make the application's update and testing process stressful for the developer and the tester. Therefore, the acquired requirements (Table 5) are further narrowed via tie resolution.
Other than query "4", there exist three more queries in the feedback document whose priority values could be referred from Table 3. Following that, query "1" highlights the most consequential parameter associated with the user's experience. Understanding how long each user has been associated with the application could aid in knowing the stability of the acquired requirements. The experience of each user, following the feedback doc_no. column stated in Table 5, is extracted through the Python libraries numerizer and re, and an average is computed (a sketch of this step is given below). For the feature with the "Gst requirement" label, the average user experience appeared to be 6. Additionally, the same score, i.e., 6, appeared for the "Pay requirement" feature. Such a situation triggered the investigation of query "2", which relates to the user-friendliness parameter.
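A hedged sketch of the experience-averaging step follows; the response phrasings are invented for illustration.

```python
import re

from numerizer import numerize  # numerizer package

# hypothetical experience answers pulled from the relevant feedback docs
responses = ["around six years", "5 yrs", "seven years or so"]

values = []
for response in responses:
    # numerize rewrites spelled-out numbers ("six years" -> "6 years")
    match = re.search(r"\d+", numerize(response))
    if match:
        values.append(int(match.group()))

print(sum(values) / len(values))  # average experience, here 6.0
```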
An application's user-friendliness reflects how the site or application is presented to the user and whether the application is straightforward to manage and interpret. If an application is user-friendly, it signifies that a greater number of people have accessed it. Given the Series_idn data (Table 5), which contains the application's module number and is linked to the Feature_idn data, it would be practical to determine which module is the most frequently visited by users. That is, understanding the more user-friendly module first and then upgrading the new feature required there would be more acceptable than repeating the process with the rarely visited one. This notion has its roots in the fact that the TCG utilized in the current investigation is more concerned with new modifications rather than with the refurbishment of GUI (graphical user interface) components.
Each feedback document is explored for the user response regarding query "2". The text of the response is pre-processed in a similar manner as that of the response to query "4". The pre-processed text is mapped with the information available in the corpus to reveal the least complained-about and, thus, the most accessible module of the application. The data relating to the acquired module is then snipped from the data shown in Table 5.

Updating formal specification document and finalizing TCG
A detailed description of what an application or software will do and how it is anticipated to function is technically documented in an SRS. This is the document that not only provides a thorough depiction of the entire application in terms of the FRs or non-FRs to be covered but also keeps the development team in sync. Before implementing any new changes to an existing application, the accompanying SRS must be updated. The engineers would then refer to this updated SRS while modifying the application's source code. For such a process of upgradation, this study has considered the concept of template-based NLG. The template-based methodology generates text automatically using preset templates. These templates are developed by humans and include data placeholders that could be populated to create an entire phrase or paragraph, as sketched below.
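A toy version of this placeholder-filling step is shown here; the template wording is illustrative and not the exact engineer-like template of Figure 11.

```python
# illustrative SRS template with data placeholders (not the Figure 11 text)
SRS_TEMPLATE = (
    "The system shall provide a {feature} within module {module} "
    "so that the user can {purpose}."
)

# fields as they might arrive from the similarity/tie-resolution stage
modification = {
    "feature": "gst breakdown feature",
    "module": "M1",
    "purpose": "view the total tax applied to a computed price",
}

# populate the placeholders to produce an SRS-ready statement (Step 8)
specification = SRS_TEMPLATE.format(**modification)
print(specification)
```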
The snipped data acquired from the similarity detection and parameter estimation stage is allowed to settle into the template structure shown in Figure 11. For some readers, the specification generated by this template-based methodology might look like canned text. However, from the perspective of the testing environment, the text or details present in an SRS document obey a specific style and choice of vocabulary, and hence such a format could only be achieved through template-based NLG. With the SRS updation, the source code and, thus, the class expansion document are updated in parallel by engineers. A class or module expansion document basically involves the critical details of each module or class embedded in the source code, for instance, the list of attributes or inputs with their data types, the list of independent paths with the constraints in each path, and many more. This set of information cannot be derived from the SRS document; thus, a class/module expansion document is separately maintained. In case this document did not exist before, it must be constructed after the upgradation procedure.

Up to now, through the proposed flow, it is understood what requirement the user needs, according to which the source code has to be updated, and how the updated source code and SRS document would aid in documenting the details of each class or module inside the code. However, how the test cases would be generated from this data is still unknown. Essentially, the class/module expansion document supplies some information for the development of test cases; nevertheless, it is unknown how many minimal test cases should be produced to support the operation of the new modification. This prerequisite, in turn, highlighted the importance of investigating the existence of faults or difficult-to-access sections in an application from the user's perspective (query "3"). The responses regarding query "3" are pre-processed using the libraries stated in Section 5.1, and then, using the information from the database, the module-related data is extracted from the responses and mapped in a dictionary format (i.e., {"F1": "-", "F2": "M1", "F3": "M2", "F4": "M1", "F5": "-", "F6": "M1", "F7": "M2", "F8": "M1", "F9": "M1", "F10": "M1"}).
The frequency of each module in the dictionary states how hard it was for the users in terms of accessing it or the incorrect working of a particular feature. Since the modification being fetched from the raw user responses indicates module "M1", the whole focus is shifted to its frequency in the dictionary. This theory could be supported by the fact that before generating the test cases for a new feature or component, the tester should be aware of how many users have faced issues in the same module where the new feature is being integrated. This knowledge would not only decide the new test cases' count but also shift the partial focus of the test cases' working toward validating the faults. A range is fixed to enumerate the count of test cases to be generated; for instance, a frequency in the range [0-6] would depict a count of 4, a frequency in the next range a count of 8, and so on. For the current case, the frequency of "M1" is noted to be 6, and thus four new test cases are generated to validate the faults and the functionality of the new feature within the existing application. The tester eventually could construct absolute test cases by utilizing the class/module expansion details and the data regarding the count of test cases (Figure 12).

Case study: model 2
A historical record, in a testing environment, includes data in the form of manually or automatically generated reports that indicate the performance of the test cases in past executions. These reports generally could have different formats for storing the execution data, and thus the amount of data could also differ (Figure 13). In order to demonstrate the viability of proposed model 2, this study utilized the same price-calculating application. Because this application has already been regressively tested with 30 or more test cases in the past, there is a vast history of the execution of these test cases. For the current run, these 30 test cases were already optimized, and new test cases were generated to accommodate the modifications. This situation clearly illustrates the concurrent environment that testing industries use, due to resource constraints, for optimizing test cases and integrating updates and modifications into applications. In subsequent runs, historical data from a set of 25 test cases will be examined for optimization.

Data retrieval from history reports
Each and every particular mentioned in the history report shown in Figure 13 is important. However, there exists a set of particulars that could contribute to weighing the past performance of a specific test case. The data from the historical reports are first read through language-specific libraries (in this case, the python-docx library is utilized), and then the essential sections within the reports are investigated.
Apart from the sections that reveal the fail/pass status of a test case or the requirements that it covered in the past, roles and responsibilities is one such section that is discovered to be by far the most important in a history report. This section could contain critical information related to who from the test team, with how much experience in the testing field, handled the creation-to-execution phases of a test case, how many constraints are being targeted with a test case, how many potential faults were detected in the case of a test case failure, and many more. To extract such important particulars, this study has deployed spaCy's NER model and dependency parser (Figure 14). For the details that reside in sections of the history report such as incident number, results, and requirements to be tested, the direct label-based (non-NLP) extraction described in Section 4 is applied.
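A rough sketch of this mining step is given below; the report file name and the NER labels of interest are assumptions, since the actual report contents are not reproduced here.

```python
import docx  # python-docx package
import spacy

nlp = spacy.load("en_core_web_sm")

# read a test case's history report (hypothetical file name)
report = docx.Document("tc_07_history_report.docx")
text = "\n".join(paragraph.text for paragraph in report.paragraphs)

doc = nlp(text)

# NER over the roles-and-responsibilities section: PERSON for team
# members, DATE/CARDINAL for experience figures and counts
for ent in doc.ents:
    print(ent.text, ent.label_)

# numeric dependency links for failure/coverage details, e.g. "3 faults"
for token in doc:
    if token.dep_ == "nummod":
        print(token.text, "->", token.head.text)
```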

History value generation via database mapping
To enumerate a value that could highlight the historical importance of each test case, a list of parameters is considered. This list corresponds directly to the data extracted in Section 6.1. A set of six parameters is evaluated, including the tester's grade, the module's cost, the fail/pass status, the independent path's cost, the coverage and importance of requirements, and the positions of the test team members (other than the tester) involved in the process, along with the crucial data they identified. Parameters such as the module's and path's cost indirectly point toward each statement's significance in the source code. Testing industries always maintain a database that keeps the particulars of each and every member involved in the application's development and testing tasks. This study has utilized the information from such a database to evaluate a priority value for the parameters related to the grade of the tester and the other test team members. The aggregated score of the stated parameters for a specific test case is mapped to the range [0-2] to lessen complexity issues while utilizing it for the optimization objective (Table 6); a sketch of this mapping is given below.
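The following sketch illustrates the aggregation-then-bucketing idea; the per-parameter values, their equal weighting, and the bucket thresholds are all assumptions, as the paper does not publish the exact formula.

```python
def history_score(params: dict) -> int:
    """Aggregate six preference values (each assumed in [0, 1]) and map
    the total onto the [0-2] range used in Table 6."""
    ratio = sum(params.values()) / len(params)  # unweighted aggregation
    if ratio < 1 / 3:
        return 0  # historically low-value test case
    if ratio < 2 / 3:
        return 1
    return 2      # historically critical test case

# invented parameter values for one test case
tc_params = {
    "tester_grade": 0.8, "module_cost": 0.6, "fail_pass_status": 1.0,
    "path_cost": 0.5, "requirement_coverage": 0.7, "team_positions": 0.4,
}
print(history_score(tc_params))  # -> 2
```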

Discussion and final analysis
Using NLP, this study proposes a solution to generate requirement-based test cases and to transform the historical data of test cases into numerical scores. As part of the proposed methodology for TCG, a number of pre-processing tactics, including punctuation removal, tokenization, case changing, and others, are used to eliminate noise from the text. The pre-processed text is examined from two angles: the former incorporates a dependency parser, whereas the latter employs the RAKE method. The details received from the NLP pipeline are refined with a statistical similarity-computing measure to discover the desired and most reasonable modification from the user requirement documents (in this case, feedback documents). In subsequent steps, template-based NLG is utilized to update the SRS document with the acquired modifications, which in turn improves the details of the class/module expansion. Such details assist the development of test cases. For historical data mapping, a NER model, along with a parser, is employed. Ultimately, the work in this study is supported with a benchmark case study.
Users are noted to have a wide perspective when documenting requirements, especially when it comes to the casing of words. This is why this study considered lowercasing the text of the requirements while pre-processing it, since the parser could generate different grammatical tags and dependencies for the same word with different casing styles. Consequently, the proper casing of the text plays a prime role in determining how the information will be processed and displayed. The requirements fetched through the NLP process undergo a similarity check. While exploring, this study found Jaccard and Cosine to be the two most popular and widely applied similarity-based metrics. However, when applied, it was found that the Jaccard similarity measure generates clearer and more precise results on the fetched requirements than Cosine. This is due to its disregard of repetitive words.
It could be argued that a simple way of similarity detection could be practiced instead of such statistical measurements; however, those would not work in the case when requirements or features that possess maximum similarity are approached. This study has centered its methodology on lowering human effort, in terms of both the testers constructing the test cases out of requirement documents and the users, who can document their requirements in free-form English sentences with no constraints on expression. Thus, this objectivity is maintained in each step of processing so that more accuracy could be gained in terms of constructing the right test cases for the most appropriate user requirement and with fewer obstructions when the tester handles the process. The choice of template-based NLG adheres to this theory. There exist a ton of deep learning models, including neural network-based NLG, that could be taken into consideration; however, deciding the ratio of training and testing data and the tuning of hyperparameters could sometimes be taxing.
Additionally, the whole idea of imposing automatic, developer-like updates to the SRS document after considering the fetched requirement as input could only be achieved through the template approach, since the terminologies and the way specifications are described in the testing environment differ from the human-like text that these deep learning models are trained on. Adaptation to changes in the testing industries is so frequent that, even if such models were trained with domain-centric data, maintaining them at the pace of changes would be difficult. One minor downside of the templates designed in this study is that they are not generalized and may vary depending on the application. However, the effort required in designing such application-specific templates would be less than that of maintaining the deep learning models.
To allocate a specific label to the key phrases extracted through RAKE, this study relied on knowledge derived from the corpus. With this study targeting users with a moderate understanding, the information residing in the corpus was relevant for key phrase extraction. However, in the future, if well-experienced stakeholders are targeted, synonym and abbreviation rectification should also be incorporated during text pre-processing for internal validity with the database. This study may also lead to full automation and tool-making in the future. Currently, the need of the industries is an automated tool that could 1) generate requirement-based test cases, and 2) assign a past execution score to the test cases so that the level of criticality associated with executing a test case could be known to a tester while evaluating it.

Conclusion and future scope
Considering the format-boundedness foisted on the expressions/requirements of external users and the usage of language-specific test models while translating them into test cases, this study has proposed an NLP-based framework (Model 1) that prefers unstructured user specifications and translates them into thorough test cases via NLP methodologies, a statistical similarity measure, and template-based NLG. Later on, considering the wide importance and usage of the historical data of test cases during optimization and the lack of stable and relevant measures to weigh it, this study suggested an NLP-based solution (Model 2) for the automatic translation of the historical particulars of test cases into numerical form.
The case study for Model 1 demonstrated that despite the diverse arrangements of words and the constant threat of ambiguity in the user-stated requirements, Model 1 attained an accuracy of 99% in addressing the requirements of users and translating them into test cases. The 1% deterioration could arise due to non-breakable or puzzling tie situations during the parameter estimation stage. In contrast, the validation of Model 2 with a case study indicated its practical benefits to testing industries. Currently, the proposed models, along with their integrated NLP strategies, represent the initial endeavor to incorporate and leverage the beneficial aspects of NLP in the regression testing domain. This paves the way for further exploration of potential future directions.
In Model 1, it would be worthwhile to investigate the efficacy of widely used key phrase extraction methods like KeyBERT compared to RAKE across a diverse set of user-stated requirements. Ensemble methods could be considered as an alternative to dependency parsing to determine whether extractors are necessary, or whether parsing alone can sufficiently retrieve the required information from the requirements. Furthermore, in accordance with the testing environment's needs, the feedback documents could be enhanced by incorporating additional elements such as complexity and the functional glitches faced by the clients. This could also be done in cases when complex applications related to wireless sensor networks (Ashraf et al., 2020), supply chain management (Ashraf et al., 2020), or IoT healthcare (Ahmad et al., 2020) are under test. Since these applications could possess sizeable and complicated modules, the clients' requirements in turn could be complex, wide, ambiguous, and could have confusing interdependencies. One prospective avenue for future exploration could involve validating the proposed models on such complex applications. Apart from this, it would be beneficial to assess the performance of Model 2 on a broad range of historical reports encompassing various formats, sections, and levels of detail. Last but not least, full automation of Model 1 could be practiced.

Figure 2. Diagrammatic view of a sample dependency tree.

Figure 3. Benchmarking RAKE against existing and widely accepted unsupervised keyword extraction algorithms.

Figure 4. Block representation of the proposed TCG structure (model 1).

Figure 5. The proposed automated structure for encoding the historical data of test cases into numeric values (model 2).

Figure 6. Preprocessing flow for a sample response text linked to query "4".

Figure 7. Exemplar view of out-of-domain corrections suggested by the jamspell library for a sample misspelled word.

Figure 8. (a) Examination of a sample response via the dependency parser; (b) partial view of the dependency tree generated for the sample response outlined in part (a).

Figure 9. RAKE capturing critically important phrases and words from the sample response.

Figure 10. (a) Word-degree matrix constructed through RAKE for the response "as I mentioned, I need a total tax breakdown or additional gst feature in the application."; (b) condensing the matrix data to compute the final scores of phrases and words.

Figure 11. Probable engineer-like templates suggested for framing the extracted requirement into sentence structure.

Figure 12. Visualizing a new test case after generation.

Figure 13. A sample template for a test case's history report.

Figure 14. Detailed NLP-oriented flow for a piece of sample information residing in the history report of a test case.