Evaluation of competency methods in engineering education: a systematic review

The purpose of this systematic review is to evaluate the state-of-the-art of competency measurement methods with an aim to inform the creation of reliable and valid measures of student mastery of competencies in communication, lifelong learning, innovation/creativity and teamwork in engineering education. We identified 99 studies published in three databases over the last 17 years. For each study, purpose, corresponding methods, criteria used to establish competencies, and validity and reliability properties were evaluated. This analysis identified several measurement methods of which questionnaires and rubrics were the most used. Many measurement methods were found to lack competency definitions and evidence of validity and reliability. These show a clear need for establishing professional standards when measuring mastery of competencies. Therefore, in this paper, we propose guidelines for the design of reliable and valid measurement methods to be used by educators and researchers.


Introduction
Over the last 20 years, accreditation boards and educational stakeholders worldwide have emphasised the importance of integrating transversal competencies in engineering education curricula in order to prepare students for the engineering labour market (American Society for Engineering Education 1994;Engineering Accreditation Commission 2000;UNESCO 2010). Transversal competencies were first defined by Care (Care and Luo 2016) as 'skills, values and attitudes that are required for learners' holistic development and for learners to become capable of adapting to change' and are also known in the literature as employability skills (Markes 2006), generic skills (Bennett, Dunne, and Carré 2000), key competencies (Organisation for Economic Co-Operation Development 2005), non-technical skills (Knobbs and Grayson 2012), non-traditional skills (Crawley et al. 2007), professional skills (Shuman, Besterfield-Sacre, and McGourty 2005), soft skills (Whitmore and Fry 1974), transferable skills (Kemp and Seagraves 1995), and twenty-first century skills (Council 2013).
The growing emphasis on transversal competencies in engineering education has triggered the need to create robust methods that measure transversal competencies (Shuman, Besterfield-Sacre, and McGourty 2005). However, assessing students' level of mastery in transversal competencies is difficult, in part because the different engineering education communities, government bodies, and employers have not reached consensus on how to define transversal competencies or on which behaviours demonstrate mastery (Shuman, Besterfield-Sacre, and McGourty 2005). In addition, it is difficult to assess transversal competencies independently, because they are often intertwined with technical competencies (Shuman, Besterfield-Sacre, and McGourty 2005; Badcock, Pattison, and Harris 2010). These issues have hindered the development of competency measurement processes.
This work is part of an Erasmus+ Knowledge Alliance known as the PREFER project that aims to improve the employability of future engineers. Within this project, we are developing curriculum elements that assist students in developing transversal competencies in communication, lifelong learning, innovation and teamwork. To evaluate the effectiveness of curriculum elements stimulating these competencies, we have reviewed the competency measurement methods present in engineering education literature.
With this review, we aim to inform the creation of reliable and valid measures of student mastery of competencies in communication, lifelong learning, innovation/creativity and teamwork in engineering education. To do so, we look at methods which are used 1) to evaluate course and programme effectiveness to enhance the quality of teaching and student learning, 2) to assess students' performance with the purposes of giving summative grading and/or formative feedback, and 3) to measure students' abilities in order to characterise student populations.
The following research questions were addressed in this review: (1) What are the methods used to measure the competencies: communication, innovation/creativity, lifelong learning and teamwork? (2) Are validity and reliability measured in the studies considered, and if so, which techniques are used? (3) What is the purpose of the measurement used in the study? (4) Which criteria are used to assess these competencies?

Background
This section provides the reader with the motivation for the selected competencies in study. This selection has been carried out using scientific and industry literature and within the confines and scope of the PREFER project.
The need to focus on transversal competencies in the engineering curricula was first highlighted in 1996 by McMasters and Matsch (McMasters and Matsch 1996) in the Boeing list of 'Desired Attributes of an Engineer'. This list required engineers to have good communication skills (written, verbal, graphic, listening), the ability to think both critically and creatively, curiosity and a desire to learn for life, and a profound understanding of the importance of teamwork (McMasters and Matsch 1996).
Further emphasis on competencies such as communication, working in teams, and lifelong learning was given by the new ABET Engineering Criteria, which came into effect in 2000 (Engineering Accreditation Commission 2000), and the Washington Accord (American Society for Engineering Education 1994). Similarly, in Europe after the Bologna process, which started in 1999, the European Network for Accreditation of Engineering Education (ENAEE) has set these three competencies as an important part of engineering programmes.
A resulting engineering education initiative, called CDIO (Conceive, Design, Implement and Operate), which started in 1997 at MIT and is now a worldwide initiative, has developed a list of competencies which include creative thinking, curiosity and lifelong learning, multidisciplinary teamwork, and communications (Crawley et al. 2007).
In summary, we chose to limit ourselves to the competencies of communication, teamwork and lifelong learning, because a comparison of the competencies present in all the previously mentioned literature showed agreement on the importance of these three competencies.
A fourth competency, innovation/creativity, was added within the framework of the PREFER project and was taken from the list of 'Great Eight Competencies' (Bartram 2005), a validated tool available and used in this project. This competency was found to be important based on the outcomes of a large industry consultation by another PREFER project partner (Craps et al. 2018). Considering the challenges of technology in the future, this competency is acknowledged as essential for engineering students not only by the PREFER project but also by the wider engineering education community (Badran 2007; Crawley et al. 2007; Cropley 2015; Kamp 2016).

Methods
In this section, we describe the data collection methods used to carry out the systematic review (summarised in Figure 1) and report on the characteristics of the studies found.

Data collection
This review was carried out following the methods outlined in the practical guide to systematic reviews by Petticrew and Roberts (Petticrew and Roberts 2006). Following this method, the research questions were first framed, as stated in the Introduction. Next, the databases were chosen and the search terms defined. The search was carried out in October 2017 using three databases: ERIC (education indexes), Scopus (science, technology, medicine, social sciences, and arts and humanities indexes), and Web of Science (sciences, arts, and humanities indexes). The keywords communication, innovation, creativity, lifelong learning, life-long learning, teamwork, or collaboration were used in combination with measure, assess, method or evaluate, and with engineering, in each of the three databases. In addition, controlled library terms (see the PRISMA diagram in Figure 1) were applied after the keywords to filter the relevant studies. The search was limited to English-language studies in peer-reviewed literature, scientific journals, and conference proceedings from 2000 to 2017. The choice of the year 2000 as the starting point reflects the introduction of the ABET criteria for engineering programmes in that year (Shuman, Besterfield-Sacre, and McGourty 2005). Within these parameters, 332, 391, and 349 studies were identified in Scopus, Web of Science, and ERIC, respectively. From these studies, 85 duplicates were removed, resulting in 987 studies to be considered.
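The keyword combination and the record counts described above can be sketched in a few lines of Python. This is only an illustration: the Boolean syntax, the quoting of multi-word terms and the function names are assumptions, since each database uses its own query syntax and the exact search strings are not given here. The counts are the ones reported in the text.

```python
# Keyword groups taken from the search description in the text.
COMPETENCY_TERMS = [
    "communication", "innovation", "creativity", "lifelong learning",
    "life-long learning", "teamwork", "collaboration",
]
ACTION_TERMS = ["measure", "assess", "method", "evaluate"]
CONTEXT_TERM = "engineering"

# Hits per database and duplicates removed, as reported in the text.
HITS = {"Scopus": 332, "Web of Science": 391, "ERIC": 349}
DUPLICATES = 85


def or_group(terms):
    """Join terms with OR, quoting multi-word terms (assumed syntax)."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query():
    """Combine the three keyword groups with AND (hypothetical generic form)."""
    return f"{or_group(COMPETENCY_TERMS)} AND {or_group(ACTION_TERMS)} AND {CONTEXT_TERM}"


def records_after_deduplication(hits, duplicates):
    """Total records left for screening once duplicates are removed."""
    return sum(hits.values()) - duplicates


if __name__ == "__main__":
    print(build_query())
    print(records_after_deduplication(HITS, DUPLICATES))
```

The second function simply reproduces the arithmetic of the screening step: the three database totals minus the removed duplicates.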
The third step of the method was to formulate the inclusion and exclusion criteria. To be included, the study:
• Was performed on engineering students in higher, tertiary and postsecondary education. Studies on primary and secondary education, training of practising engineers, and non-engineering programmes were excluded.
• Looked at at least one of the selected competencies: communication, innovation/creativity, lifelong learning and teamwork.
• Reported on methods used to measure students' performances (i.e. grading and feedback), to evaluate course and programme outcomes, and to measure students' abilities in non-related courses.
• Reported its aims and research questions, contained an adequate description of the data (country, participants, etc.), and provided answers to the research questions.
The first author examined the titles and abstract content of the studies found against the first two criteria. Then, the same author scanned the full texts (110 studies) against the last two requirements. Studies that did not fulfil the criteria were removed from this study.
From this analysis, 99 suitable studies were identified and managed using an EndNote TM citation database.
To answer the research questions, data about the measurement criteria, the methods used to measure each competency and the purpose of the measurement ((1) students' performance for formative and summative assessment, (2) evaluation of course/programme effectiveness and (3) characterisation of students' abilities) were extracted. In addition, the first author screened the studies to search for the use of the main types of validity and reliability measurements, as recommended by Cohen (Cohen, Manion, and Morrison 2007): content validity, construct validity, reliability as stability, reliability as equivalence and reliability as internal consistency. These data were recorded on a data sheet.

Study characteristics
When looking at the characteristics of the studies, only 17% of the studies were published between 2000-2009, compared to 83% published between 2010-2017. The analysis of the geographical spread of the studies shows that most studies (64%) on competency measurement originated in North America, followed by Europe (19%), South America (7%), Asia (5%), Australia (3%), and Africa (1%). Moreover, 75% of the studies looked at only one competency (see Figure 1). Only 2% of the studies (Moalosi, Molokwane, and Mothibedi 2012; Narayanan 2013) looked at all four competencies. Communication was the competency which was most frequently studied (44% of the studies), followed by teamwork (36%), lifelong learning (29%) and innovation/creativity (25%).

Results
The findings of the systematic review are structured to address the research questions. Firstly, the type of methods used in the studies to measure competencies is described, as well as their advantages and disadvantages. Secondly, valid and reliable methods found in the literature studies are presented. Finally, we report on the best methods per research purpose and per competency according to their advantages and disadvantages, and the validity and reliability of the measurement methods reported.

Type of methods
In the studies analysed, seven different measurement methods were found: questionnaires, rubrics, tests, observations, interviews, portfolios, and reflections. Questionnaires and rubrics are the most common (75%) assessment methods reported.
Questionnaires, which gather information from respondents through a set of written questions, were used in the form of self-assessment, where students assessed their own perceptions about their skills (Strauss and Terenzini 2005; Garcia Garcia et al. 2014) and attitudes (Douglas et al. 2014), or peer assessment, where students assessed each other (Zhang 2012). While questionnaires are easy to develop and take little time to administer, they report perceptions, which are prone to bias (Douglas et al. 2014). Another issue observed was that the majority of the questionnaires used Likert scale questions and were administered at only one point in time, therefore ignoring the effect of social and process changes. To take this effect into account, some studies used pre- and post-questionnaires (Waychal 2014; Gerhart and Carpenter 2015; Ngaile, Wang, and Gau 2015), administered at the beginning and at the end of the programme or course, which allowed changes in student competencies to be observed.
Rubrics, scoring methods with or without detailed descriptions of levels of performance, were used by faculty (Gerlick et al. 2011) or industry representatives (Hotaling et al. 2012) to assess written reports and oral presentations, design projects, and capstone courses. Rubrics with detailed descriptions of levels of performance guided the assessors and made their judgements more uniform (Flateby and Fehr 2008; Scharf 2014; Eichelman, Clark, and Bodnar 2015), which increased inter-rater reliability and minimised the subjectivity of the competency measurement process (Fila and Purzer 2012).
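Inter-rater reliability for a rubric can be quantified in several ways. As one illustration, the sketch below computes Cohen's kappa for two assessors who score the same pieces of student work on a rubric. The choice of Cohen's kappa, the function name and the example scores are assumptions made for illustration and are not taken from the reviewed studies.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical rubric levels.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same, non-empty set of items.")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's level frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical level throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical rubric levels (1-4) given by two assessors to eight reports.
    faculty = [1, 2, 3, 3, 2, 1, 4, 4]
    assistant = [1, 2, 3, 2, 2, 1, 4, 3]
    print(round(cohen_kappa(faculty, assistant), 3))
```

Kappa treats rubric levels as unordered categories; for ordered levels, a weighted variant or a correlation between raters may be more appropriate.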
Tests, in the form of written and proof-of-concept tests, were frequently used, alongside questionnaires and rubrics, to measure innovation/creativity. Similar to questionnaires, they were administered to measure skills or abilities, either after the course (Charyton, Jagacinski, and Merrill 2008; Charyton et al. 2011) or before and after the course (Shields 2007; Robbins and Kegley 2010). As with questionnaires, the use of pre- and post-tests was considered a good strategy to ensure the validity of the method (Cohen, Manion, and Morrison 2007).
Observations of student behaviour were used as a stand-alone method, carried out by teaching assistants (Sheridan, Evans, and Reeve 2014) or peer students (Pazos, Micari, and Light 2010), but also in combination with other methods, e.g. interviews (Dohaney et al. 2015). As a good practice, most of the observations were carried out using frameworks or rubrics to guide the measurement.
Interviews, in which an interviewer asks questions to an individual or group of interviewees, were also used as a stand-alone method (Dolan et al. 2011), but mostly in combination with other instruments such as questionnaires (Barnes, Dyrenfurth, and Newton 2012; Dunai et al. 2015; Eichelman, Clark, and Bodnar 2015). Both observations and interviews are time-consuming for assessors and require training. However, observations capture authentic student behaviour and attitudes, and interviews allow depth and flexibility in student responses. An alternative to common observations used by (Besterfield-Sacre et al. 2007) is work sampling observation. This type of observation takes place in floating-length intervals instead of full-time observation. This method, used to measure teamwork in four different learning environments, was reported to improve the cost-effectiveness of observation.
The least used methods were portfolios (Martínez-Mediano and Lord 2012; Wu, Huang, and Shadiev 2016) and reflections (Bursic, Shuman, and Besterfield-Sacre 2011). The portfolios consisted of a compilation of deliverables developed by students as part of their coursework that show meaningful learning. The data of portfolios were coded to demonstrate students' recognition of the need for and ability to engage in lifelong learning (Wu, Huang, and Shadiev 2016) and to measure the influence of a Moodle learning platform on students' creativity (Martínez-Mediano and Lord 2012).
Reflections involved students reflecting on and describing their learning of a competency. Portfolios and reflections, as well as observations, were used to support the results obtained by other methods, such as tests, rubrics and questionnaires. We suspect that the low frequency of these methods can be explained by the relatively large amount of time and work required of faculty members to use these instruments. The use of multiple methods was also reported in other studies present in the review. This is discussed in more detail in the next sections.

Validity and reliability
More than half of the methods presented in the 99 studies did not describe the theoretical background or research behind their metric designs. Only 39 studies (32 measurement methods) went beyond that and reported validity and reliability properties (Appendix E). Of these methods, 7 measured communication, 6 lifelong learning, 6 teamwork, and 9 innovation/creativity. Only 4 methods measured more than one competency: communication and innovation/creativity (Hernandez-Linares et al. 2015), communication and teamwork (Immekus et al. 2005; Fini and Mellat-Parast 2012), and communication, lifelong learning and teamwork (Strauss and Terenzini 2005).
On the one hand, in some studies a number of techniques were used to demonstrate validity: review of items or content from previous literature; review of experts' and students' opinions about the content of the assessment; correlations between tests which intend to measure the same construct; use of control and experimental groups; confirmatory and factor analyses; and testing of the method as a pilot study. Reliability properties relied on internal consistency and inter-rater reliability. On the other hand, validity and reliability measurements were overlooked in other studies; for example, some did not define the content being measured, which directly undermines content validity.
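Internal consistency is commonly summarised with Cronbach's alpha. The sketch below shows the calculation for a questionnaire whose items are meant to measure one competency. The use of Cronbach's alpha here, the function name and the example responses are assumptions for illustration; the reviewed studies may have used other statistics.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(responses) < 2:
        raise ValueError("At least two respondents are needed.")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("Each respondent must answer the same two or more items.")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("Total scores do not vary; alpha is undefined.")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical Likert responses: three students, two items.
    answers = [[1, 1], [2, 3], [3, 2]]
    print(round(cronbach_alpha(answers), 3))
```

Values closer to 1 indicate that the items vary together; which threshold counts as acceptable is a convention that should be justified for each instrument.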
It was also found that methods which presented reliable and valid measurements in previous studies were reused, such as Modified Strategies for Learning Questionnaire (Lord et al. 2011; Amelink et al. 2013), Abreaction Test for Evaluating Creativity (Clemente, Vieira, and Tschimmel 2016), Critical Thinking Assessment (Vila-Parrish et al. 2016), Index of Learning Styles (Waychal 2014), Torrance Tests of Creative Thinking (Shields 2007; Robbins and Kegley 2010; Wu, Huang, and Shadiev 2016), Lifelong Learning Scale (Kirby et al. 2010; Chen, Lord, and McGaughey 2013), and Self-Assessment of Problem Solving Strategies (Douglas et al. 2014). The convenience of using existing valid methods will be discussed later on.

Methods per assessment purpose
We intended to find out how the type of method could be related to the purpose of the measurement. This is important when creating or choosing a method because the design of a method may not be appropriate for a different purpose. For this reason, the distribution of the methods per measurement purpose was listed in Table 1 and the frequencies were analysed to verify what type of methods were more widespread per measurement purpose.
More than half of the studies reported on methods which were used to evaluate course and programme effectiveness to enhance the quality of teaching and student learning. The most frequent (63%) method used for this purpose was questionnaires. They were used to ask students about how the course prepares them for a competency (Baral et al. 2014; Gerhart and Carpenter 2015). Questionnaires alone, unless the sample size is large enough to have statistically significant results, are not a good practice, because they report self-perceptions which are subject to bias. However, questionnaires used in combination with other methods such as portfolios (Martínez-Mediano and Lord 2012), interviews (Dunai et al. 2015) and observations (Blanco, López-Forniés, and Zarazaga-Soria 2017) showed that the courses stimulate the development of competencies in students. For example, in the study of Martínez-Mediano and Lord (Martínez-Mediano and Lord 2012), the use of portfolios confirmed the results of the questionnaire that the intervention had improved students' ability in lifelong learning. Similarly, a combination of interviews conducted by an external researcher and questionnaires given to students indicated that the project-based learning promoted teamwork competencies (Dunai et al. 2015).
The second most frequent purpose (26%) was to assess students' performance with the purpose of giving summative grading and formative feedback. The former is used to provide student grades at the end of the curricular activity to certify students' achievements, and the latter is used to provide feedback to improve students' learning (Biggs 2003). Only 7% of the studies reported formative feedback. The results show that rubrics were the most frequent (62%) method used to grade students (Fila and Purzer 2012) and to provide formative feedback to students (Ahmed 2017). Rubrics were considered good practice for this type of measurement purpose (Fila and Purzer 2012), because they were objective checklists based on student learning outcomes that allowed assessors to grade students and to provide feedback.
The third form of measurement (11%) was aimed at measuring students' abilities in order to characterise student populations. More than half of these methods were questionnaires. For example, Strauss and Terenzini (Strauss and Terenzini 2005) aimed at assessing a large population of 4558 graduating seniors in seven engineering fields in more than one competency (e.g. communication, lifelong learning and teamwork) on a five-point Likert scale. Moreover, (Chen, Lord, and McGaughey 2013) conducted a cross-sectional study with 356 engineering students from five different fields and majors. In this study, students were asked to evaluate their abilities for lifelong learning. Self-perception questionnaires were considered an acceptable strategy (Strauss and Terenzini 2005; Chen, Lord, and McGaughey 2013) when the aim was to evaluate a large population.
Within the three purposes (assess student learning, evaluate course/programme effectiveness and characterise student abilities), a limited number of studies used qualitative methods (e.g. observations, interviews, portfolios and reflections). This limitation will be addressed in the discussion.

Measurement methods per competency
A summary of the criteria found per competency is reported below, as well as a definition formulated for each competency based on the studies included. In addition, the best measurement methods per competency are suggested. This information may assist assessment developers in the development of their own competency assessment and evaluation schemes.

Competency definitions and measuring criteria
As stated by (Shuman, Besterfield-Sacre, and McGourty 2005), the lack of consensus on the definitions of the competencies creates difficulties in their measurement process. For this reason, we were interested to investigate how the studies define the competencies under study. A lack of competency definitions in the studies was found: only 17 studies explicitly defined the competencies they were studying. A lack of definitions biases understanding when performing the measurement and hinders the replication of the studies. Since competency terms have various meanings depending on the context, it is problematic to assume that the competencies have the same meaning everywhere and do not warrant a definition. For the studies that did not provide any definition for the competencies, we decided to investigate the criteria that were used to provide clarity and measure these competencies. Although 5% of the studies did not provide any criteria to establish the competencies, using only a Likert scale to rate the self-perceived level of the undefined competencies, such as in (Moalosi, Molokwane, and Mothibedi 2012), the analysis of the 99 studies disclosed several criteria used to measure the attainment levels in the four competencies. The criteria found for each competency, their definition and the corresponding studies are listed in Appendixes A, B, C and D, respectively. In the analysis of the results, we make no distinction on the purpose of the studies as our primary interest is to evaluate the criteria used to measure the attainment levels of competencies.

Table 1. Distribution of methods with measurement purpose (1 - to evaluate course and programme effectiveness to enhance the quality of teaching and the student learning experience, 2 - to assess students' performance with the purposes of giving summative grading at the end of courses and/or providing formative feedback to students, and 3 - to measure students' abilities in order to characterise student populations) and competencies (CM - communication, LLL - lifelong learning, TW - teamwork, IC - innovation/creativity, and >C - more than one competency).
Columns: Questionnaires, Rubrics, Tests, Observations, Interviews, Multiple methods.

Communication (Appendix A)
Among the 44 studies that measured attainment levels in communication, 31 evaluated oral communication and 24 written communication. There were 16 studies that reported on both oral and written communication. Out of the 31 studies which looked at oral communication, 16 considered it as a single criterion without sub-division. The same was found for written communication (15 out of 24 studies).
A few studies were found that looked at communication criteria other than oral and written communication. These criteria included self-confidence (4), achieve/convey ideas (3), self-exposure (2), listening (2), reading (1), and client interaction (1). These criteria suggest that communication for engineers is more than just oral and written communication (Wilkins, Bernstein, and Bekki 2015). It also involves listening actively, carrying on general conversations, showing understanding by means of opinions or reactions on what is discussed, and self-exposure to conversations in order to interact with others and to build networks.
Based on the criteria listed above and the definitions found in studies such as (Immekus et al. 2005) and (Wilkins, Bernstein, and Bekki 2015), we propose to use the following definition of communication: communication is 'the ability to show understanding and to carry technical/non-technical written/oral presentations and discussions depending on the audience where the feedback loop of giving and receiving opinions, advice and reactions is constant'.
To measure communication, valid methods (Appendix E) were found. (Eichelman, Clark, and Bodnar 2015) and (Galván-Sánchez et al. 2017) used rubrics to measure student performances in demonstrating written and oral communication, respectively. Also, (Frank et al. 2015) objectively measured students' performance on written communication using two valid methods (the VALUE rubric and the CLA+). (Wilkins, Bernstein, and Bekki 2015), on the other hand, validated a test that measures not only students' self-perceived knowledge of communication skills (such as active listening, assertive self-expression, and receiving and responding to feedback) and their confidence to use these skills, but also their ability to apply these communication skills.

Lifelong learning (Appendix B)
The top five most frequently used criteria for lifelong learning competency were found to be self-reflection (17 studies), locating and scrutinizing information (16), willingness, motivation and curiosity to learn (11), creating a learning plan (10), and self-monitoring (6).
On the basis of the definitions present in the studies (Coşkun and Demirel 2010;Martínez-Mediano and Lord 2012) and the criteria found, we define lifelong learning as 'the intentional and active personal and professional learning that should take place in all stages of life, and in various contexts with the aim of improving knowledge, skills and attitudes'.
When it comes to reporting validity, one point in time self-assessment methods (Coşkun and Demirel 2010;Douglas et al. 2014) reported on validity measurements. On the other hand, EPSA (Ater Kranov et al. 2008;Ater Kranov et al. 2011;Ater Kranov et al. 2013;Schmeckpeper et al. 2014), another method that reports validity, goes beyond self-assessment and measures student performance on lifelong learning competencies during a specific task.

Teamwork (Appendix C)
For teamwork, criteria such as interacting with others (18 studies), manage team responsibility (15), team relationship (15), communicating between group members/others (9), and contribution of ideas/solutions/work (9) were found to be the top 5 most frequently used criteria. Criteria such as problem-solving and decision making (8), and encourage the group to contribute (7) were also often named. Therefore, based on these criteria and the definitions present in the studies (Immekus et al. 2005; Valdes-Vasquez and Clevenger 2015), we define teamwork as 'an interactive process between a group of individuals who are interdependent and actively work together using their own knowledge and skills to achieve common purposes and outcomes which could not be achieved independently'.
The valid methods present in the review provide some adequate examples to measure teamwork. For example, rubrics were used to assess students' teamwork in capstone courses and the correlation between faculty and teaching assistant assessors was shown (Gerlick et al. 2011). In (Bringardner et al. 2016), both pre- and post-questionnaires were carried out to consider the effect of social and process changes in the measurement of student competency. Finally, (Besterfield-Sacre et al. 2007) provided a valid behavioural observation method which, although more time- and resource-consuming, showed that teamwork was accomplished.

Innovation/Creativity (Appendix D)
From the 24 studies which looked at innovation/creativity, 7 studies referred to innovation and 17 studies reported creativity. The low number of papers studying innovation may be an indication that only a small number of curriculum elements go beyond the design process and also focus on the idea or solution implementation step; as a consequence, measuring creativity levels is often deemed enough. Both innovation and creativity measurement criteria were found to focus mainly on flexibility (15 studies), originality (13), fluency (7), elaboration (7), connection (4), and scaling information (4).
On the basis of the criteria and definitions found in the studies (Fila and Purzer 2012;Amelink et al. 2013), we propose the following definition: Innovation/Creativity is 'the ability to generate ideas and move from their design to their implementation, thereby creating solutions, products and services for existing or future needs'.
For innovation/creativity, some valid methods were reused from previous studies. One example is the Torrance Tests of Creative Thinking, which has been validated in many studies (Shields 2007; Robbins and Kegley 2010; Wu, Huang, and Shadiev 2016) but requires trained assessors and is very costly. Other valid methods reported on are the Index of Learning Styles, which measures innovation based on student preferences on a sensing/intuition scale (Waychal 2014), and the Modified Strategies for Learning Questionnaire, which measures the perceptions of student learning behaviours in innovation skills (Amelink et al. 2013). More objective methods, which measured student performance in demonstrating innovation rather than self-perceived ability, are the Abreaction Test for Evaluating Creativity used in (Clemente, Vieira, and Tschimmel 2016) and the VALUE rubric used in (Vila-Parrish et al. 2016).
While analysing the criteria used in the studies, overlaps in the four competencies studied were found. This is not part of the scope of this review, so we will not go into detail. This finding confirms, however, the need to provide a definition for the competencies under study. As the underlying criteria depend on the definition, future studies should provide both competency definitions and underlying criteria so that conflicting elements can be avoided and coherent competency measurements carried out.

Discussion
The number of studies which looked at students' transversal competencies such as communication, innovation/creativity, lifelong learning, and teamwork has grown over the last 17 years (Figure 2). This progression is likely indicative of the importance of these competencies for engineering students' success in the labour market and of their increasing integration in engineering curricula (Passow and Passow 2017).
This systematic review on competency measurement shows that measuring competency levels has become extremely important to assess student performance in courses or programmes, to certify the level of courses and curricula, and to characterise student abilities. Based on the accuracy, validity and reliability of the methods analysed, the time and cost of their implementation, and their practicality for a specific purpose, we give recommendations to aid educators and researchers to further measure competencies, in terms of the best measurement methods, the importance of competency definitions and validity and reliability properties. Also, we offer principles to be applied in the creation of reliable and valid measurement methods.

Best measurement methods for educators
To grade students and to provide feedback, we argue that it is not enough to ask students if they perceive competency improvements, because the accuracy of self-assessment has been considered poor (Ward, Gruppen, and Regehr 2002). Methods that measure students demonstrating certain competencies would be more appropriate (Besterfield-Sacre et al. 2007). Rubrics can be used as a checklist to verify whether students demonstrate the pre-defined competencies and at which level (Fila and Purzer 2012). When rubrics are objectively created and validated to measure students' behaviours, they are effective measurement methods that improve inter-marker consistency and reduce marker bias effects (Flateby and Fehr 2008; Scharf 2014; Eichelman, Clark, and Bodnar 2015). In addition, this consistency can be optimised by using more than one rater or grader and by standardising the scales according to graders' scores (Ward, Gruppen, and Regehr 2002). These techniques were proposed in (Ward, Gruppen, and Regehr 2002) as ways to mitigate the limited efficacy of self-assessment. Rubrics can also be used for large samples, as experienced by the second author of this review (Saunders-Smits and Melkert 2011). Moreover, rubrics are useful not only for summative assessment but also for providing individual feedback on the points where students need improvement. However, this form of assessment was little addressed by the studies reviewed.
Observations are an alternative measurement method that is adequate for measuring student behaviour. However, they are very time- and resource-consuming. To reduce these issues, work sampling observation as validated in (Besterfield-Sacre et al. 2007) can be a very valuable method, because it reduces the amount of observation time necessary to assess students' behaviour and is consequently less labour-intensive. We consider such behavioural measurements, when based on clear criteria, effective tools for providing summative and formative feedback. Table 2 lists a set of practical guidelines for implementation in education.
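The agreement between two graders who apply the same rubric can be quantified in several ways. The sketch below uses Cohen's kappa, which is one common choice and is our assumption here; the studies reviewed do not prescribe a particular statistic. The function name and the rubric scores are hypothetical and serve only as illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical rubric levels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of students given the same level by both graders.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, derived from each grader's own level frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[level] * freq_b[level] for level in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical level throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric levels (1-4) given to ten students by two graders.
faculty = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
assistant = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]
kappa = cohens_kappa(faculty, assistant)
print(round(kappa, 3))
```

Values close to 1 indicate agreement well beyond chance, while values near 0 suggest that the rubric levels may need clearer descriptions or that graders may need calibration.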

Best measurement methods for researchers
For researchers who wish to measure student competencies to evaluate courses or programmes, or simply to characterise a student population, we argue that questionnaires and tests that measure perceptions are adequate methods when limited time and resources are available and large samples are present. Self-report methods can be easily developed and administered, and when favourable validity and reliability properties are present, meaningful inferences can be drawn from the data analysis (Immekus et al. 2005). However, when using these methods (questionnaires or tests), we recommend the use of time triangulation by employing pre- and post-questionnaires (Waychal 2014; Gerhart and Carpenter 2015; Ngaile, Wang, and Gau 2015) or pre- and post-tests (Shields 2007; Robbins and Kegley 2010) to rectify the omission of social changes and processes caused by one-time assessment (Cohen, Manion, and Morrison 2007). Alternatively, self-assessment can be done by ranking competencies, where students have to identify their own strengths and weaknesses, which are the extremes of the scales (Ward, Gruppen, and Regehr 2002). This method was not used in any study of this review, but we recommend it because it increases the accuracy of judging one's own performance, which has been a great concern in the literature (Ward, Gruppen, and Regehr 2002; Eva and Regehr 2005). It is also considered ideal for supporting students' self-directed learning and for giving formative feedback (Ward, Gruppen, and Regehr 2002; Eva and Regehr 2005).
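A pre- and post-questionnaire design can be summarised by comparing each student's two scores. The following is a minimal sketch, assuming paired scores on the same scale for the same students; the paired-samples t statistic is our illustrative choice and is not taken from the studies reviewed. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_gain(pre, post):
    """Return the mean gain and the paired-samples t statistic."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need at least two paired scores")
    # Per-student change between the two measurement moments.
    gains = [after - before for before, after in zip(pre, post)]
    mean_gain = mean(gains)
    spread = stdev(gains)
    # t = mean gain divided by its standard error; undefined if all gains are equal.
    t_stat = None if spread == 0 else mean_gain / (spread / sqrt(len(gains)))
    return mean_gain, t_stat


# Hypothetical self-reported teamwork scores (1-5) before and after a course.
pre_scores = [2, 3, 3, 4, 2, 3]
post_scores = [3, 4, 3, 5, 3, 4]
mean_gain, t_stat = paired_gain(pre_scores, post_scores)
print(round(mean_gain, 3), round(t_stat, 3))
```

Interpreting the t statistic requires the matching degrees of freedom (number of pairs minus one) and the usual assumptions of the paired test, which should be checked for the data at hand.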
Another strategy to increase validity in the case of self-perceptions is the use of multiple methods to measure the full umbrella of criteria of one or more competencies. The advantage of this is that combining different methods yields the most comprehensive information from different perspectives and a more complete understanding of the research problem (Creswell and Clark 2007). Studies in this review (Barnes, Dyrenfurth, and Newton 2012; Amelink et al. 2013; Eichelman, Clark, and Bodnar 2015) suggested that the content validity of the results of the assessment increased because the results from different methods could be compared, explained and verified, and because the strengths of the methods could be drawn on and their weaknesses minimised. For example, the use of rubrics alongside interviews benefits from the individual power of each: the rubric with described levels guides the assessor and reduces inconsistencies in the assessment because the measurement criteria are clear, delimited and objective, while the interviews offer more comprehensive information about students' competency development and, being more flexible, yield richer details (Eichelman, Clark, and Bodnar 2015). Alternatively, researchers could employ a combination of questionnaires, which are straightforward and require little administration, with observations, which provide in situ data from the situations which are taking place (Amelink et al. 2013). Guidelines for researchers to create reliable and valid measurement methods are listed in Table 2.
At present, published work on competency measurement tends to rely heavily on course evaluation only, and we were unable to find any longitudinal studies in which students were followed in the years after completion of those courses or even after graduation. In future, educational researchers could consider using, if ethical boards allow and willing participants are found, methods such as portfolios or interviews to perform longitudinal studies, collecting data from the same group of students at different points in their lives, thus following the students' competency improvement during their time at their institution and ideally also after graduation in their working life.

Table 2. Guidelines for educators and researchers.

For educators:
[...] (Fila and Purzer 2012).
4) Standardise scales/checklists, i.e. create familiarity with the levels/dimensions of the scales and rescale them based on graders' assessment scores (Ward, Gruppen, and Regehr 2002).
5) Use more than one grader (Ward, Gruppen, and Regehr 2002).
6) Analyse the level of agreement between the graders by testing inter-rater reliability (Cohen, Manion, and Morrison 2007).
7) When using self or peer assessment questionnaires, ask students which aspects need the most and the least improvement (Ward, Gruppen, and Regehr 2002).

For researchers:
[...]
3) When measuring learning or growth, measure student performance on a competency before and after instruction (Cohen, Manion, and Morrison 2007) or ask students for extremes: what they learned the most and the least (Ward, Gruppen, and Regehr 2002).
4) Analyse the reliability and validity properties of the measurement to evaluate both the accuracy of the method and whether the method measures what it intends to measure (Cohen, Manion, and Morrison 2007).
5) Use multiple methods when corroboration, elaboration, clarification and expansion of the results are needed (Creswell and Clark 2007).

Importance of definitions and validity
We observe that some studies made an effort to develop valid competency measurements. Some described competencies based on literature, industry and student feedback; others used multiple methods to improve content validity or conducted factor analyses to increase construct validity. In addition, some studies used existing validated measurement methods. Choosing existing valid and reliable instruments may be a helpful option for assessment developers and instructors who want to measure competencies in students. However, learning outcomes, competencies and course or programme settings should be carefully considered and compared to the conditions of the existing studies to ensure their applicability. Re-evaluation of validity and reliability is still necessary when an instrument is implemented in a new situation (Cohen, Manion, and Morrison 2007). Although robust methods were found, some studies did not define the content being measured and therefore overlooked content validity. Lack of consensus on the definition of the transversal competencies was a cause of difficulties in the process of competency measurement (Shuman, Besterfield-Sacre, and McGourty 2005). Likewise, the lack of definitions may hinder the measurement of competencies. In this literature review, 83% of the studies identified and included did not present a definition of the assessed transversal competencies, and 5% did not provide any criteria to establish the competencies. What were the perceived definitions of students or instructors when using these methods without definitions or descriptions? It is possible and acceptable that the definitions of competencies determined by different entities could be different. However, it should be clear to all involved parties what the definitions of the terms used are. Only with clear definitions and descriptions can measurement of competency attainment levels be understandable and valuable.
Overall, competency level measurement would benefit from better method design and validity evidence. The only way to ensure that the results obtained from the competency measurements are accurate and can be properly interpreted is through a clearly described assessment design and by carrying out validity and reliability measurements. Only 39 studies had methods that consistently measured reliability and validity. This means that accurate results can be extracted from only 39 out of the 99 studies. Validity and reliability measurements give both researchers and educators feedback on whether methods measure the initially proposed concept and allow them to engage in subsequent revision and improvement of the measurement methods.
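As an illustration of what a reliability measurement can look like for a multi-item questionnaire, the sketch below computes Cronbach's alpha, a widely used internal-consistency coefficient. Its use here is our assumption: the review does not single out this statistic, and the function name and responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one list of item scores per student."""
    n_items = len(responses[0])
    if len(responses) < 2 or n_items < 2:
        raise ValueError("need at least two students and two items")
    if any(len(row) != n_items for row in responses):
        raise ValueError("every student must answer every item")
    # Sample variance of each item across students.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of the students' total scores.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers of five students to three Likert items (1-5).
answers = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]
alpha = cronbach_alpha(answers)
print(round(alpha, 3))
```

A coefficient of this kind speaks only to the internal consistency of the scale; it does not replace evidence of content or construct validity.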

Conclusion, limitations and recommendations
This systematic review set out to inform the creation of reliable and valid measures of student mastery of competencies in communication, lifelong learning, innovation/creativity and teamwork in engineering education. We analysed measurement methods of 99 studies published in the last 17 years. This review described the type of methods that measure the four previously mentioned competencies, and their advantages and disadvantages, and validity and reliability properties based on the studies analysed. From the analysis of these findings, the best methods per purpose and competency are presented. Additionally, a definition for each competency and its underlying criteria are reported to assist assessment developers in the design of their own competency assessment and evaluation schemes.
The current studies that measure competencies show some limitations regarding competency definitions and validity and reliability measurements. The analysis showed that a large number of studies lack a clear definition of the selected competency. Based on these issues, we shed light on the importance of providing clear definitions and underlying criteria for the competencies under study. As such, we created a clear definition for each competency.
Moreover, fewer than half of the studies presented evidence of validity and reliability measurements. This result shows a clear need to set professional standards when measuring competencies and indicates that future studies should report on validity and reliability measurements.
Questionnaires and rubrics were the methods most used to measure these competencies. We argue that both are adequate methods when properly validated with the techniques present in this review. Questionnaires, applied in the form of pre- and post-questionnaires, are particularly useful for assessors/researchers to evaluate course or programme effectiveness and characterise students' abilities in the presence of large student populations. This review also showed the usefulness of combining methods (particularly questionnaires with interviews or observations) to increase the validity of the studies. As such, researchers are encouraged to use multiple methods when evaluating the effectiveness of courses or programmes to stimulate student competencies, and when characterising students' abilities.
On the other hand, rubrics benefit evaluators in the grading and feedback processes for both small and large populations when their scales are clearly defined according to course learning outcomes. Questionnaires that ask students which aspects need the most and the least improvement are also a good practice. Alternatives are observations, portfolios and reflections; however, they are labour-intensive and more time-consuming.
While this review shows a global concern and effort in engineering education to measure competencies in communication, teamwork, lifelong learning and innovation, engineering educators and future researchers should redouble their efforts to provide competency definitions and validate their measurement methods. We believe that time, energy and cost are undesirable limiting factors, but other issues such as lack of expertise and accuracy in the design and implementation of the measurement tool must be overcome. It may be worth investigating why only a few studies provide explicit transversal competency measurement instruments. This may help improve the field of engineering education in the area of competency measurement.
A potential limitation of this systematic review is that relevant papers might have been left out because we might have excluded alternative terms used to name the four competencies. The review was also limited to engineering students, three databases and the past 17 years. It may be worthwhile in future endeavours to expand the review to the fields of science, technology and mathematics, to other databases, and possibly to papers published before 2000.
The PREFER project will use the lessons learned in this review to create a measurement tool that measures students' mastery levels in courses. A valid and reliable method will be designed. We will define competencies and their subcomponents to address the full extent of each competency, perform confirmatory and exploratory factor analyses, and test the internal structure of the measurement scales. The outcomes of the tool will be triangulated using student reflections, as this is the most feasible method within the scope of our project.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Mariana Leandro Cruz received BSc and MSc degrees in Biomedical Engineering from the Instituto Superior Técnico, University of Lisbon. She is currently pursuing a PhD in engineering education at Delft University of Technology, Faculty of Aerospace Engineering in the Netherlands. Her research interests include engineering education, competencies, competency measurement, and course development.
Dr. Gillian N. Saunders-Smits is a Senior Lecturer/Associate Professor at the Aerospace Engineering faculty of Delft University of Technology. She has a Master's degree in Aerospace Engineering from Delft University of Technology, Faculty of Aerospace Engineering in the Netherlands and a PhD in Aerospace Engineering Education from the same institute. She has extensive teaching experience as well as broad experience in curriculum development (BSc and MSc) and educational leadership. She has served as the Faculty's online education coordinator and actively develops online courses and platforms for open learning. In 2017, she was appointed board member of the Steering Committee of the Société Européenne pour la Formation des Ingénieurs (SEFI).
Prof. Dr. Pim Groen is the professor of SMART Materials at the Aerospace Engineering faculty of Delft University of Technology and the Programme Manager of Holst Centre, TNO, the Netherlands. He studied chemistry and obtained a PhD in Materials Science from the University of Leiden in the Netherlands. He has extensive experience as a scientist, and since 2012 he has been a part-time professor, sharing his scientific knowledge and industry experience with Aerospace Engineering students at Delft University of Technology.