Strategies for Researching Programs’ Impact on Capability: A Scoping Review

ABSTRACT Researchers seeking to assess the impact of a program on the capability of its target audience face numerous methodological challenges. The purpose of our review was to see to what extent such challenges are recognised, what choices researchers made to address them, and why. We identified 3354 studies by searching five databases and cross-checking references from selected studies. A total of 71 studies met our pre-defined selection criteria: empirical studies reporting data on how interventions impacted the beneficiaries' capability, providing sufficient detail on how impact was measured, and written in English. Four independent raters assessed those studies on four domains: descriptive information, consideration of causal attribution, operationalisation of capability, and interpretation of findings. Challenges related to capability impact assessment were not widely acknowledged explicitly, and available measures to address these challenges were not used routinely. Major weaknesses included little attention to causal attribution, infrequent justification of the specific content of capability, and failure to research the constitutive elements of capability and their interactions. Research into a program's impact on the capability of its recipients is challenging for several reasons, but options are available to further improve the quality of this type of research.


Introduction
The capability approach (CA) is a normative-empirical framework which asserts that human well-being should primarily be assessed in terms of capabilities, or the real freedoms people have to be and do things they have reason to value (Sen 1992, 1999). The approach was developed as an alternative to proxy measures of well-being such as utility, commodity holdings, or liberties (Van Staveren 2008; Venkatapuram 2011). Problems associated with these measures include the phenomenon of adaptive preferences (or response shift) (Elster 1982; Festinger 1957; Sen 1987; Teschl and Comim 2005); some measures also disregard whether and how individuals can transform their possessions into something of value to them, or ignore the contextual conditions that produce inequalities in well-being (so-called conversion factors; Sen 1992). In contrast, the CA recognises and integrates the process of personal and social construction of capabilities, and identifies the importance of an objective perspective in assessing deprivations to counter subjective reporting biases (Alkire 2002).
Taken together, these critiques and positive arguments from the CA suggest that the impact of programs and interventions that aim to increase well-being is also best assessed through their impact on the target audience's individual or collective capabilities (Ibrahim 2006; Keeley et al. 2015; Lorgelly et al. 2015; Mitchell et al. 2017; Nussbaum 2011; Simon et al. 2013). Thus, the pressing question is how best to do that. The objective of this paper was to identify to what extent researchers evaluating the impact of programs or interventions on capability recognise the accompanying challenges, how they address these challenges, and how they support their interpretations. We followed the framework by Arksey and O'Malley (2005) for scoping reviews to identify, select and review relevant studies.
The key finding is that, although there are promising studies reporting capability impact assessments, many challenges remain in conducting such assessments and in reporting clearly on the impact of programs on recipients' capability.

Challenges Associated with Assessing the Impact of Interventions on Recipients' Capability
Clearly, (monitoring and) assessing the impact of a program or intervention on the capability of its target audience requires some form of operationalisation of the capability concept. Some of the difficulties associated with this task have been recognised previously (Robeyns 2003, 2005). Here, we will focus on four such operational challenges associated with assessing the impact of interventions or programs on recipients' capability:

1. Identification of the content of capability, relevant to the particular context: what is it that people should be able to be or do?
2. Establishing whether people are able to do or be something when, in fact, they are not doing or being such things. In other words, how can one plausibly establish whether people have (or don't have) "real freedoms"?
3. Causal attribution: if there are indications that a target audience's capability has expanded after an intervention, can this be confidently ascribed to the program or intervention under study?
4. Determining an appropriate time frame for such studies: how much time would it realistically take for a target audience's capability to expand in a meaningful way, taking into account relevant contextual characteristics?

In the following discussion, we will briefly expand on each of these challenges.

Deciding on Capability's Relevant Content
Assessment of an intervention's impact on the capability of its intended beneficiaries requires that the content of capability is specified. For example, for a community-level intervention, what beings and doings are considered so valuable that they should, to a sufficient degree, be accessible or realisable for all members of the community? Broadly speaking, two different answers have been given as to how such content may be established (Claassen 2011). Sen (1992) holds that a public deliberative process is needed to achieve this end. In contrast, others (most notably Nussbaum 2000) have argued that core capabilities may be identified that are foundational, or of generic relevance and validity to all human beings. Combinations of these two approaches have also been suggested (e.g. Burchardt and Vizard 2011). Without taking sides in this debate, the important point here is that researchers assessing a program's impact on one or more capabilities should specify their content and indicate how it was established.

Assessing People's Freedom
A second challenge associated with capability impact assessment is that capability refers to the real opportunities that people have to realise valuable doings and beings. In this sense, capability is a measure of freedom or real possibilities, not of mere achievements, and measuring opportunities is difficult. Furthermore, there is the issue of whether the things that people end up doing and being in their lives can be considered a result of their own choosing, or are largely outcomes of structures and constraints imposed upon them. The CA distinguishes between capabilities on the one hand and displayed modes of doing and being on the other: non-achievement of certain modes of being may be a result of a person's own choosing and not of actual constraints on his or her capabilities. Hence, establishing that more people display valued modes of doing and being (or display them to a greater degree than before), difficult as that may be, is not an adequate indicator of capability expansion in the sense valued by the CA. There are several options for capability impact researchers to address the challenge of assessing this freedom space.
Firstly, they may rely on self-reports from research subjects. Using questionnaires or interviews, subjects are asked to report whether, in what way, and to what extent their real opportunities for realising specific valued doings and beings have increased since the onset of the intervention of interest. A possible drawback of this approach is that people may either over- or underestimate their own capability. Strictly speaking, this approach is more likely to capture people's views on (changes in) their self-efficacy (Bandura 1978) than their capability.
A second approach would be to query subjects about their actual functionings, or to observe their behavior in their daily settings. The reasoning is that functioning, when present, implies pre-existing capability. A drawback associated with this approach, similar to the above, is that non-achievement need not imply absence of capability but may reflect a choice not to realise it. This raises further questions about what factors impacted such choices, and whether those choices were truly free.
A third approach that capability impact researchers can take is to identify and explore the possibility conditions for specific capabilities in a particular context (Robeyns 2005). When, for instance, cycling has been identified as a specific type of doing that people ought to be capable of, possibility conditions would include the presence of certain motor and sensory capacities on the part of an individual, having a bicycle at one's disposal, the presence of safe cycle paths, etc. To assess the impact of a program on a target audience's capability, researchers would rely not so much on subjects' reports, nor on some sort of participant observation, but on establishing to what extent such possibility conditions are met to a greater extent than before the program's rollout. A drawback of this approach is that it relies heavily on presumed relations between theoretical possibility conditions and capability which, in reality, may not hold.
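To make this third approach concrete, the sketch below (our own illustration in Python; the conditions and percentages are entirely hypothetical and do not come from any study in this review) encodes the cycling example as a set of possibility conditions whose fulfilment is compared before and after a program's rollout:

```python
# Hypothetical sketch of the possibility-conditions approach: compare
# the share of the target population for whom each condition is met
# before and after the program. All values are invented for illustration.

possibility_conditions = [
    "motor and sensory capacity",
    "bicycle at one's disposal",
    "safe cycle paths present",
]

before = {"motor and sensory capacity": 0.90,
          "bicycle at one's disposal": 0.40,
          "safe cycle paths present": 0.20}
after  = {"motor and sensory capacity": 0.90,
          "bicycle at one's disposal": 0.70,
          "safe cycle paths present": 0.60}

for condition in possibility_conditions:
    change = after[condition] - before[condition]
    print(f"{condition}: {before[condition]:.0%} -> {after[condition]:.0%} ({change:+.0%})")

# As noted above, this assumes that the theorised link between the
# conditions and the capability actually holds, which may not be true.
```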
These three approaches for assessing people's real freedoms need not, of course, exclude each other. Indeed, combining two or all three approaches may work to achieve triangulation (Wolff and De-Shalit 2007).

Causal Attribution
Even when there are indications that the unfolding program or intervention is associated with the strengthening of recipients' capability of the sort that was hoped for, it may be uncertain whether the changes were directly or wholly caused by the intervention. This is not specific to the assessment of capability, of course. Alternative explanations could include for instance maturation (or, more generally, changes that would have occurred, independently of the intervention), confounding, bias, or natural variation (Marsden and Torgerson 2012). Several solutions exist to rule out such alternative explanations for observed changes, either in the initial design of the study or the analysis of the data (e.g. Rogers 2014). These include, but are not confined to, comparative research and random assignment of research subjects to experimental or control conditions, blinding of one or more of the parties to the allocation, and statistical analysis of the findings. Such designs or data production may not, however, always be feasible, or appropriate, in the context of assessing interventions aiming for capability expansion (e.g. Black 1996). In such instances, researchers can adopt various other strategies to strengthen claims of causality (Maxwell 2004).
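To illustrate the logic of such comparative designs, the sketch below (a minimal Python example with entirely invented capability scores; nothing here is drawn from the studies reviewed) shows how pre-post measurements in an intervention group and a control group can be combined into a difference-in-differences estimate, which nets out changes that would have occurred anyway:

```python
# Minimal difference-in-differences sketch with hypothetical data.
# The scores stand in for some operationalised capability measure.

def mean(xs):
    return sum(xs) / len(xs)

# Invented pre/post capability scores (0-10 scale) for illustration.
intervention_pre  = [4.1, 3.8, 5.0, 4.4]
intervention_post = [6.2, 5.9, 6.8, 6.0]
control_pre       = [4.0, 4.3, 4.9, 4.2]
control_post      = [4.6, 4.8, 5.3, 4.9]

# Change within each group ...
change_intervention = mean(intervention_post) - mean(intervention_pre)
change_control      = mean(control_post) - mean(control_pre)

# ... and the difference-in-differences estimate of the program effect,
# which removes maturation and other changes common to both groups.
did_estimate = change_intervention - change_control
print(f"Intervention change: {change_intervention:+.2f}")
print(f"Control change:      {change_control:+.2f}")
print(f"DiD estimate:        {did_estimate:+.2f}")
```

The point is not the arithmetic but the design: without the control group, the change observed in the intervention group could not be separated from maturation or natural variation.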

Time Horizon
Finally, capability expansion may take some time to materialise and, conversely, changes in capability, once achieved, may not always be sustained or robust over time. Although this is not unique to capability, it is a highly relevant challenge for capability interventions because of the special character of capabilities. Therefore, the timing of the assessment of capability impact after the intervention has unfolded can be critical, and it may be necessary to conduct assessments at multiple moments over time (Mayne 2008).
To put this review in context, we wish to highlight two recent publications that provide excellent reviews of a number of capability measurement instruments (i.e. ICECAP, ASCOT, OCAP, OxCap, and ACQ-CMH) and their use in the context of economic evaluation (Helter et al. 2019; Proud, McLoughlin, and Kinghorn 2019). Whereas those reviews consider health-related capability measurement instruments and their ability to capture the outcomes of value, our review is complementary to that health-economics-focused work: it is interested in all approaches to measuring capability impacts, across all domains of well-being.

Methods
We used the framework of Arksey and O'Malley (2005) for conducting scoping reviews. This framework prescribes identifying the research question of the scoping review, identifying relevant studies, selecting relevant studies, charting the data, and reporting results. Each of the steps is discussed below.

Identifying the Research Question
Researchers who seek to assess the impact of a program or intervention on the capability of its target audience are faced with a number of specific methodological challenges. The purpose of our review is to see to what extent such challenges are recognised by researchers and, if so, what choices researchers made in order to address them, and how these choices were justified.

Identifying Relevant Studies
Developing a search strategy for identifying relevant studies was challenging for the following reasons: firstly, the term "capability" has a general meaning, not necessarily referring to Sen's concept; secondly, capability is a generic concept of well-being, used in a wide range of domains, including health care, education, housing policy, employment, and development aid; thirdly, reports of capability impact studies can be found in a wide range of bibliographic databases, and in the form of journal papers, book chapters, books and reports.
For these reasons, a search strategy to identify potentially relevant studies was developed inductively. We started by identifying ten relevant and diverse studies through manual searching; the characteristics of these studies are presented in Appendix A. Candidate search terms and databases were derived from this seeding set, and the resulting search strategy was required to retrieve each of these ten manually identified studies. The search strategy is available on request from the corresponding author. It covered five databases (PubMed, SCOPUS, Sociological Abstracts, International Bibliography of the Social Sciences, and EconLit). Boolean operators (e.g. AND, OR, and NOT) were adapted as appropriate for each database. The search strategy as used in PubMed is presented in Box 1. The search was completed by cross-checking references from selected studies. We collected and managed the studies in the reference management program EndNote version X9 in September 2020.
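Purely for illustration (this is not the review's actual Box 1 strategy, and all terms below are invented for the example), such a Boolean query might be assembled and then adapted per database along these lines:

```python
# Hypothetical sketch of assembling a Boolean search string. The term
# groups are invented and do not reproduce the review's Box 1 strategy.

capability_terms = ['"capability approach"', '"capabilities approach"', 'Sen']
impact_terms = ['impact', 'evaluation', 'assessment', 'intervention']

def boolean_query(groups):
    """OR the terms within each group, then AND the groups together."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)

print(boolean_query([capability_terms, impact_terms]))
# ("capability approach" OR "capabilities approach" OR Sen)
#   AND (impact OR evaluation OR assessment OR intervention)
```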

Selecting Relevant Studies
We included studies if they met the following inclusion criteria:

- an empirical study;
- reporting data on how specific interventions or programs had impacted the capability (as "the real freedoms people have to be and do things they have reason to value") of the program's beneficiaries;
- providing sufficient detail on how the impact of an intervention was measured (in the paper itself or in appropriately referenced papers);
- written in English.
Although the interventions or programs in the included studies did not have to be specifically designed to impact capability or well-being, the authors did have to claim an impact on the recipients' capability for a study to be included. We did not exclude studies based on year of publication.

Charting the Data
Charting consisted of extracting and summarising relevant characteristics and data from the individual studies, taking into account the four types of challenges identified above as being associated with capability impact assessment. We developed a checklist (Appendix B) iteratively, with feedback from all four reviewers on the initial identification of relevant studies. For critical appraisal of causal attribution, we adopted criteria appropriate for quantitative (Shadish, Cook, and Campbell 2002) or qualitative research (Giacomini and Cook 2000). The checklist guided reviewers in providing answers in the following four domains:

- Descriptive information. This included questions on how the target population was disadvantaged and what the disadvantaged target group should be able to do (the maintained norm), according to the authors.
- Consideration of causal attribution in the research design: questions derived from quantitative causal attribution theories on prospective or retrospective assessment, control group, randomisation, and blinding.
- The operationalisation of capability. Here, reviewers could describe the design (time horizon; quantitative, qualitative, or mixed methods) and the capability approach elements (resources, conversion factors, and functionings) included.
- Discussion of the interpretation of findings. This part involved a critical review of the reported hypotheses, outcomes, and conclusions.
For every included study, two reviewers independently completed the checklist. In cases of inconsistency, the reviewers discussed the study in order to reach consensus. Although each of the reviewers is familiar with the capability approach, their backgrounds differ, with a focus on capability in relation to health and healthcare (JM), research methodology (BB), impact assessment (GJvdW), and capability in relation to disability (WR), respectively.
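As a minimal sketch of how such double rating can be monitored (the checklist items and answers below are hypothetical; the review itself reports only that 28% of studies required further discussion), raw agreement between two raters on the closed questions could be computed as follows:

```python
# Hypothetical sketch: raw agreement between two independent raters on
# a closed-question checklist. Item names and answers are invented.

rater_a = {"control_group": "no", "randomised": "no", "prospective": "yes"}
rater_b = {"control_group": "no", "randomised": "cannot judge", "prospective": "yes"}

agreed = sum(rater_a[item] == rater_b[item] for item in rater_a)
print(f"Raw agreement: {agreed / len(rater_a):.0%}")
# Items on which the raters disagree are flagged for consensus discussion.
```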

In- and Exclusion of Literature
We summarised findings in tables and graphs; a complete synthesis (based on consensus achieved between reviewers) of all items per individual study is presented in Appendix C. The literature search produced 3354 references (see Figure 1).
After removing duplicates, we screened 2289 unique studies for English language and for use of the term "capability" in the sense of the capability approach. Titles and abstracts of the remaining 954 studies were screened for empirical design and study objective (assessment of the impact of a specific program or intervention on a target audience's capability). After full-text screening of the remaining 140 studies, we assessed 71 studies using the checklist we had developed for the review (Appendix B). For 20 studies (28%), further discussion among reviewers was needed before full consensus could be reached.
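The screening funnel can be summarised as follows (a sketch that simply re-encodes the counts reported above; the stage labels are our paraphrases):

```python
# Re-encoding of the screening funnel described in the text; the counts
# are those reported above, the stage labels are paraphrases.

stages = [
    ("references retrieved", 3354),
    ("unique studies after de-duplication", 2289),
    ("retained after language / capability-approach screen", 954),
    ("retained after title-abstract screen", 140),
    ("included after full-text screen", 71),
]

for (label, n), (_, n_next) in zip(stages, stages[1:]):
    print(f"{label}: {n} (excluded at next stage: {n - n_next})")
print(f"{stages[-1][0]}: {stages[-1][1]}")
```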

Descriptive Information
The checklist consisted of both open questions (e.g. "What is the target population?") and closed questions (e.g. a multiple-choice question asking "Which capability elements were assessed?"). We summarise the answers to the closed questions in Table 1 and present more elaborate details on the answers to the open questions verbatim.
The majority of the 71 studies had a qualitative component in their research design: 52 studies used exclusively qualitative methods, such as interviews, while 17 studies mixed qualitative and quantitative methods. Only 2 studies used a purely quantitative research design. In 25 of the 71 studies (35%), the target population of the intervention consisted of children or adolescents; in 17 studies (24%), the target group consisted of women. Other prevalent characteristics of target populations were indications of poverty (15 studies) and disability (6 studies). The sample size varied from a minimum of 3 participants (a case study employing participant observation as its research method) to a maximum of 2540 (a survey), with a median of 37.5. In 63 of the studies (89%), the authors provided arguments and/or data to support the notion that, at the start of the study, the capability of the target population was constrained.
Of the 71 studies, 24 reported on interventions related to development aid, 20 on education-related interventions, 11 on unemployment programs, and 9 on health-related interventions. The 7 other studies reported on interventions related to decision making, sports, or sociability, or on multiple domains. Studies were conducted across the globe, with India (8) and South Africa (7) as the most frequent sites; 30 countries each yielded a single capability impact study.

Table 1. Summary of answers to the closed checklist questions (n = 71).

Descriptive information
- Evidence that the target population's capability was constrained: yes (63), maybe (6), no (2)
- Country: India (8), South Africa (7), Nigeria (4), United Kingdom (4), Mexico (3), Scotland (3), other (36)

Consideration of causal attribution in research design
- Pre-post assessment: no (66), yes (5)
- Control group: no (58), yes (13)
- Random assignment of subjects: no (70), cannot be judged properly (1)
- Blinded: not blinded (71)

The operationalisation of capability
- Capability elements assessed: functionings (62), resources (52), conversion factors (52)
- Assessment of interaction between capability elements: yes (44), no (26), cannot be judged properly (1)
- Type of study: qualitative (52), mixed methods (17), quantitative (2)
- Source of information on capability impact: subjects/self-reported (70), person(s) close to subjects (31), document analysis (15), statistics (8), researchers (7)
- Retrospective vs prospective: retrospective (66), prospective (5)

Discussion of the interpretation of findings
- Authors' conclusions regarding the impact of the intervention on participants' capability: positive (37), mixed (30), negative (1), unclear (3)

Causal Attribution of Observed Impact of Intervention on Capability
No studies blinded either respondents or researchers, or randomly assigned their subjects to research groups. Approximately one in five studies included a control group, while five of the 71 studies adopted a before-after design. Over half of the studies (38) did not clearly report the time that had passed since the start of the intervention; in 12 studies, the assessment took place while the intervention was still ongoing. Of the remaining 21 studies, the shortest follow-up time was 1 month and the longest 18 years; the median follow-up time was two years.

The Operationalisation of Capability
The studies varied considerably in terms of the content of capability (i.e. what is it that the target audience ought to be able to do or be?), as well as the way this focus was established. Broadly, the studies fall into three types. The first type consists of studies that used capability lists published in the literature, including those developed by Nussbaum (2000), Ibrahim and Alkire (2007), Powell and McGrath (2014), and Gigler (2014); we identified eight studies using this approach. Secondly, 18 studies focused on specific doings or beings that were considered of value to the program's target audience and in which relevant inequalities were presumed to exist. Examples include being able to hear, being able to be employed, being able to access electricity, and being able to have financial security. In general, no further justification was provided for the selection of those doings or beings as the basis for capability impact assessment (e.g. as an outcome of some deliberative process, as suggested by Sen, or on the basis of some list of capabilities considered applicable in the relevant context). The remaining 45 studies assessed an intervention and investigated changes in specific endpoints; when changes materialised, these were interpreted in terms of an expansion or contraction of capability.
The majority of the studies (n = 61) aimed to provide insight into changes over time (putatively caused by the intervention under study) in multiple components of capability, i.e. resources, conversion factors and functionings. Of the 71 studies, 44 reported on an interaction between those capability components. Such postulated interaction was, however, often derived from informal observations by respondents or the researchers themselves.

Researchers' Interpretations of Findings as Evidence of Impact on Recipients' Capability
Particularly in the case of an abstract concept such as capability, researchers have to find ways to demonstrate whether their findings can, in fact, be interpreted as evidence of changes (or lack thereof) in capability. When instruments such as ICECAP or OxCap are being used, this critically hinges on evidence of the validity of such instruments (Helter et al. 2019). In other cases (the subject of this review), researchers need to find other ways to support the credibility of their interpretation of research findings. Here, we will briefly summarise two strategies that were observed in multiple studies.

Triangulation
Particularly when design measures such as randomisation or blinding are considered infeasible, inappropriate, or unethical, triangulation may help researchers ascribe observed outcomes to an intervention with greater confidence (Hammerton and Munafò 2021). An example of such triangulation is the study by Alkire (2002). In this study, the seven basic goods as developed by John Finnis (1980) were used to further specify the capability concept. Interviews were held with beneficiaries of a development program, asking them to reflect on how, in retrospect, their lives had changed (if at all) since the start of the program. If certain basic goods did not appear in the interview, researchers asked questions to probe whether, in those dimensions, things had in fact not significantly changed.
Multiple researchers independently analysed the records of these interviews to see whether self-reported changes could be related to any of the seven basic goods. Results were presented in tables listing the basic goods and quotes from interviews that the researchers interpreted as evidence of changes in these basic goods. Through the quotes, readers can develop a concrete picture of the changes that interviewees experienced and reported as associated with the deployment of the program. Readers can also judge whether they accept the interpretation of the reported changes as evidence of change in the basic good concerned. Apart from the use of extensive quotes from interviews, Alkire corroborated her findings by conducting participant observations and by collecting data on changes in resources and conversion factors, a strategy that may be denoted as strong triangulation (Wolff and De-Shalit 2007).
Another example is the study by Lindeman (2014), who defined capability as "the integration of abilities, means and opportunities to reach desired wellbeing". In her study she followed recipients of a low-cost housing project in Tanzania, collecting photos, notes, and memos. In addition, she conducted in-depth and shorter interviews.

Use of a Specific Framework or Theory of Change
A second strategy that we found in multiple studies is the use of a specific framework or theory of change to capture capability and how it came about. Biggeri and Ferrannini (2014) proposed the opportunity gap (O-gap) analysis, a framework that emphasises feedback loops between capability elements over time. Mink et al. (2015) developed the Opportunity Detection Kit, a framework meant for product design, and used it to evaluate the impact of a cooking stove in rural South India. A third example is the choice framework used by Kleine (2010) in a qualitative assessment of the impact of ICTs on disadvantaged micro-entrepreneurs in Chile. Her choice framework, based on the capability approach, considers structure, agency, degrees of empowerment, and development outcomes. It provided comprehensive evidence for the structural social barriers and personal factors that may limit or promote desired capabilities.

Discussion
The capability approach has attracted a vast amount of scholarly attention since the early 1980s, when it was first proposed. Many interventions and programs that address disadvantage and well-being can be conceived as having as their ultimate goal helping people to develop or protect their capabilities. Hence, it stands to reason that researchers choose to evaluate such programs for the impact they have on their target audience's capability. The challenges associated with such a task have been well recognised (Burchardt and Vizard 2011; Chiappero-Martinetti and Roche 2009; Hollywood et al. 2012; Leßmann 2012; Mitchell et al. 2017). The results of this review indicate that these challenges are not widely acknowledged explicitly in the field of capability impact assessment, and that measures to address them are not routinely used.
Our starting point was that researchers who wish to explore a program's impact on the capability of its target audience may be expected to pay attention to the following aspects:

1. provide evidence or reasons why it is reasonable to assume that, prior to the deployment of the program, the target audience's capability was unduly constrained in one way or another, resulting in some type of inequity;
2. make explicit the standard by which this is the case: what is it that members of the target audience should be able to do or be, in what way and at what level (in other words, what is the content of capability deemed appropriate for the relevant context?);
3. take measures that enhance the credibility of the causal attribution of the findings: if findings suggest that capability has changed, is it reasonable to assume that such change was, in fact, brought about by the program under study?;
4. justify the time frame of the study: given that capability emerges from complex interactions between resources, conversion factors and functionings, its development may take considerable time; the question is therefore whether the researchers allowed sufficient time for capability to develop.
In the following, we will discuss each of these aspects in more detail. We will close by discussing to what extent the findings of our review should prompt us to reconsider, at least partly, the criteria that we have used to appraise this specific body of literature.

Evidence of Capability Deprivation, the Standard, and its Justification
When the value of some program is inferred from its impact on capability, it is generally implied that, prior to deployment of the program, the capability of its target audience was unduly compromised in one way or another. The aim of the assessment is to see whether the program can, at least to some extent, remediate this. The questions are, then: how do we know that capability is compromised, what standard is being employed here, and where does it come from? Each of these three questions merits discussion or clarification, so that those who learn about the results of a capability impact assessment can put them in perspective.
We found in our review that such questions are discussed only to a limited extent: only about one-third of the studies proposed a standard, and these generally offered little justification. Of the three questions, the issue of justification is perhaps the most challenging. As mentioned in the introduction, there appear to be two schools of thought on how the content of capability (what is it, exactly, that people in a specific context ought to be able to do or be?) is to be established. On the one hand, lists have been drawn up containing broad categories of doings and beings that are considered universally valid. On the other hand, Sen has always wanted to stay away from such, in his view probably overly prescriptive or overly specified, lists. Instead, he preferred that some deliberative process be used to decide on the content of capability deemed appropriate for the participants' context. An example of this approach is the study by Biggeri and Ferrannini (2014) and their O-gap analysis, in which the content of capability was identified by the population of interest through participatory group interviews among people with disabilities and their caregivers. It may not, however, be necessary to choose between these two strategies. The reason is that the categories of doings and beings included in the various lists may be defined at a very general level, in which case it is not immediately possible to decide what a commitment to these broad categories would entail in concrete situations. For this, the broad categories need to be specified (Richardson 1990, 2018). Hence, the subject of a deliberation would be how these broad categories are best specified, rather than defining the broad categories themselves.
The use of lists, then, can be combined with organising a deliberative process, provided that the presumed valued modes of doing and being are phrased in a sufficiently general way.

Capturing Capability Change
Capability may be conceived as a measure of freedom: the real opportunities people have to be and do things they have reason to value. Asking whether there is evidence that people's capability has expanded is therefore tantamount to asking whether their freedom has been enlarged. When there are indications that people have gained a clearer understanding of what constitute doings and beings that represent value to them, and that they have gradually expanded activities in such domains, this may be taken to suggest that their freedom has, indeed, increased. Exploring concomitant changes in resources and conversion factors can, then, shed light on how such change was brought about. What is being assumed here is that capability expansion is expressed in observable changes in people's doings and beings and the associated possibility conditions (resources and conversion factors). The task of the researcher is to make sure that such changes, if present, have been accurately established, and that they can be plausibly ascribed to the program that was deployed (for the latter, see below). This assumption implies that the researcher is not, or not solely, dependent on perceptions and reports by members of the target audience about changes in capability.
However, such conditions may not always hold. Programs may have resulted in removing obstacles for capability, without this being translated in altered behavior of the target audience. In such cases, researchers may need to rely on respondents' judgments or experiences, as reported in interviews or surveys. In order to enhance the reporting of these procedures, we urge researchers to provide an account of the population's capability disadvantage, the norm applied to this population and its determination, and the approach to determining capability change. Practical suggestions as to how this may be achieved are presented in Appendix D.

Causal Attribution
As a specific type of intervention research, capability impact assessment cannot avoid making causal claims. Two types of such claims may be distinguished: causal attribution of observed changes, differences, or trends in capability to the program under study, and claims regarding changes in resources, conversion factors, functionings and their mutual interactions acting as constraints on or affordances for capability development. In quantitative intervention studies, randomisation, blinding, (placebo) control and statistical analysis are the chief means of rendering confounding, bias and chance less likely explanations of observed changes or differences. Our review showed that such measures are rarely, if at all, used: of the 71 studies, 7% included a prospective design, 18% used a control group, and none used random assignment of subjects or blinding. Researchers might rebut that such measures are largely inappropriate in this particular field of research, or simply unfeasible. That may well be the case, seeing that no studies in our review blinded or randomised their designs. We would then, however, expect other types of measures to support a causal interpretation of the findings, and these, we found, are sparse as well. For example, only two studies included both a prospective design and a control group (Mariscal Avilés, Benítez Larghi, and Martínez Aguayo 2016; Mauro, Biggeri, and Grilli 2015), and the latter was one of the two quantitative designs.
It needs to be acknowledged, however, that the use of qualitative research in establishing causal relations is equivocal (Maxwell 2004). In quantitative research, the focus is on discovering patterns in data (regularities and irregularities), allegedly produced by a causal mechanism that is itself not directly observable. Causal attribution is considered more likely if competing explanations (confounding, bias, chance) can be ruled out using the sort of measures mentioned above. More often than not, researchers remain agnostic regarding the exact nature of the causal mechanism itself (black box evaluation; see, for instance, Ramaswamy et al. 2018). A different notion of causality, seen more often in qualitative research, holds that at least certain aspects of causal mechanisms can be empirically observed. On this view, these empirically observable elements are not merely traces of some causal mechanism that is or has been at work, but constitute elements of the causal mechanism itself (Maxwell 2004). The focus, here, is not merely on establishing patterns in the (qualitative) data, but also on making inferences about the likely nature of the underlying causal mechanism through abductive reasoning (Aliseda 2009).
Clearly, when adopting this strategy, researchers can be led astray in two different ways: firstly, by erroneously making claims about patterns in the observed data and, secondly, by drawing wrong conclusions about the nature of the alleged causal mechanism. Strategies that have been suggested (e.g. by Maxwell 2004) to protect researchers from making such errors include:

- long-term and deep involvement of researchers in the practice being studied;
- the production of rich data, revealing various aspects of the objects or processes being studied;
- development of an account that places a wide range of findings in a coherent framework;
- making observations on phenomena in a number of different ways, e.g. through participatory observation, interviewing, surveys, and document analysis (triangulation);
- searching for discrepant evidence, that is, findings that seem to challenge either the alleged pattern in the data or the proposed explanation; and
- member checking.
Jointly, these recommendations may be considered a strong plea for so-called theory-driven evaluation (Chen 2012). We have incorporated these recommendations in a brief guidance for capability impact assessment (Appendix D).

Reflection on the Evaluative Framework that we Employed
The framework that we used to appraise capability impact studies corresponds with criteria for validity assessment that have been suggested in the literature, even though the phrasing may differ. As such, we think the framework is a reasonable starting point, with one important exception. We held that, generally speaking, prospective research would be preferable to retrospective research, the obvious reason being that prospective research is not afflicted with the complications posed by recall bias. However, this may not be entirely true in the case of capability impact assessment. If, as discussed above, researchers need to some extent to rely on respondents' reports of capability change, a prospective approach may be quite problematic: it would require that respondents are asked to reflect, prospectively, on what would constitute, for them, valuable doings and beings, and how that compares to their current situation. When a study is conducted retrospectively, respondents can be asked to indicate how their daily life has changed in relation to the roll-out of the program. This may be an easier task and, therefore, constitute a more valid approach. Having collected such information, it is up to the researchers to relate the reported changes to broader categories of valued modes of doing and being. This approach was taken by, for instance, Alkire (2002), Cabraal (2010), and Powell (2012).

Limitations
A key limitation of our study is that we do not know whether we have missed important capability impact studies and, if so, how that would have affected our conclusions. Indeed, there may be studies that we were unable to retrieve with our search strategy that are more consistent with the criteria we proposed. If that were the case, it would be, in a way, good news, and the situation would be less dispiriting than our review suggests. Our conclusion would, however, still be that a sizeable number of studies in this area do not explicitly address the various common and unique challenges associated with capability impact assessment. Having said that, we do wish to point out that the diversity among the studies that we did find was substantial and included studies that can serve as inspiring examples for future work.
There are three further limitations that we wish to acknowledge. Firstly, we did not differentiate between individual and collective capabilities (Ibrahim 2006). The focus in the studies included in this review seems to have been on individual capabilities, but this may be at least partly due to our not making this distinction in our search of the literature.
Secondly, our focus has been on capability (how well people's lives are going), and not on agency (who or what controls them) (Crocker and Robeyns 2009). As such, this review is necessarily silent on whether, and if so how, researchers have also addressed the issue of agency in the context of capability impact assessment.
Thirdly and finally, it is important to note that the goal of our study was to see whether researchers acknowledge the varied and considerable challenges associated with capability impact assessment and, if so, how they try to meet them. This practice-based focus incurs a limitation, of course: methods that could be quite useful in this respect but that have not yet been used in reports of capability impact assessment failed to appear in our search. These include, for example, Krishnakumar's work on structural equation modelling and Andreassen and Tommaso's work on random utility models and Bayesian stochastic frontier models (Andreassen and Tommaso 2018; Henderson 2022; Krishnakumar and Wendelspiess Chávez Juárez 2014).

Strengths
This article addressed the pertinent issue of evaluating the impact of programs on capabilities. Impact evaluation has become common practice for programs aimed at enhancing people's well-being, and the development of sound practices for evaluating programs' impact on capabilities is crucial for advancing knowledge in this area. Our study endeavoured to determine the extent to which researchers evaluating impact on capabilities recognise the specific challenges involved, how they address them, and the rationale behind their decisions. The findings revealed that there is still much to learn and comprehend regarding impact evaluation concerning capabilities. We used these findings to formulate specific recommendations that researchers may want to contemplate when designing and reporting capability impact assessments (Appendix D).

Conclusion: Capability, Justice, Responsibility
This paper set out to see to what extent authors recognise and address the methodological challenges that accompany capability impact assessment. Using the framework of Arksey and O'Malley (2005) for scoping reviews, we found 71 empirical studies that reported methodology and data on how interventions impacted the beneficiaries' capability. In these studies, much was generally left to be desired in the areas of causal attribution, clear reporting on the justification of capability content, and the inclusion of the constitutive elements of capability.
Writing on the responsibilities that are associated with effective power, Sen observes that

if someone has the power to make a difference that he or she can see will reduce injustice in the world, then there is a strong and reasoned argument for doing just that … Freedom in general and agency freedom in particular are parts of an effective power that a person has, and it would be a mistake to see capability, linked with these ideas of freedom, only as a notion of human advantage: it is also a central concern in understanding our obligations. This consideration yields a major contrast between happiness and capability as basic informational ingredients in a theory of justice, since happiness does not generate obligations in the way that capability inescapably must do, if the argument on the responsibility of effective power is recognized. (Sen 2009, 270-271)

Although Sen seems to have individual citizens in mind here, the reasoning could also be applied to governments that, one might assume, have "the power to make a difference" in the sense described above. Evaluating programs for their potential to strengthen the capability of their target audience can be conceived as a means to help governments "see" where and how they can make a difference. It also draws the findings of such evaluations into the realm of justice: if a program can be demonstrated to help groups whose capability is compromised to overcome constraints and expand their capability, this is not merely a nice thing to do, but a moral obligation. We have seen that conducting such studies well is a huge challenge.
Given the complexities involved, programs that are enacted to expand the capability of their target audience are likely to be effective for only some of the members, in some respects, some of the time. Perhaps the key object of capability impact studies would be to better understand this heterogeneity, making it possible to help people develop their capability more effectively in the future. Articulating program theory, making the content of capability explicit, providing justification for the proposed specification of capability, and paying closer attention to issues of causality are, in our view, promising ways of achieving this end.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Notes on contributors
Wouter Rijke, holding a Master's degree in Psychology, is a researcher whose doctoral research focused on the operationalisation and application of the capability approach. In particular, his research investigated the effect of deafness and hearing technologies on the capability of children, teenagers, and adults. Currently, he is a researcher at the research group Learning with ICT at the HAN University of Applied Sciences, where he designs and evaluates educational practices with and about ICT.
Jan Meerman studied Psychology (BSc) and Medicine (MSc) at Groningen University, the Netherlands. Currently he is a medical intern in Psychiatry at Dimence Groep. In the last decade he has conducted multiple projects in relation to the Capability Approach at Groningen University, in Sierra Leone, at Tilburg University, and at Radboud University Nijmegen Medical Centre, and he is currently finishing his PhD at VU Amsterdam. His main topic of interest is the operationalisation and application of the Capability Approach in mental health care. He is the author of two other peer-reviewed scientific publications regarding a capability perspective on sustainable employability.
Bart Bloemen holds a BASc in bio-informatics, BA in philosophy and MSc degree in molecular life sciences. Currently, he is a PhD candidate and associate principal lecturer in Health Technology Assessment (HTA) at Radboud University Medical Centre, Nijmegen, the Netherlands. His research activities focus on the ethics of HTA: normative decisions involved in the conduct of a technology assessment, the interplay between facts and values in assessments, and the role of evaluative frameworks such as utilitarianism and the capability approach. He was also a member of the VALIDATE consortium, an EU Erasmus + project that developed an approach for integrating values in HTA.
Sridhar Venkatapuram is an inter-disciplinary academic-practitioner in global health ethics and justice. He is an Associate Professor at King's College London. Since the early 1990s he has worked with the WHO (HQ), NHS, Wellcome Trust, BMA, Human Rights Watch, and others. He lectures widely and publishes research on public health and global health ethics; global and health justice philosophy; the capabilities approach; and social determinants of health. His Twitter handle is @sridhartweet.
Jac van der Klink MD, PhD, MSc is emeritus professor of mental health at work and sustainable employability at Tilburg University, Tranzo, and extraordinary professor at North-West University of South Africa, Optentia. He graduated in medicine, clinical psychology, and social and organisational psychology, and he received postgraduate qualifications in general practice and occupational health. His PhD was on mental health in occupational practice.
He worked as a physician in Ghana and as a general practitioner in the Netherlands. Subsequently, he worked in occupational health practice, among other roles, and as scientific director at the Netherlands School of Public and Occupational Health (NSPOH). From 2006 to 2014 he was full professor of occupational health at the University Medical Center Groningen.
His research focus is on mental health at work, on sustainable employability and on application of the capability model in these domains. He chaired the working group that applied the capability approach to work and developed the Capability Set for Work.
Gert Jan van der Wilt is a professor and head of Health Technology Assessment (HTA) at Radboud University Nijmegen Medical Centre. His research focuses on developing and validating HTA research methods, including n-of-1 trials, Bayesian methods, and ethical and social issues associated with healthcare technology. He is a member of various national and international committees, including the Editorial Board of the International Journal of Technology Assessment in Health Care and the Scientific Development and Capacity Building committee of HTAi. He has participated in various European projects on HTA and coordinated VALIDATE, a project funded by the EU for developing novel methods and frameworks for HTA. He has published over 250 peer-reviewed scientific publications and has an H-index of 43. In addition to his academic work, he was a visiting scientist at the TH Chan School of Public Health, Harvard University, and a fellow at the Institute of Advanced Study, University of Durham, and at the Fondation Brocher in Geneva.