Combatting Human Trafficking since Palermo: What Do We Know about What Works?

ABSTRACT In 2016, there were an estimated 40.3 million victims of modern slavery in the world, more than were enslaved during the Transatlantic Slave Trade. Since the adoption of the 2000 UN Trafficking Protocol, inter-governmental agencies, governmental agencies, international non-governmental organizations (INGOs), and domestic non-governmental organizations (NGOs) have made numerous efforts to combat the phenomenon of human trafficking through legal-institutional means, direct interventions, and programs of support for those exploited. This anti-trafficking work has paid varying degrees of attention to the principles and methods of monitoring, evaluation, and impact assessment, but has often been subject to end-of-project evaluations. Similar to findings of reviews of evaluations in the international development sector, evaluations of anti-trafficking programing have primarily focused on assessing the progress of project implementation and the achievement of outputs, rather than tracking the achievement of outcomes or impact. This is further complicated by the hidden nature of human trafficking and the trauma experienced by human-trafficking victims. As a consequence, despite some evidence of raised awareness and increased levels of funding, organizations are still struggling to demonstrate impact and discern what works to combat human trafficking. This article analyses the evaluations of counter-trafficking programing produced since the Protocol to draw conclusions regarding the lessons learned from these interventions and the methods used to monitor and evaluate human-trafficking programs. By highlighting gaps, this article provides a series of suggestions on how to better track progress and impact toward the elimination of modern slavery.


Introduction
Nearly 20 years on from the ratification of the Protocol to Prevent, Suppress, and Punish Trafficking in Persons Especially Women and Children (the Trafficking Protocol) (United Nations Convention against Transnational Organized Crime and the Protocols Thereto (CTOC), 2004), it is time to take stock and ask if we are any closer to answering the question, "what works to combat human trafficking?" There were an estimated 40.3 million people in forced labor and forced marriage in the world in 2016 (International Labour Organization, Walk Free Foundation, & International Organization for Migration, 2017). While it is difficult to estimate how many of these individuals were also victims of human trafficking, these remain the most accurate figures estimating related forms of exploitation, representing more people in slavery than during the Transatlantic Slave Trade (Voyages, 2019).
There have been numerous efforts to combat human trafficking and other types of exploitation at the international, regional, domestic, and local levels pursued by a variety of agencies and organizations. These groups combat human trafficking through legal-institutional means (international law and domestic statutory legal frameworks), direct interventions, and programs of support for those affected by exploitation. Taken together, these concerted efforts have paid varying degrees of attention to the principles and methods of monitoring, evaluation, and impact assessment.
Systematic reviews are one method of offering insight into what is effective across comparable programs. Walk Free conducted one such review in 2015, called the promising practices database, identifying 179 evaluations of modern slavery programing and interventions to tackle related phenomena such as female genital mutilation/cutting (FGM/C), child labor, and early marriage. The findings from this and similar reviews of anti-trafficking interventions so far show that the quality of evaluations needs to improve before a clear understanding of what works can become a reality (Bryant & Joudo, 2018; Davy, 2015, 2016; Hames, Dewar, & Napier-Moor, 2010; van der Laan, Smit, Busschers, & Aarten, 2011).
To take stock of anti-trafficking interventions and monitoring and evaluation methods, this article analyses the findings of a promising practices database (Bryant & Joudo, 2018), drawing on 90 evaluations of programs to combat human trafficking produced between 2000 and 2015. This article highlights some of the shortcomings of monitoring and evaluation in anti-trafficking programing, some of which are common to international development more broadly and some of which are specific to anti-human trafficking interventions. For example, many evaluations have focused on assessing the progress of project implementation and the achievement of outputs, rather than tracking the achievement of outcomes or impact, while the concealed nature of human trafficking makes it difficult to establish baselines against which to track progress. As a consequence, organizations are still struggling to demonstrate impact and discern what works to combat human trafficking and modern slavery. 1 However, while concrete answers to the question "what works" remain elusive, a thematic analysis of the 90 evaluations points to some lessons learned regarding specific interventions and methods for conducting monitoring and evaluation. While not proven, these lessons point to some "promising" practices, in both interventions and methods of evaluation, to reflect upon some twenty years after the ratification of the UN Trafficking Protocol.

A Note on Definitions and Terminology
This article refers to human trafficking as defined by the 2000 UN Trafficking Protocol, but recognizes that there has been a shift in recent years to refer to human trafficking as "modern slavery." Human trafficking remains the focus of this article, but where modern slavery is referenced, it is used as an umbrella term that encompasses many different forms of exploitation, including slavery, human trafficking, forced labor, debt bondage, forced or servile marriage, and the sale and exploitation of children. In this sense, we follow the International Labour Organization (ILO) and Walk Free definition that "essentially, it [modern slavery] refers to situations of exploitation that a person cannot refuse or leave because of threats, violence, coercion, abuse of power, or deception" (International Labour Organization, Walk Free Foundation, & International Organization for Migration, 2017; Walk Free Foundation, 2018).
Regarding monitoring, evaluation and impact assessment: international development projects, including those tackling human trafficking, should include a well specified set of objectives, milestones within the overall framework, key performance indicators (KPIs), outcomes, and impact, along with a budget broken down into categories of expenditure over time. The process of monitoring is therefore dedicated to the periodic collection of information on the different aspects of the project to see if it is on course to deliver its main outcomes and impact, and whether its activities are within budget. Evaluation is related to monitoring, but adds an additional focus on project relevance, efficiency, effectiveness, impact, and sustainability, as well as demonstrable evidence that can be assessed and analyzed (OECD/DAC, 1991). An evaluation is thus an assessment of the whole project cycle, and can be done with an inception phase, a mid-term review, and a final review. It focuses on both the achievements of the project and whether its activities had positive, negative, and/or unintended consequences.
1 In November 2018, the Gilder Lehrman Center for the Study of Slavery, Resistance, and Abolition at Yale University held an international conference, "Fighting Modern Slavery: What Works?", which brought together academics, NGOs, government agencies, and private sector organizations to examine different efforts to combat modern slavery. The conference covered much ground, but it was clear that despite the increased attention to the issue of modern slavery, much more work is needed to understand how and in what way anti-slavery interventions can make a demonstrable impact.
Impact assessments, 2 by contrast, have rarely been conducted in the anti-trafficking field. They are designed to analyze the impact of purposive interventions, where impact is defined as "positive and negative, primary and secondary long-term effects produced by a development intervention, directly or indirectly, intended or unintended" (OECD/DAC, 1991), and they focus on the "long term and sustainable changes introduced by a given intervention" (Blankenberg, 1995) rather than the immediate consequences. As such, impact differs from the outputs or the outcomes of an intervention and should be separately assessed, even though the three elements may significantly affect each other.

Materials and Methods
The promising practices database is a systematic review of evaluations of anti-modern slavery programing and related phenomena from 2000 to 2015; a 2019 update of the database has recently begun. The starting point for the development of the database was to seek an answer to the question, "what works to combat modern slavery?" The theory was that since there have been many interventions and organizations involved in the fight against modern slavery since the ratification of the UN Trafficking Protocol, there must be some common lessons that can be drawn regarding what works.
The anti-modern slavery field is diverse and operates at different levels with varied expertise. Inter-governmental agencies, such as the ILO and the United Nations Office on Drugs and Crime (UNODC), work to set standards, promulgate international legal instruments, and advocate for ending human trafficking through their formal institutional structures. These actions are reinforced by the 2015 Sustainable Development Goals (SDGs), including targets 8.7, 16.2, 5.3, and 10.7 (UN General Assembly, 2015). Government agencies take concrete steps to translate these international conventions and SDGs into legislative and policy action. An estimated 136 countries, for example, have enacted anti-trafficking legislation since the early 2000s (Minderoo Foundation, 2019). International non-governmental organizations (INGOs) work to produce an independent evidence base, raise awareness, advocate, and deliver programs to help lift people out of human trafficking, and provide sustainable solutions for survivors to be re-integrated into the economy or society of their respective (or, due to human trafficking's often transnational nature, new) countries. Other organizations have an exclusive focus on supply chains as part of larger corporate social responsibility, sustainability, and business and human rights approaches, where the "Ruggie Principles" (OHCHR, 2011) are adapted to include the problem of modern slavery in business. At the local level, many secular 9 and faith-based civil society organizations (CSOs), as well as social movement organizations and networks, work with local governmental authorities, including law enforcement, to share information, raise awareness, advocate, and provide support in the community for human-trafficking victims.
2 The logic of impact assessments has evolved initially from Environmental Impact Assessment (EIA), to Social Impact Assessment
Given prior and existing activity, and as new actors such as the Freedom Fund and the Global Fund to End Modern Slavery enter into the anti-modern slavery policy area, the promising practices database aimed to draw out the lessons learned from the work conducted to date. Previous reviews of evaluations have identified a limited number of robust evaluations and impact assessments conducted in the anti-trafficking field (Davy, 2016;van der Laan et al., 2011). However, most projects and interventions to combat trafficking and related forms of exploitation are subject to log frame programmatic documents, donor reporting, mid-term reviews, and end of project evaluations. By broadening our search to include end of project evaluations, as well as any impact assessments that exist, it was hoped that lessons could be determined in terms of what works, and just as importantly, what does not work to combat modern slavery.
Throughout 2015, a team of researchers based at Walk Free conducted systematic searches of gray and academic literature to identify these evaluations. An evaluation was defined broadly to capture donor reports and end of project evaluations, using the definition: "evaluation measures progress towards outputs, or change in outcomes, or an assessment of an impact, of a development programme, policy, or intervention". Various academic databases and international organization websites and databases were searched using the search criteria. Search terms included truncated versions of "traffic" and "evaluation", or "slavery" and "evaluation", or "forced labour" and "evaluation". Further evaluations were provided to the team by international organization partners after summaries were identified in relevant databases. Knowing that the number of evaluations would be limited, but that lessons could be learned from evaluations of interventions supporting groups vulnerable to modern slavery, search terms were expanded to include refugees, internally displaced persons (IDPs), migration, and female genital mutilation/cutting (FGM/C). Initial searches and review of titles and abstracts led to the inclusion of 410 potential evaluations, reduced to a total of 344 for review after the removal of duplicates. Another 165 were then removed, leaving a total of 179. These 165 were removed because they were descriptions of programs, literature reviews, lists of good practice determined by other organizations, mid-term evaluations, formative (or pre-assessment) evaluations, not in English, summaries of a larger document which was not available, annual reports rather than evaluations, or did not include an explicit methodology of how the evaluations were conducted. Explicit methodologies were defined as the inclusion of a methodology section or a description of the actions taken by the individual or team to conduct the evaluation.
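The search and screening steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the truncation stems and the function name `matches_search` are assumptions, since the exact search strings are not reported here, while the counts are those given in the text.

```python
import re

# Assumed truncation stems; the exact search strings are not listed in the text.
TOPIC_STEMS = [r"traffic\w*", r"slavery", r"forced\s+labou?r"]
EVALUATION_STEM = r"evaluat\w*"


def matches_search(text: str) -> bool:
    """Return True if the text has a topic term AND an evaluation term."""
    has_topic = any(re.search(p, text, re.IGNORECASE) for p in TOPIC_STEMS)
    has_evaluation = re.search(EVALUATION_STEM, text, re.IGNORECASE) is not None
    return has_topic and has_evaluation


# Screening counts as reported in the text.
identified = 410   # potential evaluations after title and abstract review
after_dedup = 344  # remaining after duplicates were removed
excluded = 165     # removed against the exclusion criteria

duplicates = identified - after_dedup
included = after_dedup - excluded

print(f"duplicates removed: {duplicates}")
print(f"final database size: {included}")
```

The arithmetic confirms that the reported figures are internally consistent: the 344 records reviewed, less the 165 excluded, give the 179 evaluations in the final database.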
Lists of good practice determined by other organizations were removed because their criteria for what constituted a good practice were inconsistent with, or contradicted, those used in this work. Systematic reviews of evaluations, or reviews of reviews, were also excluded from the final database. The final 179 evaluations are housed in an EndNote file and can also be downloaded from the Minderoo Foundation website.
9 The Global Modern Slavery Directory provides a list of 2,978 NGOs working on modern slavery by country.
Once these evaluations were housed in the database, they were classified according to term lists organized under the following categories:
• type of modern slavery,
• sector of exploitation,
• target populations,
• country/region,
• type of program and activities,
• independent/internal,
• evaluation methodology,
• did the program meet its objectives?
Each evaluation also included a free-text write-up of the program objectives and evaluation findings.
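One way to picture a database entry built from these categories is as a simple record. The sketch below is hypothetical: the class name, field names, and the two example records are invented for illustration and do not describe the actual EndNote file or any real evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvaluationRecord:
    """Hypothetical record mirroring the classification categories."""
    title: str
    slavery_types: List[str]
    sectors: List[str]
    target_populations: List[str]
    regions: List[str]
    program_types: List[str]
    independent: bool      # True if independent, False if internal
    methodology: str
    met_objectives: str    # e.g. "Yes", "Inconclusive", "Unclear"
    summary: str = ""      # free-text write-up of objectives and findings


def filter_by_program_type(records, program_type):
    """Return the records tagged with the given program type."""
    return [r for r in records if program_type in r.program_types]


# Invented example records, for illustration only.
records = [
    EvaluationRecord(
        title="Example evaluation A",
        slavery_types=["human trafficking"],
        sectors=["domestic work"],
        target_populations=["migrant workers"],
        regions=["Asia Pacific"],
        program_types=["supporting government", "risk-based prevention"],
        independent=True,
        methodology="post-test, qualitative review",
        met_objectives="Inconclusive",
    ),
    EvaluationRecord(
        title="Example evaluation B",
        slavery_types=["human trafficking"],
        sectors=["agriculture"],
        target_populations=["children"],
        regions=["Africa"],
        program_types=["service delivery and coordination"],
        independent=False,
        methodology="post-test, qualitative review",
        met_objectives="Unclear",
    ),
]

government_support = filter_by_program_type(records, "supporting government")
print([r.title for r in government_support])
```

Because an evaluation can carry several terms under one category, the multi-valued fields are lists, which is also why category counts can sum to more than the number of evaluations.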
The development of term lists was an iterative process that drew upon the content of the evaluations and predetermined terms that could be used to search the final database. For example, the development of the term lists for program type and activities drew on existing anti-modern slavery frameworks and the types of interventions as described by the evaluations. There is no agreed international framework to describe interventions to tackle modern slavery. Combatting modern slavery draws heavily on trafficking frameworks, which predominantly follow a criminal justice approach, as defined by the "3 P" framework of protection, prevention, and prosecution in the Trafficking Protocol (CTOC, 2004). The "3 P" framework is further iterated by Section 108 of the 2000 U.S. Trafficking Victims Protection Act (TVPA) and its subsequent reauthorizations (US Department of State, 2018) and frameworks and toolkits developed by UNODC (UNODC, 2008, 2009). However, not all modern slavery interventions follow a criminal justice approach. Categorizing program type therefore began by identifying the activities described by the evaluations and then grouping these activities into types of programs based on their commonalities. After testing a sample of the evaluations using these term lists, two members of the research team independently categorized all 179 evaluations. The final term list for types of intervention was:
• supporting government,
• service delivery and coordination,
• research,
• business transparency,
• economic empowerment,
• risk-based prevention.
A similar exercise by the Inter-Agency Coordination Group against Trafficking in Persons (ICAT) in 2016 identified five intervention areas, which overlap with the Walk Free categories (in brackets in the list below). These were:
(1) Increase identifications, increase relevance and quality of support, strengthen victim-centered responses, improve compensation (service delivery and coordination).
(2) Address context-based risk, develop safer migration pathways and actions, strengthen social protection (risk-based prevention).
(3) Improve data and data systems, increase learning, develop quality standards (research).
(4) Reduce enabling environment, shrink markets for goods/services PBTL, clean supply chains, strengthen financial systems barriers (business transparency/economic empowerment).
(5) Strengthen legal frameworks and norms, disrupt more trafficking networks, increase (strategic) prosecutions and confiscations (supporting government) (ICAT, 2016).
Terms to assess the evaluations' methodology were adapted from the Maryland Scientific Methods Scale. While the meta-evaluation literature has not reached a definitive consensus on the best means to compare the quality of methodology in outcome evaluations, the Maryland Scientific Methods Scale is the leading approach (Farrington, Gottfredson, Sherman, & Welsh, 2002). It is a simple 5-point scale developed in the field of criminology to assist the assessment of the scientific validity of criminological interventions. It has the benefit of being a simplified framework to assess the methodology employed by evaluators, prioritizing internal validity, establishing causal order and eliminating external variables. As Table 1 illustrates, the minimum level of methodology for results to be considered reliable is level 3: an evaluation designed with pre- and post-test measures and a comparable control. The highest level of validity is attained by testing the intervention with a randomized controlled trial (RCT) (level 5).
The use of this scale is not without controversy, in particular, because of the debate on the ethics and reasonableness of RCTs in development work (Burrell, 2012; Harkins, 2017). As described above, interventions in the anti-modern slavery sector are also diverse and cannot always be assessed using a criminological approach. A modified scale was used in the development of the promising practices database to suit this diversity better, borrowing from the Maryland Scale, but recognizing that pre- and post-assessments and RCTs are expensive and difficult to implement for anti-modern slavery work. The final methodology categories added in important factors for these evaluations, including participative elements and qualitative methods. There is scope for further work on a modified scale to apply to evaluations in anti-modern slavery work.
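To make the reliability threshold concrete, the sketch below encodes the five levels of the standard Maryland Scientific Methods Scale and the "level 3 or above" rule. It is an illustration under stated assumptions: the level names paraphrase the published scale, the `classify` helper and its parameters are hypothetical, and it does not reproduce the modified scale used for the database, whose additional categories (participative elements, qualitative methods) are not modeled.

```python
from enum import IntEnum


class MarylandLevel(IntEnum):
    """Paraphrased levels of the standard (unmodified) Maryland scale."""
    CORRELATION_ONLY = 1           # single point in time, e.g. post-test only
    BEFORE_AFTER_NO_CONTROL = 2    # pre- and post-test, no comparable control
    BEFORE_AFTER_WITH_CONTROL = 3  # pre- and post-test with comparable control
    MULTI_UNIT_CONTROLLED = 4      # multiple treated and control units
    RANDOMIZED_TRIAL = 5           # random assignment (RCT)


# Minimum level at which results are treated as reliable.
RELIABILITY_THRESHOLD = MarylandLevel.BEFORE_AFTER_WITH_CONTROL


def is_reliable(level: MarylandLevel) -> bool:
    return level >= RELIABILITY_THRESHOLD


def classify(pre_test: bool, post_test: bool, comparable_control: bool,
             multi_unit: bool = False, randomized: bool = False) -> MarylandLevel:
    """Hypothetical helper mapping design features to a level.

    Simplification: level 4 also requires controlling for other
    variables, which is not checked here.
    """
    if randomized:
        return MarylandLevel.RANDOMIZED_TRIAL
    if pre_test and post_test and comparable_control:
        if multi_unit:
            return MarylandLevel.MULTI_UNIT_CONTROLLED
        return MarylandLevel.BEFORE_AFTER_WITH_CONTROL
    if pre_test and post_test:
        return MarylandLevel.BEFORE_AFTER_NO_CONTROL
    return MarylandLevel.CORRELATION_ONLY


# A post-test-only design with no control group falls below the threshold.
post_only = classify(pre_test=False, post_test=True, comparable_control=False)
print(post_only.name, is_reliable(post_only))
```

Under this simplified mapping, a post-test evaluation without a control or comparison group sits at the bottom of the scale, two levels short of the reliability threshold.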

Results
Ninety evaluations in the database were categorized as "human trafficking". This occurred when "human trafficking" or "trafficking in persons" was included within the document regardless of whether the phenomena described constituted human trafficking according to definitions outlined in the Trafficking Protocol. Often the evaluation did not include the original program document, or gave only the briefest description of what the program aimed to tackle, so it was difficult to verify whether the program was attempting to combat human trafficking.
Of these 90 evaluations, the majority were conducted in Europe and Central Asia (n = 72), followed by Asia Pacific (n = 52) and Africa (n = 36) (refer to Figure 1). The fewest were conducted in Arab States (n = 1) and the Americas (n = 20). Eight evaluations were tagged as "global" and referred to evaluations of global programs or global frameworks to combat human trafficking. As evident in Figure 2, the number of evaluations identified has increased since the early 2000s, with peak numbers of evaluations conducted in 2006 (11) and 2012 (18). The low numbers of evaluations in 2015 are likely due to publication lag, as searches were completed by late 2015.
[Table 1 source note: Adapted from Farrington et al. (2002).]
The most common types of interventions, as depicted in Figure 3, were those that supported the government (n = 60), provided service delivery and coordination (n = 53), and risk-based prevention (n = 44), or a combination of all three (n = 20). These arguably align with the anti-human trafficking "3 P" framework of prosecution, protection, and prevention. The interventions aimed to tackle exploitation in the following sectors: sex work (n = 37), domestic work (n = 17), agriculture (n = 12) and begging (n = 10), among others (refer to Figure 4). Forty-three evaluations did not specify which sector the program was focusing on, but instead stated that the interventions aimed to tackle human trafficking more generally. The categorization of different sectors was fairly generous and was based on the description of any sector within the evaluation, often found in the description of the program and its objectives. As is evident in Figure 5, half the evaluations (n = 45) concluded that the program was successful, with the achievement of some program objectives or outcomes. The term "Inconclusive" was used when the author of the evaluation determined that the results were inconclusive, while "Unclear" was used when the research team categorizing the evaluations was unable to determine whether objectives had been met. The standard of evaluation was fairly low (refer to Figure 6), with only two evaluations reaching level three on the Maryland scale; none were categorized above level three. The vast majority were tagged as post-test, without a control or comparison group, and with a qualitative review of documents, interviews and/or case studies produced during program implementation.
Finally, the majority of evaluations conducted were independent (refer to Figure 7), which is to say, they were conducted by a team or individual independent of the organization that implemented the program.

Evaluation in Practice in the Anti-trafficking Field
The promising practices database and the current literature on monitoring and evaluation of anti-trafficking interventions establish patterns in current practice from which we can learn important lessons. Several systematic reviews have been conducted in the past decade, throwing light on the practice of evaluation in the sector and offering insight into gaps in our knowledge (Bryant & Joudo, 2018; Davy, 2015, 2016; Hames et al., 2010; van der Laan et al., 2011). These studies support many of the conclusions of this paper.
First, it is apparent that the number of evaluations of anti-trafficking interventions is on the rise, but that evaluation is still seen as optional. While signs point to a growing practice of conducting evaluations, there is a dearth of publicly available evaluations of anti-trafficking programs and interventions; often they are kept internal or not conducted at all (Davy, 2016). Similar findings appear even when end-of-project evaluations are included. It is telling that only 179 evaluations were identified when there could be more than 1,500 active organizations operating in the anti-trafficking and anti-modern slavery policy area. This finding remains valid even with the inclusion of more recent evaluations. The initial searches for the 2019 update to the database found an additional 11 anti-trafficking evaluations in academic and international organization databases covering the period 2016-2019. 19 Further, the full text of many of the 179 evaluations was not publicly available, with only short summaries available on websites. Those that could be accessed after direct requests were made to organizations for specific evaluations were included in the database, while far more could not be located and were subsequently removed. If there are to be lessons learned from previous anti-trafficking interventions, then the full versions of these documents should be made public.
Second, the evaluations appear not to meet established standards of evaluation, which can adversely affect their reliability and pose a barrier to evidence-based anti-trafficking work. In 73 instances, the anti-trafficking evaluations consisted of qualitative review of program documents, interviews with key stakeholders, and individual case studies. While there are benefits to the use of qualitative methods in evaluation (Patton, 2002), in practice some of these 73 evaluations were of poor quality in terms of methodology and content. Few evaluations, for example, triangulated their analyses or evaluated the reliability of documents or interviewees. The evaluations were also opaque regarding the methodology used or how their key findings were determined. There were some instances in the broader promising practices database outside of anti-trafficking interventions where an executive summary determined that a program was a success, which was not supported by a closer inspection of the full report. This limits the validity of the findings and the lessons that can be shared from these studies.
In recent years there are indications that the quality of anti-trafficking evaluations is improving. Of the 11 additional evaluations identified, five were level 3 and above on the Maryland Scale, two of which were RCTs (Archer, Boittin, & Mo, 2016;Gausman, Chernoff, Duger, Bhabha, & Chu, 2016). Those that were post reviews of project documentation, interviews, and end-line surveys were explicit about their triangulation methods. While the updated systematic review is not complete, this does point to a promising development in terms of quality.
The anti-trafficking evaluations in the promising practices database were often limited to process evaluation, giving a skewed sense of success without really informing whether an intervention is effective at producing results. The US Government Accountability Office (2011) defines four types of evaluation: (1) Process or Implementation Evaluation; (2) Outcome Evaluation; (3) Impact Evaluation; (4) Cost-Benefit and Cost-Effectiveness Analyses. Process evaluations look at whether the program was implemented according to plan, whereas Outcome and Impact evaluations go deeper into whether the program met its objectives for change. Twenty of the evaluations either determined that the program was "Inconclusive" in the achievement of its objectives or it was too difficult for the research team to determine if the program was a success (tagged as "Unclear"). This is in part due to the quality of program design. Often missing was an explicit theory of change underpinning program logic. This can lead to program objectives that are not clearly defined, are overly broad and impossible to evaluate, or are too narrow to reflect meaningful impact (Bryant & Joudo, 2018; Davy, 2016). Program objectives which are clear, meaningful, achievable, and measurable frame an entire project. They guide the development of baseline data and indicators for monitoring, and act as a yardstick for evaluation.
Of the 90 anti-trafficking evaluations in the database, 73 were conducted by independent evaluators. However, often little or no detail was given on the author of the study, their relationship to the organization or program being evaluated, or how they were funded, which may suggest that the number of independent evaluations is much lower. This is supported by Davy (2016), who found that many evaluations had been undertaken by program staff rather than an external, independent evaluator, especially in smaller NGOs. The relationship between the evaluator and the implementer of the project remains unclear in the 11 more recent evaluations. Internal evaluations are more prone to bias, and not publishing the details of the evaluator points to a lack of transparency which affects reliability.
19 Initial searches were conducted for truncated versions of "trafficking" and "evaluation" of academic, 3ie, UNODC, ILO, IOM, and DEC databases. These searches took place in August 2019 and will continue throughout the latter part of 2019.

Barriers to Effective Monitoring and Evaluation and Impact Assessments
In many instances, the findings from the review of 90 anti-trafficking evaluations highlight the same barriers to conducting monitoring and evaluation and impact assessments that are found across the international development sector. Limited resources, expertise and data, bias in methodology, ethical considerations regarding the use of RCTs, attribution challenges in conducting impact assessments, political constraints, and short-term project timescales are all barriers which prevent thorough evaluation of international development programing (Bamberger, Rugh, & Mabry, 2006), including anti-trafficking projects. Evaluations can be costly and consume many staff hours, especially when designing and implementing impact assessments. If insufficient budget has been set aside for this work, it is difficult to achieve a consistent and rigorous approach (Bamberger et al., 2006, 2012). These same issues are reflected in evaluations conducted of anti-trafficking interventions. Cost can play a role in identifying who should be involved in the evaluation process (Gallagher & Surtees, 2012), and limit the methodologies that can be employed.
Furthermore, monitoring and evaluation conducted without expertise or training on data collection or analysis, or without proper care and planning, can impede results (Bamberger et al., 2006, 2012). This is often the case where evaluations are conducted in-house rather than by an external expert, and the weaknesses show in the methodology of international development and anti-trafficking evaluations alike. Aside from issues of methodology, independence in evaluations is essential for non-biased results. Bias is more likely in evaluations conducted where funding is dependent on a positive review rather than as part of a culture of learning and improving program planning based on real evidence and measurable experience. Competitiveness for scarce resources leads to inflation of results and suppression of lessons learned regarding initiatives that may be unsuccessful. This competitiveness affects the anti-trafficking sector. A 2018 report by United Nations University revealed that Official Development Assistance to tackle modern slavery, forced labor, human trafficking, and child labor, while an increase from US$150 million in 2001, stood at approximately US$433.7 million by 2013 (Gleason & Cockayne, 2018). By way of comparison, the ILO estimated in 2014 that the profits from forced labor alone equaled US$150 billion (International Labour Office, 2014). Lessons that initiatives are unsuccessful are arguably as important as more positive learnings; it is therefore important that a safe space is created where these findings can be shared. Viewing evaluation as a separate activity to program design also hinders a culture of learning, by preventing the collection of baseline data or the incorporation of findings into a cyclical approach to program design.
Running an RCT can significantly strengthen the findings of a study and allows an appreciation of the actual impact the intervention made, controlling for other factors (CIMA Global Academic Research Programme, 2017; Gausman et al., 2016). However, it poses great ethical challenges (Burrell, 2012) that should be carefully considered in any development program, including human trafficking. Even once conducted, measurement and attribution challenges due to the complex variables to assess (CIMA Global Academic Research Programme, 2017) can also be found in impact assessments of trafficking programing (Gallagher & Surtees, 2012). Finally, international development and anti-trafficking interventions usually depend on short-term funding from a variety of sources, which means programs can only be designed to fit short-term needs, thereby preventing longitudinal studies examining the long-term impacts of a program on beneficiaries and other stakeholders.

Barriers to Effective Monitoring and Evaluation and Impact Assessments in the Anti-trafficking Sector
Despite these commonalities, there are some barriers identified in the review of the evaluations in the promising practices database that are more specific to anti-trafficking interventions. For example, the concealed nature of human trafficking makes it difficult to estimate and establish baselines against which to measure. Estimates of human trafficking are conducted by using multiple systems estimation (MSE), a form of capture-tag-recapture using administrative datasets (Bales, 2017; Bales, Hesketh, & Silverman, 2015; Bales, Murphy, & Silverman, 2019; Van Dijk, Cruyff, & van der Heijden, 2018a, 2018b, 2018c; van Dijk, van der Heijden, & Heerdink, 2016), random-sample surveys (Joudo Larsen & Diego-Rosell, 2017; Pennington, Ball, Hampton, & Soulakova, 2009; Zhang, 2012), and/or other forms of reporting on trafficking incidents. MSE has not yet been used to track change over time, while random sample surveys are costly to implement. Any impact assessment, which would aim to show reduction in prevalence over time, is therefore challenging to implement. Accounting for displacement is a related issue, when, in response to an intervention, trafficking patterns move elsewhere. While surveys could identify displacement when conducted in adjoining areas, the location where trafficking is displaced to is often difficult to predict.
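To illustrate the capture-recapture logic that underlies MSE, the simplest two-list case can be sketched as follows. This is a minimal sketch rather than the method used in the cited studies, which typically combine three or more lists and fit statistical models to the overlaps; the function names and the counts (a police list, an NGO list, and their overlap) are hypothetical and chosen only for illustration. The sketch also assumes that the two lists are independent and that the population is closed, assumptions that rarely hold exactly for administrative data on trafficking.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-list capture-recapture estimate of total population size.

    n1: individuals recorded on the first list
    n2: individuals recorded on the second list
    m:  individuals recorded on both lists
    """
    if m <= 0:
        raise ValueError("the two lists must share at least one individual")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's variant, which reduces bias when the overlap is small."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts, for illustration only (not taken from any cited study).
police_list = 200   # victims known to police
ngo_list = 150      # victims known to NGOs
on_both = 30        # victims appearing on both lists

observed = police_list + ngo_list - on_both          # distinct victims seen
estimate = lincoln_petersen(police_list, ngo_list, on_both)
hidden = estimate - observed                          # estimated unseen victims

print(f"observed: {observed}")
print(f"Lincoln-Petersen estimate: {estimate:.0f}")
print(f"Chapman estimate: {chapman(police_list, ngo_list, on_both):.0f}")
print(f"estimated hidden population: {hidden:.0f}")
```

With these hypothetical counts, 320 distinct individuals are observed and the Lincoln-Petersen estimate is 1,000, implying roughly 680 victims who appear on neither list. The smaller the overlap between lists, the larger and less stable the estimate, which is one reason why tracking change over time with this family of methods is demanding.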
Issues of displacement are highlighted by the evaluation of International Justice Mission's program "Project Lantern" in Cebu, Philippines. The project provided capacity building for law enforcement, prosecutors, judges, court personnel, government officials and service providers; developed good practice training manuals, guidebooks, and curricula; procured resources, including equipment and supplies for law enforcement, prosecutorial officials and aftercare staff; and mobilized civil society to put pressure on the justice system to take targeted action to combat human trafficking. The evaluation concluded that there was an observed reduction in the availability of children for commercial sexual exploitation in Cebu, and that Project Lantern contributed to this change (Jones, Schlangen, & Bucoy, 2010). However, qualitative data collected during the evaluation confirmed that while children were less visible, this did not necessarily signify a reduction in trafficking. Instead, trafficking had potentially been driven underground or driven out of metro Cebu to other areas. Others pointed to the arrival of the "cottage porn industry" in private homes around the timeframe of the project, which may also explain the reduced visibility of children exploited for commercial sex (Jones et al., 2010).
A further barrier lies in the complexity of human trafficking crimes, which can limit the identification of long-term changes and impact. Human trafficking is a multifaceted crime, which encompasses many perpetrators, routes, sectors, victims, and forms of exploitation. However, nearly 50% of the evaluations (n = 43) did not specify which sector or form the program was focusing on, but instead stated that the interventions aimed to tackle human trafficking more generally. This highlights one of the weaknesses with the design of trafficking programs and subsequent evaluations: without a clear understanding of the sectors or forms of exploitation that the program was aiming to address, it is difficult, if not impossible, to identify exactly which change is desired and therefore which intervention will or will not have an impact.
This complexity is also difficult to translate into survey instruments for the purposes of pre- and post-assessments. For instance, the Harvard FXB Center for Health and Human Rights conducted a comprehensive impact assessment of the Freedom Fund's project in Northern India aiming at eradicating forced labor and increasing socio-economic benefits in the targeted community (Gausman et al., 2016). The study, conducted 3 years after the beginning of the project, benefitted from a strong methodological framework, using both qualitative and quantitative indicators, a baseline study and a comparison group. The comparison group, with similar features, did not receive any direct intervention until later in the implementation of the program (Gausman et al., 2016). The end-line survey found no evidence of trafficking in any of the groups that received either full intervention, partial intervention, or no/limited intervention. Qualitative interviews conducted at the same time, however, revealed that trafficking was a problem, suggesting that the lack of evidence at the end-line could be due to the hesitation in "providing frank answers or difficulty in understanding exactly what was being asked" (Gausman et al., 2016).
The level of trauma victims of human trafficking experience can make an effective evaluation more difficult. Several studies (Altun, 2017; Hossain, Zimmerman, Abas, Light, & Watts, 2010; Kiss et al., 2015) indicate that those who have experienced trafficking show high levels of depression, anxiety, and post-traumatic stress disorder. This can preclude any ability or wish to engage in evaluation processes and can hinder longitudinal studies. One exception to this is a ten-year research project launched by Chab Dai in 2010 to better understand reintegration for survivors of trafficking for sexual purposes. The project releases reports one or two times a year, providing a balance of current, continued outputs with long-term investment in research on the process of reintegration (Tsai, Vanntheary, & Channtha, 2018). While the study boasts high retention rates (76% in 2013, the most recent data that could be found; Miles, Sophal, Vanntheary, Channtha, & Phally, 2014), it attributes this to the high levels of trust built during the course of the project. This trust was built by conducting interviews three times per year in the initial year of study, maintaining a database of participants' contacts, as well as "being available by phone for contact 24/7" (Miles et al., 2014). While commendable, such resource-intensive approaches remain the exception rather than the rule.
A final barrier is the specific unintended consequences of anti-trafficking interventions. Much has been written of the linkages between human trafficking programing and anti-sex work policies (Kempadoo, Sanghera, & Pattanaik, 2005; Lerum & Brents, 2016), while recent improvements in governmental responses to modern slavery (Minderoo Foundation, 2019) have been undermined by the concurrent creation of hostile migration policies that exacerbate vulnerability (David, Bryant, & Joudo Larsen, 2019; Galos, Bartolini, Cook, & Grant, 2017; Minderoo Foundation, 2019). Accordingly, scholars and practitioners have highlighted the importance of conducting impact assessments on the unintended consequences of an anti-trafficking intervention (Gallagher, 2007; Global Alliance Against Trafficking in Women (GAATW), 2007), including adverse impact on the enjoyment of human rights. Conducting a human rights impact assessment is then crucial to be able to identify violations of the right to liberty after having freed enslaved or trafficked women (GAATW, 2007) or the adverse impact of anti-trafficking programmes on the right to health of sex workers (Ahmed & Seshu, 2012), as just two examples of potential human rights violations.

So, 20 Years On, Do We Know Nothing about What Works?
The ability to draw concrete conclusions regarding what works to combat human trafficking is hampered by the quality of anti-trafficking programing and of the subsequent evaluations. While there appears to be an improvement in the quality of evaluations conducted since 2016, the number of quality evaluations remains limited.
However, it is not the intention of this paper to throw out all findings from the review of the anti-trafficking evaluations. To disregard all programing and evaluations is counter-productive and leaves us no further along in our understanding than we were in the year 2000. A thematic analysis of the evaluations in the promising practices database provides us with some potential lessons learned. While not strong enough to conclusively answer the question "what works," it provides a snapshot of where there might be some promising practices for future interventions.
We have also learned much about the specificities of conducting monitoring and evaluation in the anti-trafficking sector. The difficulties encountered do not mean that we should not conduct monitoring and evaluation, preferring instead to focus limited resources on interventions and support for survivors. Monitoring, evaluation, and impact assessment are based on the careful specification of aims and objectives for ending human trafficking or addressing a core component of exploitative practices, measuring and monitoring the progress of a particular intervention, and providing a framework for understanding the overall impact of anti-trafficking work. Without these evaluations, it is not possible to target limited resources and provide the means through which to judge whether anti-trafficking efforts are having an impact on those affected, bringing about a strengthening of prevention mechanisms, and ultimately a reduction in prevalence. Based on the review of the methodologies underpinning these evaluations, this paper concludes by highlighting some monitoring and evaluation methods that require further investigation.

Conducting Raising Awareness Campaigns
The use of raising awareness campaigns in anti-trafficking programing is common. They tend to include educating the public on definitions or indicators of human trafficking and how to report incidents, as well as more targeted interventions highlighting risks for specific groups. Of the 90 evaluations in the database, 47 described programs that included some form of raising awareness campaign. Many of these evaluations conclude that raising awareness campaigns had mixed results, but that in order to be effective, campaigns need to be targeted to particular groups, have a clear message, and be adapted to local contexts.
For example, a 2006 evaluation of International Organization for Migration (IOM)'s information campaigns in Cambodia made limited use of pre- and post-assessment and a control group. It found that awareness levels increased as a result of a mass information campaign and village-based activities, in that participants were better able than control groups to repeat the main messages from the campaign and to recite the hotline number and the penalties associated with trafficking crimes. However, the evaluation also found that there was limited understanding of what trafficking constitutes and highlighted the need for a more user-friendly trafficking definition (Sainsbury, 2006). A 2014 evaluation of UN Women's anti-human trafficking program in India also concluded that the program's raising awareness campaign had mixed results, with some target audiences reporting that they had received the information, and others reporting that they had not (UN Women, 2014). A review of MTV Exit, a large-scale multi-media campaign designed to raise awareness of human trafficking in the Asia-Pacific region, revealed that programs "need to have clear and targeted messaging with a call to action and that promotes behaviour change" (Skuse & Downman, 2012). A 2012 evaluation of an anti-trafficking campaign in Kosovo also determined that there was a slight increase in knowledge on human trafficking, and that more information should be targeted to the affected groups and communities in a clear and simple manner (Kuneviciute, Waeyenberge, Viaene, & Moens, 2012).
The literature and an evaluation conducted in 2016 support these findings. A 2017 review of 55 demand-side raising awareness campaigns (5% of which had been evaluated) found that these campaigns needed to go beyond awareness raising to target specific behaviors. It concluded that campaigns only have a positive impact on reducing exploitation and human trafficking in combination with other interventions, if at all (Vogel & Cyrus, 2017). The 2016 RCT of awareness raising campaigns in Nepal revealed more positive findings, concluding that human trafficking awareness campaigns increase the ability of respondents to self-identify as having been trafficked and to recognize the occurrence of human trafficking among family and friends (Archer et al., 2016). The campaigns that were subject to the RCT were highly adapted to the Nepalese context; before the campaign was implemented, the team researched the different media types and respondent access in Nepal, interviewed NGOs that had conducted similar campaigns in the past, gathered stories through source interviews, and piloted the materials (Archer et al., 2016).

Strengthening Legislation
Forty-six evaluations described programs that included elements of technical support to governments to strengthen legislation, among other outcomes. Many of these evaluations highlighted the importance of local ownership and embedding trafficking interventions in existing national structures. UNODC's global programme, GLOT55, aimed to promote the UN Trafficking Protocol in 55 countries by providing assistance to selected states to strengthen their legal frameworks; conducting national and regional trainings; improving support for victims; and supporting raising awareness activities (Marshall & Berman, 2013). The evaluation commended the program's flexibility, which allowed countries to build national ownership of the response to trafficking and highlighted that the program did support many countries to begin the process of drafting national legislation. However, there was no concrete change in the number of ratifications or the passage of national legislation during the program's implementation. Further, the global design prevented the adaptation of activities by local stakeholders. The importance of local ownership and the necessity for broad consultation with key stakeholders was also highlighted in an evaluation of a UNODC project in South Africa, which was implemented between September 2009 and March 2013. Legislation was not passed during the project timeframe, while there remained uncertainty whether the establishment of a Border Management Agency, one of the activities of the project, would be achieved. External factors, such as hosting of the World Cup in 2010, led to project delays (Plessis, 2013). Comprehensive anti-trafficking legislation was passed in South Africa just a few months later in July 2013 (Act No. 7 of 2013: Prevention and Combating of Trafficking in Persons Act, 2013).
What these evaluations reveal is not unique to these projects. Lack of national ownership, limited consultation with key stakeholders, and the need for longer program timeframes to ensure the sustainability of results were common to a number of evaluation findings in the database.

Supporting Survivors
Central to most anti-human trafficking programing are interventions to identify and provide support to trafficking victims. Fifty-three evaluations in the database include initiatives such as training for front line service providers, case management, medical support, shelters, support groups, a hotline, and longer-term reintegration services such as education and job placement programs.
Fourteen evaluations in the database were tagged as assessing programs that included some form of case management. A significant number of these were implemented in the U.S. (n = 9), followed by the Philippines (n = 2) and the UK (n = 2). Case management is defined as the provision of coordinated and comprehensive services to victims of modern slavery, where that assistance is tailored to an individual's needs, as opposed to a more general, "one size fits all" approach to victim assistance. A 2014 evaluation of services provided to victims of human trafficking in New York used participant observation, review of documents, and interviews with key staff to determine the effectiveness of a trauma-focused approach, concluding that, "being trafficked is in itself a traumatic experience, then a trauma-informed care framework for providing services to traumatised individuals makes sense" (Heffernan & Blythe, 2014). The importance of trauma-focused care was also central to the Oshkiniigikwe program, which provides support services to American Indian and Alaska Native women, adolescent girls and their families. The evaluators conducted interviews with clients at entry point into the program and again six months later to determine that delivering services in culturally sensitive ways and supporting victims to strengthen their sense of identity were an integral part of case management (Pierce, 2012). Finally, a 2014 evaluation of NGO programing in Albania, Kosovo, Serbia, Romania, and Bulgaria highlighted the importance of longer-term support for victim assistance initiatives to allow sufficient time for recovery (Surtees, 2015).

Improving Monitoring and Evaluation of Anti-trafficking Interventions
Reflecting on barriers to monitoring and evaluation and the specific difficulties regarding evaluations of anti-trafficking interventions highlights some interesting approaches to be taken into account when conducting future evaluations of human-trafficking programing.
The ultimate goal of many trafficking programs in the promising practices database is a reduction of prevalence. However, given the difficulties in measuring impact, and the ethical issues surrounding the use of RCTs, this should not be the focus of every anti-trafficking intervention. Nor should we raise the expectation that every project should necessarily reduce prevalence. Tracking progress toward proxy indicators, such as change in behaviors or a reduction in risk factors and strengthening in protective factors, may be more achievable for many projects. Measuring prevalence would then become the remit of impact assessments of programs targeting whole sectors or geographic regions. Alongside this, and for prevention projects, in particular, there is also a need to expand understanding of human trafficking to assess the myriad of contextual factors and underlying systems that combine to cause human trafficking. For example, any impact assessments of labor trafficking projects should look at labor migration policies, social protections, and transparent labor recruitment methods (Kiss & Zimmerman, 2019). This also requires a change in donor behavior to prioritize robust evaluation and see impact evaluation as a tool to assess these broader interventions.
RCTs are now being conducted, as shown by the RCT of raising awareness campaigns in Nepal (Archer et al., 2016), community empowerment interventions in India (Gausman et al., 2016), and forthcoming studies in northern and southern India (The Freedom Fund, 2019). In India, ethical issues were resolved by comparing impact on three cohorts: (1) hamlets where the intervention was considered "mature" (full intervention), (2) where the intervention was not considered "mature", but where work was ongoing (partial intervention), and (3) those where the intervention was limited and the intervention took place toward the end of the study period (comparison group) (Gausman et al., 2016). In other policy areas, RCTs are used in a lagged fashion, so that interventions are implemented at a later time in those areas where one was not completed in the first instance. This is one interesting method which should be applied more rigorously in the anti-trafficking field.
Despite donor preference for quantitative methods in anti-trafficking evaluations (Gallagher & Surtees, 2012), qualitative approaches to evaluation can lead to improved understanding of the effectiveness of projects. Techniques such as observation, participant observation, key informant interviews, and focus groups can give us insight into changes or lack thereof. These techniques can be successful as long as they are applied critically and triangulated with other data sources (Patton, 2002). Mixed method approaches, however, remain the ideal for conducting impact assessments.
Survivors should play a central role in monitoring and evaluation of anti-trafficking programing. Participatory approaches to evaluation are promising and can improve retention rates and act as a vehicle for rehabilitation. Studies on Freedom Fund's programme in Bihar and Uttar Pradesh, for example, utilized life stories and participatory statistics, a participatory method to collect and analyze data, to assess whether the program had made a real contribution to the eradication of modern slavery and bonded labor in the targeted communities (Burns, Oosterhoff, Raj, & Nanda, 2015; Oosterhoff et al., 2016). The participatory model to impact assessment has been generally considered effective in the anti-modern slavery field since it engages with local communities and survivors to better understand the long-term impact of the intervention on their lives (International Council on Human Rights Policy, 2011). A participatory model can "give people affected by slavery a voice about what should be counted and gives them a chance to input into how the survey results could be used for locally relevant action" (Oosterhoff et al., 2016). While promising, it is still important to assess which stakeholders to include in this process, the role of psycho-social support during the evaluation process, and how to provide adequate and uniform training for those who participate.

Conclusion
Drawing this together, nearly 20 years on from the ratification of the UN Trafficking Protocol, we do not have concrete answers to the question "what works" to combat human trafficking. However, there are interesting lessons learned and suggestions on how to strengthen our use of monitoring and evaluation. The quality of anti-trafficking evaluations appears to be improving, with increased use of impact assessment, as part of a toolkit of monitoring and evaluation techniques. Its use, however, only makes sense when evaluating the impact of a significant program or an organizational theory of change. Use of participatory statistics and other participatory models is an important development in tracking the impact of anti-trafficking programing. Qualitative data collection techniques should not be dismissed and can highlight findings when evaluators triangulate data and critically engage with program documents and interviewees. Donors should reflect the importance of monitoring and evaluation by providing for longer timeframes for calls for proposals and program implementation as well as resources to conduct monitoring and evaluation.
Despite the limitations of the methodologies used in anti-trafficking interventions, there are some common findings from existing evaluations that allow us to draw some conclusions regarding what works. The impact of raising awareness campaigns is limited when these are not targeted to specific communities with a clear message. Support for governments to pass legislation can be deemed to have had an impact, but only when there is national ownership and sufficient time allocated to reflect the length of time it takes to implement legislative and policy change. Support for victims is effective when it is victim-centered, applies a trauma-focused lens and prioritizes the sense of identity of the victim.
With an estimated 40.3 million people in some form of modern slavery in the world and a global commitment to end human trafficking and modern slavery by 2030, it is important to understand "what works" in the fields of anti-trafficking. Across anti-trafficking work, real positive change needs to be demonstrated through effective use of baseline data, the setting of real and deliverable outcomes that can be measured in some way, and a framework for analyzing the contribution and direct impact of purposive interventions designed to combat human trafficking and modern slavery. In the absence of tangible evidence and well thought out programs that can be evaluated, anti-trafficking and anti-modern slavery interventions will struggle in their efforts to end modern slavery in all its forms and to maintain crucial support from a variety of donors and funders within a highly competitive environment.
-Child marriage OR
-Sale or exploitation of children OR
-Use of child soldiers OR
-Child labor OR
-Prostitution OR
-Refugees OR
-Internally displaced persons OR
-Female genital mutilation OR
-Safe migration OR
-Labor migration
A total of 1,787,748 sources were identified from these searches, largely due to the number of sources identified through internet searches. For internet searches that revealed a large number of evaluations, the team reviewed the first 10 pages for relevant evaluations. From all of our searches, a total of 410 evaluations were included in Endnote. These were then reviewed for duplications, leaving 344 evaluations in the library.
After further review, another 165 were then removed, leaving a total of 179. These were removed because they were descriptions of programs, were a literature review, were lists of good practice determined by other organizations, were mid-term evaluations, were formative (or pre-assessment) evaluations, were not in English, were a summary of a larger document which was not available, were annual reports rather than evaluations, or did not include an explicit methodology of how the evaluation was conducted. Explicit methodologies were defined as the inclusion of a methodology section or a description of the actions taken by the evaluation team; those evaluations missing this were also excluded. Lists of good practice determined by other organizations were removed because their criteria for what constituted a good practice were inconsistent with, or contradicted, those used in this work. Systematic reviews of evaluations, or reviews of reviews, were also excluded from the final database. However, systematic reviews, literature reviews, and annual reports were used to inform the drafting of individual policy papers where interesting lessons could be drawn.
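The screening counts reported above can be checked with a short calculation. This is only a sketch for verifying the arithmetic; the variable names are hypothetical, and the figures are those stated in the text.

```python
# Counts as reported in the text; variable names are illustrative only.
included_in_library = 410     # evaluations initially included in Endnote
after_deduplication = 344     # evaluations left after removing duplicates
removed_on_review = 165       # evaluations excluded on further review

duplicates_removed = included_in_library - after_deduplication
remaining = after_deduplication - removed_on_review

print(f"duplicates removed: {duplicates_removed}")
print(f"evaluations remaining after review: {remaining}")

# The remaining count should match the total of 179 reported in the text.
assert remaining == 179
```

The calculation confirms that removing 165 evaluations from the 344 left after de-duplication gives the reported total of 179, and shows that 66 of the original 410 entries were duplicates.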