Method for Observing pHysical Activity and Wellbeing (MOHAWk): validation of an observation tool to assess physical activity and other wellbeing behaviours in urban spaces

ABSTRACT Direct observation of behaviour offers an unobtrusive method of assessing physical activity in urban spaces, which reduces biases associated with self-report. However, there are no existing observation tools that: (1) assess other behaviours that are important for people’s wellbeing beyond physical activity; (2) are suitable for urban spaces that typically have lower numbers of users (e.g. amenity green spaces) or that people pass through (e.g. green corridors); and (3) have been validated in Europe. MOHAWk (Method for Observing pHysical Activity and Wellbeing) is a new observation tool for assessing three levels of physical activity (Sedentary, Walking, Vigorous) and two other evidence-based wellbeing behaviours (Connect: social interactions; Take Notice: taking notice of the environment) in urban spaces. Across three studies, we provide evidence that MOHAWk is reliable and valid from 156 hours of observation by six observers in five urban spaces in the UK. MOHAWk can be used in policy or practice (e.g. by local authorities or developers), or in more formal institutional based research projects. This new tool is an inexpensive and easy-to-use method of generating wellbeing impact evidence in relation to the urban physical or social environment. A manual providing detailed instruction on how to use MOHAWk is provided.


Background
Physical activity provides many health benefits for all age groups (UK Department of Health 2011). Despite this, most of the world's population is not sufficiently active to gain significant health benefits (Lee et al. 2012). Characteristics of the urban environment (e.g. green space, street design) can influence population levels of physical activity (Bauman et al. 2012). Two thirds of the world's population are predicted to be living in urban areas by 2050 (United Nations 2015). This growing urbanisation highlights the importance of understanding how the urban environment can facilitate (or inhibit) physical activity.
Systematic observation (i.e. direct observations of behaviour using predetermined criteria) is one method for assessing physical activity that offers many advantages. Systematic observation is an unobtrusive method; that is, participants are generally unaware that their behaviour is being assessed. This reduces possible reactivity of measurement associated with self-report and device-based measures of physical activity (French and Sutton 2010), reducing the risk of social desirability and recall bias. Systematic observation is not susceptible to poor response rates associated with self-reported measures (Benton et al. 2016a), reducing the risk of selection bias. Systematic observation provides contextually rich data by assessing behaviour directly in the environment of interest. Assuming ethical approval, observations can be carried out in almost any publicly accessible urban environment.
There are several existing observation tools designed to assess physical activity in urban environments (e.g. Mckenzie et al. 1991, McKenzie et al. 1992, 2000, 2006, Gehl and Svarre 2013, Gehl Institute 2017, Suminski et al. 2019). The most widely used validated observation tool for assessing physical activity in outdoor urban environments is the System for Observing Play and Active Recreation in Communities (SOPARC) (McKenzie et al. 2006). SOPARC uses momentary observational scans to assess the characteristics and physical activity behaviours of people in the area being observed. It has provided valuable data on a range of parks in various settings and populations (Evenson et al. 2016).
However, there are three key reasons why a new observation tool is needed. First, SOPARC only assesses physical activity behaviour; it therefore does not capture other behaviours undertaken by people in urban environments, such as social interactions (Helliwell and Putnam 2005). Second, SOPARC has only been validated in neighbourhood and state parks (McKenzie et al. 2006, Cohen et al. 2011, Whiting et al. 2012). SOPARC uses single observational scans that only last several seconds, which are less likely to capture valid samples in urban spaces that typically have lower numbers of users (e.g. amenity green spaces) or that people pass through (e.g. residential streets). Third, SOPARC was initially developed in California, United States (US), and has since only been validated in the US. Although SOPARC has been used around the world, it has predominantly been used in the US; across two recent reviews of studies using SOPARC, only two studies were conducted in Europe (Turkey and Belgium), compared to 23 unique studies in the US (Evenson et al. 2016, Joseph and Maddock 2016). This is an important issue because the US is different to many European cities in terms of contextual variables that affect physical activity (e.g. population density, city design, climate, population characteristics) (Tucker and Gilliland 2007, Sallis et al. 2016); therefore, features of the tool may be unsuitable for settings dissimilar to the US. One study has used SOPARC in the United Kingdom (UK) (Gidlow et al. 2010); however, the researchers had to modify the tool in several ways (e.g. used continuous scanning throughout the observation period, modified demographic categories) and the modified tool is therefore of uncertain psychometric properties.
Existing observation tools have been used to assess physical activity. However, observation tools could be used to assess a wider range of behaviours that are known to influence wellbeing, beyond physical activity. Such behaviours have been identified in the 'Five Ways to Wellbeing' (New Economics Foundation 2008). On behalf of the UK Government's Foresight programme, New Economics Foundation (NEF) conducted a review of the wellbeing literature. They identified five behaviours for which there is evidence that engaging in these behaviours improves an individual's wellbeing, known as the 'Five Ways to Wellbeing' (or 'Five Ways'): Be Active (engage in physical activity); Connect (socially interact with others); Take Notice (be aware of the environment); Keep Learning (acquire knowledge or skill in something new); and Give (contribute to the community). Since there is evidence that each of the Five Ways behaviours is associated with improved wellbeing, including both hedonic and eudaimonic wellbeing (Helliwell and Putnam 2005, New Economics Foundation 2008, Dolan et al. 2008, McEwan et al. 2019), each of these behaviours can be used as an indicator of wellbeing (hereafter referred to as 'wellbeing behaviours'). Three of the Five Ways behaviours (Be Active, Connect, Take Notice) can be observed and are relevant to urban environments.
We report here on the development and formal testing of a newly developed observation tool: Method for Observing pHysical Activity and Wellbeing (MOHAWk). An early version of this tool was used in a recent study that evaluated the impact of small-scale pocket park improvements in Manchester, UK (Anderson et al. 2017). The researchers observed significant increases in wellbeing behaviours assessed using MOHAWk at 1-year follow-up compared to a matched comparison site, thus demonstrating the feasibility of using this tool and providing evidence of sensitivity to change.

Purpose of MOHAWk
MOHAWk is an observation tool for assessing three levels of physical activity (Sedentary, Walking and Vigorous) and two other wellbeing behaviours (Connect: social interactions; Take Notice: taking notice of the environment) in urban spaces. It also measures the total number of people, their characteristics (gender, age group, ethnicity), and the presence of incivilities in the environment where observations are carried out (e.g. graffiti, broken glass). MOHAWk has been designed to be used in a wide variety of urban spaces, particularly spaces that typically have lower numbers of users or that people pass through; examples of which may include residential streets, amenity green spaces, green corridors, pocket parks, and urban squares.
MOHAWk is different from existing validated observation tools in three key ways: (1) MOHAWk assesses two other wellbeing behaviours that are relevant to the use of urban spaces (social interactions and taking notice of the environment), not just physical activity; (2) MOHAWk observations occur continuously throughout the observation period, rather than using a series of single observational scans; and (3) MOHAWk observations are carried out regardless of weather conditions, rather than cancelling observations during inclement weather (a sensitivity analysis, or including weather as a covariate, can control for the confounding influence of weather).
MOHAWk is freely available for use. The tool consists of an instruction manual, a standardised observation form, and a data summary form, all of which are provided in Supplementary files 1, 2 and 3. An overview of MOHAWk and procedures for using the tool are summarised in Figure 1.

Aims of present research
The present paper reports on three studies that aimed to develop MOHAWk and test for evidence of reliability and validity. The specific aims of these studies were to: (1) assess inter-rater reliability for observing people's characteristics, physical activity levels, and additional wellbeing behaviours (Connect; Take Notice); (2) explore the reliability of shortened observation schedules; and (3) test for evidence of criterion-related validity of observing Take Notice behaviours.

Study 1
MOHAWk was used in two sites in central Manchester, UK: All Saints Park (Site 1A) and St Peter's Chaplaincy (Site 1B) (see Figure 2). Site 1A is a small park (~0.9 hectares) that is surrounded on three sides by Manchester Metropolitan University.
Site 1B is a non-green urban square located near The University of Manchester.
One observer (MP) used MOHAWk at these two sites across eight weekdays (Thursdays and Fridays) over four weeks during March 2017. Observation periods lasted two hours (8-10am and 2-4pm), but data were recorded in 15-minute blocks within each observation period to allow investigations of patterns of data within each observation period. Observations were fully counterbalanced between the two sites to control for week, day of week and time of day. To assess inter-rater reliability, on four days of observation, a second observer (DF or JA) independently conducted observations alongside MP at the same site.

Overview of MOHAWk
MOHAWk uses continuous scanning to record the characteristics and behaviours of each person entering a predetermined boundary ('target area') during hour-long time periods ('observation periods'). Data are recorded using pen and paper. Observers use a standardised observation form (provided in Supplementary file 2) to record the following information for each person that enters the target area during each observation period: gender (Male or Female), age group (Infant, Child, Teen, Adult or Older Adult), ethnicity (White or Non-white), physical activity level (Sedentary, Walking, Vigorous), social interaction (Connect or No Connect), taking notice of the environment (Take Notice or No Take Notice), activity type (Cycling, Using phone, Dog walking, or other predetermined activities), and if mobility assistance is required (Yes or No).
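For readers who transcribe the paper forms for analysis, the fields listed above could be held in a small record type. The following is a minimal sketch only: `PersonRecord`, `CODES` and `summarise_period` are hypothetical names that are not part of MOHAWk, and the free-text handling of activity type is an assumption.

```python
from collections import Counter
from dataclasses import dataclass

# Allowed codes, copied from the observation-form fields described above.
CODES = {
    "gender": {"Male", "Female"},
    "age_group": {"Infant", "Child", "Teen", "Adult", "Older Adult"},
    "ethnicity": {"White", "Non-white"},
    "activity_level": {"Sedentary", "Walking", "Vigorous"},
}


@dataclass(frozen=True)
class PersonRecord:
    """One observed person in one observation period (hypothetical layout)."""
    gender: str
    age_group: str
    ethnicity: str
    activity_level: str
    connect: bool              # True = Connect, False = No Connect
    take_notice: bool          # True = Take Notice, False = No Take Notice
    activity_type: str = ""    # e.g. "Cycling"; kept as free text here (assumption)
    mobility_assistance: bool = False

    def __post_init__(self):
        # Reject any value that is not one of the form's predefined codes.
        for field_name, allowed in CODES.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {sorted(allowed)}")


def summarise_period(records):
    """Count people and behaviours for one observation period."""
    counts = Counter()
    for r in records:
        counts["total"] += 1
        counts[r.activity_level] += 1
        counts["Connect"] += r.connect          # bool adds as 0 or 1
        counts["Take Notice"] += r.take_notice
    return counts
```

Counts per observation period produced this way match the unit of analysis used in the paper, so they can be exported directly to a statistics package.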
The presence of the following 'incivilities' in the urban environment is also recorded: general litter, evidence of alcohol use (empty bottles/cans), evidence of drug taking (e.g. needles, syringes), graffiti, broken glass, vandalism, dog mess, noise. These items were taken from an existing validated tool for assessing the quality of neighbourhood green space (Gidlow et al. 2012). A wider range of tools for auditing the physical environment are available elsewhere for a variety of purposes, such as assessing the quality of parks (e.g. EAPRS (Saelens et al. 2016)), public open spaces (e.g. POST (Broomhall et al. 2004)), and pedestrian streetscape (e.g. MAPS (Millstein et al. 2013)).
Observations are carried out regardless of weather conditions, unless weather conditions become so extreme that they compromise the observer's safety. To control for potential bias associated with weather, a sensitivity analysis is carried out to assess the impact of weather; specifically, precipitation. The duration of any precipitation that occurs during an observation period is recorded by the observer. Observation periods are removed for the sensitivity analysis (or alternatively included as a covariate) if the accumulated duration of any precipitation lasts for 50% or more of the observation period i.e. 30 minutes or more. Due to a lack of data on how precipitation affects people's behaviour in urban spaces, this threshold seemed a reasonable cut-off to account for the high variability in frequency and intensity of precipitation that can occur during observation periods.
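The precipitation rule above can be expressed as a one-line check. This is a sketch under the stated 50% threshold; the function name and the input validation are illustrative assumptions rather than part of the MOHAWk manual.

```python
def flag_for_sensitivity_analysis(precip_minutes, period_minutes=60):
    """Return True if precipitation lasted for 50% or more of the period.

    precip_minutes: accumulated duration of precipitation recorded by the observer.
    period_minutes: length of the observation period (60 by default).
    """
    if not 0 <= precip_minutes <= period_minutes:
        raise ValueError("precipitation duration must lie within the period")
    return precip_minutes >= 0.5 * period_minutes
```

Flagged periods can then either be dropped in a sensitivity analysis or carried into a model as a binary covariate.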
MOHAWk data can be inputted into a statistical software program (e.g. SPSS, Excel) for analysis.

Coding Take Notice and Connect behaviours
Take Notice behaviours occur when individuals stop or slow down, and appear as if they are making a conscious decision to appreciate their surroundings. Examples of this include extended viewing of the scenery, an intentional pause in activity to look at or photograph something in the vicinity, or a pronounced head swivel to look at a specific object, view or person. Connect behaviours occur when individuals are engaging or interacting with a person or the people around them in some way. The activity must involve either conversing with other users (e.g. talking and listening, using sign language), being physically linked with someone (e.g. holding hands, linking arms), smiling and making eye contact when passing, or participation in a group activity.

Figure 1. An overview of MOHAWk and procedures for using the tool (Saelens et al. 2016, Broomhall et al. 2004, Millstein et al. 2013).

Data were collected over a total of 32 hours, including 16 hours using two observers simultaneously. The differences between the two sites (e.g. benches, vegetation) permitted testing of criterion-related validity of observing Take Notice behaviours i.e. whether there are significantly higher observed counts of Take Notice behaviours in Site 1A, where there are more opportunities for Take Notice behaviours, compared to Site 1B.
The unanticipated presence of a temporary statue within Site 1A (Figure 2) on two days of observations also permitted further criterion-related validity testing, by assessing whether there are more Take Notice behaviours on the two days with the statue compared to the other six days without the statue in the same site.

Study 2
Study 2 was a feasibility study for a natural experimental study of changes to urban green spaces on older adults' physical activity and wellbeing (Benton et al. 2018b). In terms of the development of MOHAWk, this study had five aims: (1) assess inter-rater reliability; (2) test several modifications to the MOHAWk tool following Study 1, including refined coding procedures (e.g. recording precipitation), refined age group categories, and other new codes (e.g. mobility assistance); (3) determine how many days of observation per week and hours per day are needed to provide a reliable estimate of activity in a UK urban environment; (4) determine what times of the day observations should be carried out to capture variation in activity across the course of a day; and (5) explore differences in activity patterns on weekdays compared to weekends.
Two observers (JB, SK) used MOHAWk at the same time at two separate residential streets (adjacent to small amenity green spaces) in South Manchester during July 2017 (Figure 3). One site was a residential street where changes in the aesthetic quality of green space were planned but had not yet been implemented at the time of observations (Site 2A). The other site was a residential street in the same neighbourhood, but no such changes were planned (Site 2B).
Observations were conducted 8am-6pm in 50-minute observation periods (e.g. 8-8.50 am, 9-9.50 am etc.). Observation periods lasted 50 minutes, rather than one hour, to provide a 10-minute break for each observer every hour. Data were recorded into three 15-minute blocks and one 5-minute block within each observation period to allow investigations of patterns of data within each observation period. On the first two days (Thursday and Friday), both observers independently conducted observations at Site 2A at the same time to assess inter-rater reliability. Then, one observer (JB) conducted observations for seven consecutive days from Saturday to Friday in Site 2A. At the same time, the second observer (SK) conducted observations at Site 2B for five consecutive days from Monday to Friday.
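When transcribing time-stamped entries, the block structure described above (three 15-minute blocks followed by one 5-minute block) could be applied as follows. This is an illustrative sketch; `block_index` is a hypothetical helper, and recording the minute offset of each entry is an assumption about how the form is filled in.

```python
from itertools import accumulate


def block_index(minute, block_lengths=(15, 15, 15, 5)):
    """Return the 0-based block containing `minute`.

    minute: whole minutes elapsed since the start of the observation period.
    block_lengths: block durations in minutes; the default mirrors the
    50-minute periods used in Study 2.
    """
    if minute < 0:
        raise ValueError("minute must be non-negative")
    # accumulate() yields the end minute of each block: 15, 30, 45, 50.
    for index, block_end in enumerate(accumulate(block_lengths)):
        if minute < block_end:
            return index
    raise ValueError("minute falls outside the observation period")
```

Passing `block_lengths=(15, 15, 15, 15)` or twelve 5-minute blocks would cover the hour-long layouts used in the other studies.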
Both sites were similar apart from the following key differences at Site 2B: two benches, a litter bin, and more diverse vegetation. Site 2B was also rated by two observers (JB, SK) as being more aesthetically pleasing and better maintained than Site 2A using a validated tool for measuring the quality of neighbourhood green space (Gidlow et al. 2012). These differences between Site 2A and 2B permitted testing of criterion-related validity: whether there are significantly higher observed counts of Take Notice behaviours in Site 2B, where there are more opportunities for Take Notice behaviours, compared to Site 2A.
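A between-site comparison of this kind is analysed in the paper with a Mann-Whitney U test. The sketch below shows one way to compute it using only the Python standard library, assuming a two-sided test with a tie-corrected normal approximation and no continuity correction; the paper does not state these details, so they are assumptions, and a statistics package may give slightly different p-values for small samples. The function name is hypothetical.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value from a normal approximation)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which the x value exceeds the y value; ties count half.
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    n = n1 + n2
    # Tie correction: sum of t^3 - t over groups of equal values in the pooled data.
    tie_term = sum(t ** 3 - t for t in Counter(list(x) + list(y)).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        raise ValueError("all observations are identical")
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_two_sided
```

Here `x` and `y` would be the Take Notice counts per observation period (or per block) at each site. Count data with many zeros produce heavy ties, which is why the tie correction is included.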

Study 3
The aims of Study 3 were to: (1) assess inter-rater reliability; (2) test new coding procedures for recording Take Notice behaviours to improve the reliability of observing these behaviours; (3) test whether MOHAWk can accurately capture Take Notice and Connect behaviours whilst simultaneously collecting other data; and (4) use MOHAWk outside of Manchester i.e. Belfast, Northern Ireland.
Two observers (JB, CC) independently used MOHAWk at the same time for two consecutive days (Monday, which was a national bank holiday, and Tuesday) during August 2018. Observations were carried out at C.S Lewis Civic Square (Figure 4): a civic square in east Belfast, located at the intersection of the Connswater and Comber Greenways. This site contains public art (seven bronze art sculptures), a coffee bar, several seating areas, and green space. Observations were conducted using hour-long observation periods between 8am-4pm. Data were recorded in 5-minute blocks within each observation period to allow investigations of patterns of data within each observation period. Data were collected over 8 hours, all of which were conducted using two observers simultaneously.
On the Monday, both observers used MOHAWk between 10am and 12pm to assess inter-rater reliability. Take Notice behaviours (e.g. an intentional pause in activity to look at or photograph something in the vicinity) are momentary behaviours and thus more difficult to observe, particularly whilst recording other data. Therefore, in the afternoon (1-3pm), both observers used MOHAWk as normal. However, one observer coded Take Notice behaviours only when people were stationary, whilst the second observer coded all Take Notice behaviours regardless of whether people were stationary or not. This was to test whether coding Take Notice behaviours only when people are stationary improves inter-rater reliability for observing Take Notice behaviours.

On the Tuesday morning (8-9am and 10-11am), one observer used MOHAWk as normal, whilst the second observer only recorded Connect behaviours, gender and age group for each person. Similarly, in the afternoon (12-1pm and 2-3pm), one observer used MOHAWk as normal, whilst the second observer only recorded Take Notice behaviours, gender and age group for each person.

Observer training
There were six unique observers across the three studies (JB, MP, JA, DF, SK, CC). The majority of observers were aged between 18-34 years old (n = 4), and two observers were aged between 35-50 years old. There were four male observers and two females. All observers were physically active according to World Health Organisation criteria (World Health Organization 2010).
Study 1 focused on developing the observation procedures used in a recent study (Anderson et al. 2017), which meant there was limited formal training. An instruction manual was developed as a result of Study 1. This instruction manual can be found in Supplementary file 1, which provides detailed descriptions of MOHAWk procedures and coding conventions (e.g. how to distinguish between Walking and Vigorous behaviours, how to distinguish between age groups based on gait, clothing, and other physical attributes etc.).
Observers in Studies 2 and 3 were formally trained by JB using the MOHAWk instruction manual and by practising observations in the study sites. Training focused on becoming familiar with the operational definitions, key coding conventions, how to use the observation form, and how to code site incivilities. All observers in Studies 2 and 3 received at least three hours of training and practice observations with feedback and inter-rater reliability assessments. The aim of training was to achieve inter-rater reliability of at least 0.75 (intraclass correlation coefficient (ICC)) for assessing the total number of people, each behaviour (Sedentary, Walking, Vigorous, Take Notice, Connect) and each participant characteristic (gender, age group, ethnicity). Any discrepancies between observers were resolved by discussion. Before each study, observers agreed on the boundaries of the target area in which all observed individuals were recorded.

Table 1 contains a summary of the methods and analyses used to address each aim across the three studies. The unit of analysis for all analyses is at the level of the observation period i.e. counts per observation period.

Reliability of shortened observation schedules
Two-way mixed, single measure, consistency ICCs were used to calculate the average reliability of overall daily counts of observed people at Site 2A and 2B for different abbreviated schedules. Specifically, ICCs were calculated for all possible abbreviated observation schedules across a week: combinations of 1, 2 or 3 days per week compared to the full 5 days per week (weekdays only). In the same way, ICCs were calculated for all possible abbreviated observation schedules across a day: combinations of 2, 3, or 4 hours per day compared to the full 10 hours per day. Two hours a day collection was defined as 1 hour in the first half of the day (between 8 am-1 pm) and 1 hour in the second half of the day (between 1-6 pm); three times a day was defined as morning (8 am-12 pm), early afternoon (12-3 pm) and late afternoon/early evening (3-6 pm); and four times a day was defined as early morning (8-10 am), late morning (10 am-12 pm), early afternoon (12-3 pm) and late afternoon/ early evening (3-6 pm). These analyses were conducted separately for each age group: children, teens, adults and older adults.
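To make the calculation concrete, the sketch below computes a two-way mixed, single measure, consistency ICC, often written ICC(3,1), and averages it over every abbreviated weekly schedule of a given size. How each abbreviated schedule was paired with the full schedule is not spelled out in the text, so the pairing used here (mean count per time slot over the retained days versus the mean over all days) is an assumption, as are the function names.

```python
from itertools import combinations
from statistics import mean


def icc_consistency(rows):
    """ICC(3,1) for rows of ratings (one row per unit, one column per measure)."""
    n, k = len(rows), len(rows[0])
    grand = mean(v for row in rows for v in row)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in rows)
    ss_cols = n * sum((mean(col) - grand) ** 2 for col in zip(*rows))
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    # Consistency form: systematic differences between columns are ignored.
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def mean_icc_for_schedule(counts, days_kept):
    """Average ICC over all schedules that keep `days_kept` days.

    counts[i][d] is the number of people observed in time slot i on day d.
    """
    n_days = len(counts[0])
    full = [mean(row) for row in counts]
    iccs = []
    for subset in combinations(range(n_days), days_kept):
        short = [mean(row[d] for d in subset) for row in counts]
        iccs.append(icc_consistency(list(zip(short, full))))
    return mean(iccs)
```

The same `icc_consistency` function applies to hours within a day by letting the columns of `counts` index hours instead of days. It divides by zero if every value in `rows` is identical, so constant data need to be handled separately.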

Criterion-related validity
A Mann-Whitney U test was used to compare the number of Take Notice behaviours per observation period between: (i) Site 1A and Site 1B; (ii) two days when there was a temporary statue at Site 1A

Patterns of activity
Two-way mixed, single measure, consistency ICCs were used to calculate the consistency in overall counts of people on weekdays compared to the weekend. Table 2 displays descriptive statistics for all three studies. Supplementary file 4 reports on baseline data from three separate natural experimental studies that recently used MOHAWk (Benton et al. 2018a, 2018b). Supplementary file 5 contains details of several key refinements that were made during each study to improve the reliability, validity, and usability of MOHAWk.

Inter-rater reliability
Across the three studies, inter-rater reliability between pairs of observers was mostly 'good' or 'excellent', with a small

In Study 3, inter-rater reliability was 'good' for recording Take Notice behaviours when one observer only recorded Take Notice behaviours, gender and age group for each person, whilst the second observer used MOHAWk as normal (ICC = 0.80). Inter-rater reliability was 'excellent' for recording Connect behaviours when one observer only recorded Connect behaviours, gender and age group for each person, whilst the second observer used MOHAWk as normal (ICC = 0.97).

Table 3 displays ICCs for all shortened schedules for each age group. On average, observing on one day a week can produce good consistency approaching that obtained by observing five days a week for adults. For teens and older adults, observing on two days a week can produce good consistency approaching that obtained by observing five days a week. For children, observing on three days a week can produce good consistency approaching that obtained by observing five days a week.

Reliability of shortened observation schedules
On average, observing for two hours a day can produce good consistency approaching that obtained by observing 10 hours a day for adults. For teens and older adults, observing for three hours a day can produce good consistency approaching that obtained by observing 10 hours a day. For children, more than four hours a day is required to produce good consistency approaching that obtained by observing 10 hours a day.
In Study 2, there were more Take Notice behaviours observed per 15-minute block in Site 2B (median = 2, IQR = 3) compared to Site 2A (median = 0, IQR = 1) (p < 0.001). Figure 5 displays patterns of the total number of people across each hour of the day for each age group at Site 2A in Study 2, comparing the average weekday, Saturday and Sunday. There was poor consistency between the average weekday and Saturday (children: ICC = 0.28, teens: ICC = 0.46, adults: ICC = 0.46, older adults: ICC = 0.43) or Sunday (children: ICC = 0.21, teens: ICC = 0.33, adults: ICC = 0.31, older adults: ICC = 0.45) for each age group.

Discussion
These three studies indicated that MOHAWk is a reliable and valid observation tool. There was high agreement between pairs of observers for recording people's characteristics and their behaviours when using MOHAWk. There was evidence that shortened observation schedules can provide reliable estimates of people using urban spaces. In addition, there was evidence of criterion-related validity of observing Take Notice behaviours. We have provided extensive normative data (means, standard deviations, total counts) on all wellbeing behaviours in a variety of urban spaces to inform sample size calculations when using MOHAWk in natural experimental studies of urban environments (see Table 2 and Supplementary file 4).

Inter-rater reliability
There was good or excellent agreement between pairs of observers (ICC > 0.75) for 93% of observed behaviours and characteristics using six unique observers across three studies. This high agreement suggests that different observers can use MOHAWk and still produce very similar data, thus allowing reliable evaluation of multiple urban spaces at the same time. Inter-rater reliability remained high even when observing in busy urban spaces (e.g. there was an average of 118 observed people per hour in Study 2), suggesting that MOHAWk is robust enough to withstand busy urban spaces. We recommend a minimum of one full day (i.e. eight hours) of training and practice. However, the exact amount of training and practice required will depend on numerous factors, such as previous experience of the observer(s) in using MOHAWk and how busy the target areas are likely to be.
Maintaining high inter-rater reliability was an important consideration when developing the tool; for example, we reduced ethnicity codes into two categories to make it easier for observers to accurately record multiple wellbeing behaviours. Study 3 demonstrated that observers can still achieve high inter-rater reliability when recording the additional Connect (ICC = 0.97) and Take Notice (ICC = 0.80) behaviours. However, agreement on recording Take Notice behaviours tended to be lower than agreement on recording Connect and physical activity behaviours (Sedentary, Walking, Vigorous). This is likely because Take Notice behaviours are more momentary than Connect and physical activity behaviours, which makes Take Notice behaviours harder to observe; for example, someone pausing to look at something in the vicinity (Take Notice) is typically more momentary than someone holding hands (Connect) or cycling (Vigorous). Therefore, training should focus more on improving inter-rater reliability for recording Take Notice behaviours.

Reliability of shortened observation schedules
Study 2 showed that shortened observation schedules can provide reliable estimates of people using an urban space across a week and across a day, albeit not for children. These results are in line with a previous study that found shortened observation schedules using SOPARC can provide reliable estimates of park usage in the US (Cohen et al. 2011). This provides increased confidence that shortened observation schedules can provide reliable data, therefore reducing the time and cost required for observations. As a general guide, observing at least four hours a day, two days a week is recommended, although other schedules are also reliable (see Table 3).

Criterion-related validity
An important difference between MOHAWk and existing observation tools is the addition of two other wellbeing behaviours in MOHAWk: socially interacting with others (Connect) and taking notice of the environment (Take Notice). There is evidence that these additional wellbeing behaviours being assessed are valid and meaningful. Studies 1 and 2 demonstrated that there were significantly higher observed counts of Take Notice behaviours in sites that were hypothesised as offering more opportunities for Take Notice behaviours e.g. 'greener' and more aesthetically pleasing sites. This suggests that MOHAWk is accurately capturing Take Notice behaviours, since observed behaviours were in line with what we would expect to observe. Further, the frequencies of Take Notice and Connect behaviours varied between different sites and in different age groups (see Table 2), as well as between weekdays and weekends, which suggests that MOHAWk is sensitive to change and can therefore be used to measure the effect of interventions. These data build on those from an early version of MOHAWk (Anderson et al. 2017), where the researchers observed significant increases in wellbeing behaviours assessed by MOHAWk after one year following an urban pocket park intervention i.e. demonstrating sensitivity to change.
The codes for observing physical activity in MOHAWk (Sedentary, Walking, Vigorous) are based on previous observation tools (Mckenzie et al. 1991, McKenzie et al. 1992, 2000, 2006), which have been validated using heart rate monitors (Mckenzie et al. 1991), pedometers (Rowe et al. 2004) and accelerometers (McKenzie et al. 1994); this suggests the physical activity codes used in MOHAWk are valid.

How MOHAWk compares to existing observation tools
To the best of our knowledge, there are no existing observation tools validated for use in urban spaces that typically have lower numbers of users (e.g. amenity green spaces) or that people pass through (e.g. residential streets). MOHAWk uses continuous scanning to count all people and their activities during one-hour observation periods, thus capturing activity in urban spaces which have lower levels of use that could not be captured by single observational scans used in existing observation tools. This is important because MOHAWk is more likely to produce larger sample sizes and thus better powered studies that require fewer observations due to increased sensitivity.
We have found evidence that MOHAWk is valid in two UK cities. There are no existing observation tools validated for use in Europe, and the vast majority of studies using existing observation tools have been conducted in the US (Evenson et al. 2016, Joseph and Maddock 2016). This is an issue because the US is different to many European cities in terms of key variables that influence people's use of urban environments; for example, urban sprawl is much more prominent in the US compared to Europe (Patacchini et al. 2009). Therefore, it is unclear whether existing observation tools are valid outside the US.
Existing tools, such as SOPARC, recommend that observations are not carried out during inclement weather. SOPARC was developed in California (US), which has a climate characterised mainly by mild-to-hot and dry weather. However, it is impractical and costly to use SOPARC procedures and rearrange observation periods during inclement weather in cities that have much higher levels of rainfall, such as Manchester and Belfast (The World Bank 2019). The many issues of rearranging observation periods due to weather (e.g. ambiguous weather forecasts) have been discussed elsewhere (Veitch et al. 2017). To address this issue in MOHAWk, observations are carried out regardless of weather conditions, and the confounding influence of precipitation is handled either by a sensitivity analysis that removes the affected observation periods or by including precipitation as a covariate. Whilst precipitation did not affect any of the observation periods in the three studies reported here, a recent natural experimental study that used MOHAWk in the UK had 50 hours of observations (out of 264 hours) that were removed for the sensitivity analysis due to precipitation (Benton et al. 2018b). Rearranging 50 hours of observation would have been costly and would have affected the design of the study, thus potentially introducing bias.
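As an illustration only, the precipitation sensitivity analysis described above could be sketched as follows. The site names, counts and record layout are hypothetical and are not data from the present studies; the sketch simply compares mean hourly counts per site with and without the observation periods in which precipitation was recorded.

```python
from statistics import mean

# Hypothetical one-hour observation periods (illustrative values only):
# (site, number of people observed, precipitation recorded during the period)
periods = [
    ("site_a", 42, False),
    ("site_a", 18, True),
    ("site_a", 39, False),
    ("site_b", 12, False),
    ("site_b", 5, True),
    ("site_b", 15, False),
]


def mean_count_by_site(rows):
    """Return the mean hourly count per site for the given observation periods."""
    by_site = {}
    for site, count, _precipitation in rows:
        by_site.setdefault(site, []).append(count)
    return {site: mean(counts) for site, counts in by_site.items()}


# Primary analysis: all observation periods, regardless of weather.
primary = mean_count_by_site(periods)

# Sensitivity analysis: periods with recorded precipitation are removed.
dry_only = [row for row in periods if not row[2]]
sensitivity = mean_count_by_site(dry_only)

for site in primary:
    print(site, "all periods:", primary[site], "dry periods only:", sensitivity[site])
```

If the conclusions drawn from the two sets of estimates agree, precipitation is unlikely to explain the result. Including precipitation as a covariate in a regression model is the alternative approach mentioned above and is not shown here.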

Strengths of MOHAWk
Many reviews have shown there is a scarcity of robust natural experimental studies of the causal effects of the urban environment on physical activity and wellbeing, particularly in Europe (Hunter et al. 2015, Benton et al. 2016b, Roberts et al. 2018, MacMillan et al. 2018, Kärmeniemi et al. 2018, Houlden et al. 2018, Moore et al. 2018). MOHAWk is an unobtrusive measure that will allow more robust natural experimental studies in this area by providing a measure that has evidence of reliability and validity, and is validated for use in Europe. MOHAWk is currently being used in three separate natural experimental studies of interventions in different types of urban spaces in Greater Manchester (UK), including residential streets (Benton et al. 2018b), a green corridor (Benton et al. 2018a), and a small park.
We have demonstrated that MOHAWk is a reliable tool, with evidence of validity of observing Take Notice behaviours. This is important given that previous studies in this field have often relied on outcome measures that have not been validated: a recent review on natural experimental studies of changing the built environment on physical activity found that seven out of fifteen outcomes were reported using unvalidated outcome measures (Benton et al. 2016a). Results from the present studies suggest that MOHAWk is a reliable and valid outcome measure that can be used in a range of urban environments, thus promoting comparability between studies.
The three studies reported here have provided 156 hours of normative data, and to date there are also a further 172 hours of baseline data from three other natural experimental studies (Benton et al. 2018a, 2018b), with data provided in Supplementary file 4. These normative data will help researchers conduct sample size calculations for future natural experimental studies; a lack of sample size calculations is a key weakness of previous natural experimental studies of urban spaces on physical activity (Hunter et al. 2015). These normative data will also help researchers determine the frequency and timing of observations to obtain accurate estimations of activity. More MOHAWk data are now needed in different urban spaces, settings (especially outside the UK) and populations (especially children and teens).
MOHAWk is a tool that can be used by most people. Existing observation tools have previously been used by non-researchers, such as volunteers from the local community (Tully et al. 2013). Therefore, MOHAWk can feasibly be used in policy or practice by stakeholders involved in the planning, design, implementation and maintenance of urban spaces, such as local authorities, public health practitioners or developers. This new tool is an inexpensive and easy-to-use method of generating wellbeing impact evidence in relation to changes in the urban physical or social environment. A manual providing detailed instruction on how to use MOHAWk (Supplementary file 1) and observation forms (Supplementary files 2 and 3) are freely provided to facilitate its widespread use; please contact the corresponding author for further assistance on how to use MOHAWk.

Future research
Further psychometric testing of MOHAWk is required to increase evidence of validity and reliability, particularly for observing Connect behaviours as there was no validity testing for this in the present studies. For example, researchers could evaluate whether there are significantly more Connect behaviours during events where one would expect more social interactions (e.g. a summer fair) compared to days where there is no such event. Sensitivity to change is currently being tested in three separate natural experimental studies in Greater Manchester (Benton et al. 2018a, 2018b), but more natural experimental studies are needed to assess how responsive MOHAWk is to change. We encourage other researchers to use MOHAWk to evaluate environmental interventions; see methods described elsewhere (Benton et al. 2018a, 2018b) for examples of how to use MOHAWk in natural experimental studies. Future research should also look into using MOHAWk in the evening, although this may only be feasible using image/video capture devices due to ethical issues associated with deploying observers outside daylight hours.