Seeing through a new lens: exploring the potential of city walking tour videos for urban analytics

ABSTRACT City Walking Tour Videos (CWTVs) are a novel source of Volunteered Geographic Information providing street-level imagery through video sharing platforms such as YouTube. We demonstrate that these videos contain rich information for urban analytical applications by conducting a mobility study in which we detect transport modes, with a focus on active mobility (pedestrians and cyclists) and motorised mobility (cars, motorcyclists and trucks). We chose the City of Paris as our area of interest given the rapid expansion of its bicycle network in response to the Covid-19 pandemic, and compiled a video corpus encompassing more than 66 hours of footage. Through the detection of street names in the video frames and of placename-containing timestamps in the video metadata, we extracted and georeferenced 1169 locations at which we summarise the detected transport modes. Our results show the high potential of CWTVs for studying urban mobility. We detected significant shifts in the mobility mix before and during the pandemic, as well as weather effects on the volumes of pedestrians and cyclists. Combined with the observed increase in data availability over the years, we suggest that CWTVs have considerable potential for other applications in the field of urban analytics.


Introduction
55% of the world's population currently live in urban environments, and it is estimated that this proportion will grow to 68% by 2050 (United Nations, Department of Economic and Social Affairs, Population Division 2019). Understanding how citizens experience, interact with and are influenced by the cities in which they live is at the core of the emerging field of urban analytics or city science (Batty 2019). Central to the emergence of urban analytics are new data sources recording the ways in which individuals and groups behave at increasingly high spatial and temporal resolutions. This big data revolution, perhaps most effectively summarised by Michael Goodchild in his seminal paper discussing User-Generated Content (UGC) and introducing the spatially more explicit Volunteered Geographic Information (VGI) (Goodchild 2007), has given rise to numerous studies. These include many exploring human aspects of cities, for example the ways in which we partition them and name their parts (Hollenstein and Purves 2010; Hu et al. 2015), the sounds we perceive as we move around cities (Aiello et al. 2016; Zhao et al. 2023) and feelings about urban green spaces (Gugulica and Burghardt 2023; Roberts, Sadler, and Chapman 2019). All of these examples have potential in describing and monitoring cities with respect to the ways in which we use these spaces sustainably.
One particularly important aspect of sustainable development in cities is mobility (Banister 2008). The way in which we move around cities, travelling to and from work, school or leisure activities to pursue our everyday lives, is central to the quality of our lives in these spaces. Internationally, cities aim to reduce reliance on private modes of transport, shifting journeys to public and active transport modes (Winters et al. 2013). Reducing the distances that we travel in cities, through policy interventions to provide more services locally and make travel by foot and bicycle more attractive, is also seen as a possible route to increasing the quality of life of those living in cities (Mulley et al. 2013). Measuring the mix of travel modes along the arteries of a city, and their variation in time - the city's pulse - is the subject of this paper. Measurements of this behaviour often rely on surveys carried out occasionally, traffic counters, travel diaries and, more recently, information extracted from GPS tracking, traffic and surveillance cameras and travel data collected by public transport authorities (Houston, Luong, and Boarnet 2014). However, the emergence of VGI, and the notion of citizens as sensors, brings with it the idea that we might also explore this behaviour through data traces collected by individuals in the form of pictures and videos.
Continuous improvements in computer vision algorithms, coupled with a move towards Open Source implementations, have led to rapid growth in the development of approaches to urban analytics which extract some form of semantics from raw images to classify and characterise cities. Three data sources are particularly prominent in such endeavours.
First, Google's Street View data provide dense, georeferenced coverage, especially of road networks, in many cities. These data have been used, among others, to automatically estimate transport modes such as pedestrian or cyclist volumes (Hankey et al. 2021; Yin et al. 2015), with Goel et al. (2018) finding a high correlation between Google Street View derived cyclist counts and traditional census data across 34 cities in Great Britain. Counting objects effectively in space requires the use of not only object detection methods but also object tracking (Ewing et al. 2013; Purciel et al. 2009). Despite the popularity of Google Street View, largely driven by its spatial coverage and relatively easy access, these data also have a number of limitations. The temporal frequency of updates is highly variable, favouring urban over rural areas, with a maximum visitation frequency of once every 2-3 years.1 For any given location, only one version of Street View exists, and thus change cannot be captured through this dataset.

Second, traffic video data, often collected by local governments to monitor traffic conditions and enforce tolls, have proven an effective data source from which traffic counts and modes can be automatically extracted. Traffic videos are often recorded through stationary CCTV cameras at critical locations such as dangerous, busy or complex intersections. These generally high-resolution videos are of interest particularly for urban mobility studies about road safety (Saunier et al. 2014), such as the detection of road collisions (Saunier and Sayed 2008) and pedestrian-vehicle conflicts (Ismail et al. 2009). The review by Espinosa, Velastín, and Branch (2021) summarises detection approaches based on the example of motorcycles as vulnerable road users in urban traffic videos, which generally provide detailed information about the traffic mix at a specific location (Jodoin, Bilodeau, and Saunier 2014). However, despite their high image and temporal resolution, these data typically lack spatial coverage, being available only at a select few locations.

Third, VGI data in the form of images, for example as uploaded by users to the image sharing platform Flickr, often provide rich data with respect to more popular locations within cities. In a recent publication, Knura et al. (2021) applied image object recognition to geolocated Flickr images to detect bicycles and to analyse their spatial distribution across the City of Dresden.
New data forms capturing different aspects of city life are constantly emerging, and using these data to complement more traditional sources has value for urban planners and others interested in studying cities. Combining these data sources with advances in computational methods creates opportunities both to study emerging phenomena and to revisit old questions with novel forms of data (Arribas-Bel 2014; Ruppert 2013). During the Covid-19 pandemic and the associated travel restrictions, videos showcasing cities, made accessible on platforms such as YouTube, increased rapidly in popularity. These user-generated videos are characterised by first-person, street-level footage of a person walking in a continuous manner through typically popular urban areas containing points of interest with touristic value, recording local sights and sounds. In this paper we label such videos City Walking Tour Videos (CWTVs). CWTVs capture very rich information about the areas traversed, since the video data are of high resolution and have typical durations of around 1 hour.
The work by Lewis and Park (2018) is, to our knowledge, the first and only consideration of user-generated YouTube videos in a geographic context. They termed this new data form Volunteered Geographic Videos (VGV) and investigated its value through two case studies, one of which was a study of topographic data extraction after a landslide in the field of physical geography. Their manual inspection of video sections of known locations confirmed the value contained within YouTube videos and the real-world applications thereof. Lewis and Park (2018) recommend that further VGV studies focus on the relations between humans and the built environment in urban areas in an effort to better understand how available imagery repositories can help to complement existing datasets. We consider CWTVs a specific subcategory of VGVs given their unique characteristics listed above. To our knowledge, this paper provides the first instance of using CWTVs to demonstrate their potential as a data source on human urban mobility through the detection of individual transport modes (i.e. pedestrians, (motor-)cyclists and more).
To explore the possibilities of CWTVs as a data source for human urban mobility applications, we built an automated workflow to extract and count transport modes from CWTVs, using metadata and video content to georeference locations at the street segment level (RQ1). We chose the City of Paris as a region of interest to assess the data quality and characteristics of CWTVs (e.g. data volume, coverage) for quantifying the transport mix in urban areas. Paris was chosen because, as a popular destination, we expected to find good spatial and temporal coverage of CWTVs (RQ2). Furthermore, the city has a number of policies to encourage active travel, and this trend was accelerated during the Covid-19 pandemic, which fuelled a need for a further expanded and safe bicycle network. For example, in an effort to accommodate the rapid increase in cyclists and to keep them safe, the city quickly deployed 'pop-up' bike lanes which were soon referred to as 'coronapistes'.2 This rapid deployment of built infrastructure leads us to hypothesise that these coronapistes are visible as changes in the transport mix at specific locations and allow us to explore the potential efficacy of CWTV data as a mobility indicator (RQ3). Summarising, we address three research questions in this paper: (1) How can an automated workflow be developed to extract and georeference transport mode counts from City Walking Tour Videos? (2) What are the data characteristics of City Walking Tour Videos of the City of Paris in terms of data volume and spatio-temporal coverage? (3) Do City Walking Tour Videos of the City of Paris between 2015 and 2021 reflect changes in the mix of transport modes, and do we find evidence of increased active travel?

Materials and methods
Our complete workflow is visualised in Figure 1. Our starting point is CWTVs recorded in Paris, and the outputs of the complete model are geolocated counts of active travel (pedestrians and cyclists) and motorised transport (cars, trucks and motorcycles). Geolocations are provided at the granularity of street names, which were extracted from OpenStreetMap (OSM) data. Associated with each geolocated measurement is a timestamp and, where available, weather information.
Our workflow uses the YouTube API to search for videos recorded in Paris and their metadata. From the metadata, we extract timestamps, weather information and locations. Video analysis is centred around two main tasks. First, we identify locations using Optical Character Recognition (OCR) to find location information (e.g. street names) within the video content. These locations, and any others explicitly described within the metadata, are used as anchors around video segments on which we perform the transport mode detection. Second, we carry out object detection, to identify different candidate transport modes, and tracking, to follow individual objects across frames in order to prevent multiple counts of the same object. Finally, we apply a set of simple rules to combine objects, leading to our final transport modes. In the following, we describe each step of our workflow in more detail.

Location gazetteer
To link videos to locations, we relied on placenames found within our metadata, and those which we identified through OCR in video segments (step 2.2). We decided to use OSM data for this task since it is arguably the most successful VGI project on the internet and it plays an important role in various Digital Earth applications, e.g. disaster mapping (Mooney and Corcoran 2014). OSM allowed us to easily link names to geometries while making our approach transferable to other cities with good OSM mapping coverage. We created our location gazetteer consisting of a range of feature types, including street names, tourist attractions and buildings (e.g. museums, operas and boulevards), which link to a specific location and which are likely to be featured in CWTVs. We extracted the data using an Overpass query3 displayed in Listing 1.
Listing 1. Overpass OSM query to obtain the streets for the City of Paris. Double forward slashes denote comments.
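The original Listing 1 is not reproduced here; the following is a minimal sketch, in Python, of an Overpass query of the kind described, retrieving named streets within the Paris boundary via the public Overpass API. The endpoint URL, the admin_level value and the restriction to highway features are illustrative assumptions; our actual gazetteer additionally included tourist attractions and buildings.

```python
import requests

# Minimal sketch (not the exact query from Listing 1): fetch all named streets
# inside the administrative boundary of Paris from the public Overpass API.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"
QUERY = """
[out:json][timeout:180];
// resolve the city boundary and store it as an area
area["name"="Paris"]["boundary"="administrative"]["admin_level"="8"]->.paris;
// all named road features within that area
way(area.paris)["highway"]["name"];
out tags center;
"""

response = requests.post(OVERPASS_URL, data={"data": QUERY}, timeout=300)
response.raise_for_status()
street_names = sorted({el["tags"]["name"] for el in response.json()["elements"]})
print(len(street_names), street_names[:5])
```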

Data acquisition
For the workflow to function, input data for the City of Paris was required (step 1). We were interested in videos depicting a continuous walk around the City of Paris from the view of a pedestrian (see Figure 2). Two exemplary reference videos that we considered in our study can be viewed here.4 We used a custom-created wrapper around the Python library 'pytube' to download video MP4 files based on the query 'Paris walk' in 720p resolution (step 1.1). This query initially returned 1645 potential videos. We further condensed the pool to 142 videos by filtering for videos that contain location-based timestamps in their video description (as seen in Figure 3).
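As a rough illustration of step 1.1, the sketch below shows how such a download could be expressed with pytube's search and stream-filtering interfaces; the query string follows the paper, while the result handling, the output path and the assumption that pytube's API is unchanged are illustrative rather than our exact wrapper.

```python
from pytube import Search

# Illustrative sketch of step 1.1 (our custom wrapper adds retries, rate
# limiting and bookkeeping; pytube's interface may also have changed since).
results = Search("Paris walk").results            # candidate videos for the query
for video in results:
    stream = (video.streams
              .filter(res="720p", file_extension="mp4", progressive=True)
              .first())
    if stream is not None:
        stream.download(output_path="videos/")    # save the 720p MP4 locally
```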
Finally, we performed a manual quality check on the videos and excluded examples that showed heavy video editing, transitions and cuts, bird's eye views (e.g. drone footage) or long static sections without movement. In this step, 62 videos, or 44% of the remaining pool, were filtered out. After all of these steps we were left with 77 videos and more than 66 hours of video footage, which are accessible through links provided in Appendix 1.
As well as the visual information encoded in the videos, we were also interested in the video metadata, which contained user-generated textual information including video titles, descriptions and upload dates written by the video creator. We acquired this information for each video through the YouTube API search endpoint.5 This metadata was used to extract location-relevant timestamps for the purpose of geolocation and video filtering (as described above), weather-relevant information and specific times of recording as an alternative to the upload date.

Data processing
Raw MP4 video files were treated as sequences of images which can be analysed frame by frame. Each frame was run through the object detection model (step 2.1) YOLOv5 (Jocher et al. 2021) to detect generic objects from the COCO image dataset (Lin et al. 2015), which contains large amounts of visual training data for 80 generic object classes and has been used to benchmark different model architectures. We were specifically interested in detecting transport or mobility relevant objects including 'person', 'bicycle', 'motorcycle', 'car' and 'truck'. In contrast to conventional image object detection, when working with videos it is necessary to track objects across frames so as not to count them multiple times. For this we used the DeepSort algorithm, which assigns a unique object identifier to individual objects persisting across frames to perform object tracking (Wojke, Bewley, and Paulus 2017). The output of step 2.1 is thus a track-log for each video. Each row in a track-log represents a detected object together with a frame number, unique object id, the corresponding COCO class index and bounding box. The Convolutional Neural Network (CNN) YOLOv5 has outstanding inference speed due to its single forward-pass implementation (Jocher et al. 2021), which makes it particularly suitable for working with large data volumes in real time with modest hardware requirements. The combination of the tools YOLOv5 and DeepSort was chosen since both implementations are open and aligned with each other, they were considered state-of-the-art tools at the time of implementation, and given their popularity they were well documented. On top of that, they have been used in various existing mobility studies, among others to detect and count pedestrians (Qiu et al. 2021; Razzok et al. 2023) and vehicles (Gao 2022).
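The sketch below illustrates the shape of this detection and tracking step and of the resulting track-log. It uses the YOLOv5 hub interface with pretrained COCO weights; the yolov5s model size, the input filename and the use of the deep_sort_realtime package as a stand-in for the DeepSort implementation of Wojke, Bewley, and Paulus (2017) are assumptions, not the exact configuration of our published workflow.

```python
import cv2
import torch
from deep_sort_realtime.deepsort_tracker import DeepSort  # assumed stand-in tracker

MOBILITY_CLASSES = {"person", "bicycle", "motorcycle", "car", "truck"}

model = torch.hub.load("ultralytics/yolov5", "yolov5s")   # pretrained COCO weights (model size assumed)
tracker = DeepSort(max_age=30)

track_log = []                                            # rows: (frame, object id, class, bounding box)
cap = cv2.VideoCapture("paris_walk.mp4")                  # hypothetical input file
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    detections = model(rgb).xyxy[0].tolist()              # [x1, y1, x2, y2, confidence, class index]
    detections = [([x1, y1, x2 - x1, y2 - y1], conf, model.names[int(cls)])
                  for x1, y1, x2, y2, conf, cls in detections
                  if model.names[int(cls)] in MOBILITY_CLASSES]
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            track_log.append((frame_idx, track.track_id, track.get_det_class(), track.to_ltrb()))
    frame_idx += 1
cap.release()
```

Because each row carries a persistent track id, later aggregation can be done per unique object rather than per frame, which is what prevents the same object from being counted repeatedly.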
To identify candidate location names in video frames, we used the Optical Character Recognition (OCR) library easyocr (JaidedAI n.d.) (step 2.2) to extract text from images. Since OCR is a computationally intensive operation, we only performed it on every third frame. The OCR output log of step 2.2 was composed similarly to the previously mentioned track-log. Each row in the OCR-log included the detected text with a corresponding frame number, a confidence score and bounding box. The detected text rarely contains clean street names, but rather noisy text artefacts from billboards, shop names and so on. Before attempting to match these artefacts to our OSM gazetteer we first used rule-based filters to eliminate noise. We excluded text artefacts with OCR confidence scores of less than 0.5 and those containing multiple special characters not expected to occur in location names.
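A minimal sketch of this OCR step and the noise filters is given below; the language selection ('fr', 'en') and the exact character whitelist are illustrative assumptions, not the filters used verbatim in our workflow.

```python
import re
import easyocr

# Sketch of step 2.2: OCR on a frame plus the rule-based noise filters
# described above (confidence >= 0.5, no unexpected characters).
reader = easyocr.Reader(["fr", "en"])             # assumed language setting for French street signs
ALLOWED = re.compile(r"^[A-Za-zÀ-ÿ0-9' .\-]+$")   # characters plausible in place names (assumption)

def ocr_candidates(frame, frame_idx):
    """Return (frame, text, confidence, bbox) rows for plausible location text."""
    rows = []
    for bbox, text, confidence in reader.readtext(frame):
        if confidence >= 0.5 and ALLOWED.match(text):
            rows.append((frame_idx, text, confidence, bbox))
    return rows
```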
The metadata which we obtained from the YouTube API for each video was checked for two key components, which we extracted if present (step 2.3). First, the video description was scanned for user-added timestamps as seen in Figure 3. These often contained valuable location information for later geolocation, such as toponyms in the form of street names. Second, we were interested in categorising videos into good and bad weather conditions based on keyword searches. For the former, we searched, among others, for the words 'nice or good weather', 'clear or blue sky', 'sunny' and 'cloudless'. For the latter, we considered descriptions such as 'bad weather', 'dark sky', 'rain(y)', 'wind(y)', 'gust(y)', 'snow(y)' and 'hail'.
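The following sketch indicates how such metadata parsing can be done; the timestamp regular expression and the abbreviated keyword lists are illustrative, and real descriptions require more tolerant matching.

```python
import re

# Sketch of step 2.3: pull location-bearing timestamps and a coarse weather label
# out of the user-written video description (keyword lists abbreviated here).
TIMESTAMP = re.compile(r"^(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+)$")
GOOD = ("nice weather", "good weather", "clear sky", "blue sky", "sunny", "cloudless")
BAD = ("bad weather", "dark sky", "rain", "wind", "gust", "snow", "hail")

def parse_description(description: str):
    stamps, text = [], description.lower()
    for line in description.splitlines():
        m = TIMESTAMP.match(line.strip())
        if m:
            hours, minutes, seconds, place = m.groups()
            offset = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            stamps.append((offset, place.strip()))     # seconds into the video, placename
    weather = ("good" if any(k in text for k in GOOD)
               else "bad" if any(k in text for k in BAD)
               else None)
    return stamps, weather
```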

Geolocation
YouTube videos are not geolocated at fine granularities. Rather, YouTube allows coarse location tagging of one tag per video at roughly city level. To extract more detailed spatial information for our purpose we used two complementary approaches in parallel. The first approach used the text artefacts and their bounding boxes recorded in the OCR output log. We first clustered artefacts based on their bounding boxes using HDBSCAN (McInnes, Healy, and Astels 2017) to combine text spanning multiple lines, as is often the case on street signs. To reduce the number of false positives, we then only considered artefacts with ten or more characters. These were then matched to our OSM gazetteer using the Levenshtein distance metric (step 3.1). Levenshtein distance (Miller, Vandome, and McBrewster 2009) is a fuzzy string matching metric which allows the calculation of string similarity (or difference). We retained all strings with a Levenshtein distance of two or less. Our second approach to geolocation leveraged the video timestamps and associated placenames found in the video metadata (step 2.3). Similarly to the first approach, we matched them to our gazetteer using Levenshtein distances, but this time with a lower threshold of one or less, which mostly accounts for the lack of use of, for example, accents (step 3.2). Together, these two approaches generated for each video a list of locations linked to our OSM gazetteer.
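A condensed sketch of the OCR-based matching (step 3.1) is shown below; clustering on bounding-box centroids, the helper name match_ocr_to_gazetteer and the tie-breaking by minimum distance are simplifying assumptions rather than the published implementation.

```python
import hdbscan
import Levenshtein
import numpy as np

def match_ocr_to_gazetteer(ocr_rows, gazetteer, max_distance=2):
    """ocr_rows: list of (frame, text, confidence, (cx, cy)); gazetteer: list of OSM names."""
    centres = np.array([row[3] for row in ocr_rows])
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(centres)
    matches = []
    for label in set(labels) - {-1}:                           # -1 marks unclustered noise
        members = np.where(labels == label)[0]
        candidate = " ".join(ocr_rows[i][1] for i in members)  # merge multi-line sign text
        if len(candidate) < 10:                                # false-positive guard (ten-character rule)
            continue
        best = min(gazetteer,
                   key=lambda name: Levenshtein.distance(candidate.lower(), name.lower()))
        if Levenshtein.distance(candidate.lower(), best.lower()) <= max_distance:
            matches.append((ocr_rows[members[0]][0], best))    # (frame of first member, matched name)
    return matches
```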

Aggregation of transport modes
Transport modes describe various forms of person-centric movement. To infer transport modes we linked detected people to mobility relevant objects such as 'bicycles' and 'motorcycles' detected within the same frame (step 4.1). Taking the transport mode 'cyclist' as an example, we look for a person whose bounding box overlaps and is vertically stacked on top of a 'bicycle' object's bounding box. We considered all people for whom no connection to a transport object was found to be 'pedestrians'. Ultimately, we counted and aggregated transport modes in a temporal buffer around all locations found throughout step 3 of geolocation. This buffer was implemented as a variable within our workflow and its value differed according to whether a location was detected via video metadata or OCR (step 4.2). For the former, we aggregated transport modes from the video timestamp onwards for a 90-second time window. For the latter, we analysed frames before and after the identified location (+/- 45 seconds), since we assumed the OCR location to be valid for the preceding video sections, whereas the user-generated timestamps in metadata suggest a transition to a new location.
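To make the linking rule concrete, the sketch below gives one possible reading of the 'overlapping and vertically stacked' test for step 4.1; the helper names and the exact geometric conditions are illustrative assumptions.

```python
# Sketch of step 4.1: link a detected 'person' to a ride-on object in the same frame.
# Boxes are (left, top, right, bottom) in pixel coordinates, y increasing downwards.
def horizontal_overlap(a, b):
    return min(a[2], b[2]) - max(a[0], b[0]) > 0

def infer_mode(person_box, frame_objects):
    """frame_objects: list of (class_name, box) detected in the same frame."""
    for cls, box in frame_objects:
        if cls in ("bicycle", "motorcycle"):
            # person extends above the vehicle and the two boxes overlap vertically
            stacked = person_box[1] < box[1] and person_box[3] > box[1]
            if stacked and horizontal_overlap(person_box, box):
                return "cyclist" if cls == "bicycle" else "motorcyclist"
    return "pedestrian"
```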
The output of the workflow contained all detected locations, enriched with their respective metadata of weather information and time of day as well as their aggregated transport mode counts. The final dataset contained 1169 entries.

Results
In the following, we explore the results of our workflow. We divide this material into two parts. We first describe the properties of the data we extracted in terms of their volume and spatio-temporal coverage (RQ2). Our starting point was our collection of 77 CWTVs, with more than 66 hours of footage. These videos cover a timespan ranging from 2015 to 2021. Contained within this dataset are 1169 locations, of which 580 are unique. If we assume that all of these locations were identified with no temporal overlaps, we can estimate the upper bound of (geolocated) video material from which we could estimate transport mode counts as around 29 hours (1169 locations, each contributing a 90-second window given the buffers defined in Section 2.5), or nearly 50% of the total video material. In reality, there are likely to be temporal overlaps between detected locations, but this result suggests that a large proportion of the video footage included material which could be geolocated, and thus was suitable for analysis.
We extracted locations using two methods: gazetteer lookup on video timestamps, where we assume a very high precision since we only allow Levenshtein distances of one between gazetteer entries and video metadata, and OCR-based matching of text detected in the video frames. Of the 580 unique locations, 271 (47%) were identified exclusively using OCR, 229 (39%) exclusively using video timestamps, and 80 (14%) using both methods. To evaluate our OCR method, we selected 50 random locations together with their detection time and compared the video frame manually with the location identified by our workflow. For 31 of these locations, we could unambiguously associate the video frame with the detected location (e.g. based on a street name visible within the video). For the remaining 19, it was not possible to verify the extracted location, giving us an estimated OCR precision of 62%. These false positives were often caused by successful gazetteer matches of text that did not originate from street signs, but rather from, for example, shop or restaurant names that could not be unambiguously associated with a single location. Combining the two methods and assuming near perfect precision for metadata look-up, we estimate the overall precision of the extracted locations to be of the order of 82%. Figure 4 shows the spatial distribution of the detected locations across Paris.
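This overall estimate can be reproduced with a back-of-the-envelope calculation from the figures above, assuming timestamp-derived locations (exclusive or overlapping with OCR) are essentially always correct:

```python
# Back-of-the-envelope check of the reported ~82% overall precision.
ocr_only, ts_only, both = 271, 229, 80        # unique locations per detection route
p_ocr = 31 / 50                               # manual spot-check: 31 of 50 OCR locations verified
overall = (ocr_only * p_ocr + (ts_only + both) * 1.0) / (ocr_only + ts_only + both)
print(round(p_ocr, 2), round(overall, 2))     # 0.62, 0.82
```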
As is typical for UGC, our videos as well as the detected locations are not evenly distributed in space. As we might expect, central, typically more touristic areas of Paris show a more complete coverage than those on the periphery of the city. As a consequence, the frequency with which streets are captured varies. Figure 5 gives an overview of the number of times that locations were identified in our data. 339 locations were detected only once, 15 locations five times, and eight locations ten or more times. The unique locations with the most data were the famous tourist district of Montmartre and the Rue de Rivoli, running through the heart of Paris, with 27 and 22 occurrences respectively. This heterogeneous characteristic of UGC is not only apparent in space but also across time.
Figure 6 shows the temporal change in aggregated data availability for all 80 quartiers (districts) of Paris. We observe a steady increase in data coverage from 2015 to 2019, with a distinct gap in 2016. In 2020, which marks the start of the Covid-19 pandemic in Europe, we record a clear drop in data coverage, followed by a rapid recovery in 2021 to a slightly higher level than in 2019. The number of available CWTVs contributing to these observations changed drastically over the years. Starting with one video each in 2015 and 2016, rising to 19 videos in 2019 and reaching 36 videos in 2021, the growth is roughly exponential. 2020 is the only year with a distinctly lower than anticipated number of videos, with a total of seven.
To validate transport mode detection in our workflow, we took a similar approach to that used for OCR locations, assessing videos by hand. To do so, we selected two independent video sequences with different characteristics, in which all of our important transport modes are represented, to validate the workflow output under varying conditions. The first video excerpt6 was 3.5 minutes in length (V1) and was taken from the Rue de Rivoli, a famous, long street in Paris with an available sidewalk, bus and bicycle lane and two-way road. The second video excerpt7 of 5 minutes (V2) was taken from the Rue du Faubourg Saint-Honoré during a special police presence where the traffic was regulated and with large numbers of pedestrians. We manually counted the transport modes 'pedestrian', 'cyclist', 'motorcyclist', 'car' and 'truck' for both video sections and compared them with the counts returned by our workflow (see Table 1). Our results across both videos show good agreement between our manual counting and the workflow returned counts for the first four mentioned classes. However, the class 'car' stands out with a strong overestimation by our workflow in video excerpt two. Based on the YOLOv5 provided class-specific performance, the class 'car' shows the second lowest mean average precision (mAP) of all considered classes (Glenn 2023), meaning the model is not as confident in detecting cars. We discuss the implications of these results within the limitations in the discussion.

Leveraging our geolocated transport mode counts across all videos, we investigated the observed spatial patterns aggregated at quartier level (shapefile obtained from the French Open Data portal8). The small multiples in Figure 7 visualise the acquired spatio-temporal information by transport mode, covering the timespan between 2017 and 2021. The years 2015-16 are not included given the low number of CWTVs, as seen in Figure 6. The absolute counts of all three transport modes per quartier were normalised to percentage shares to allow comparison between years, whereby locations that belong to multiple quartiers (e.g. Rue de Rivoli) were counted towards all bordering aggregation units. Given our special attention towards active transport and its dominant share within the data, we considered pedestrians and cyclists separately and grouped all motorised transport including cars, trucks and motorcyclists. The observed data coverage and the change in transport mode volumes show a pulsing, dynamic behaviour across years in the form of visual changes in measurement per quartier. The years leading up to 2020 show increased spatial coverage (more quartiers with data, as already shown in Figure 6) and increasing measurements of transport modes. 2020, where restrictions on movement due to the Covid-19 pandemic were strongest, has more limited spatial coverage with a distinct city-centred focus compared to the other years.
To further investigate the change in transport modes over the years, we aggregated the timespans before (2015-2019) and during the Covid-19 pandemic (2020-2021) to compare the five most data-rich quartiers, namely Clignancourt, Halles, Saint-Germain-l'Auxerrois (SGlA), Saint-Germain-des-Prés (SGdP) and Sorbonne. We performed a Mann-Whitney U test under the assumption that observations from different quartiers are independent of each other. As seen in Figure 8, we found significant increases in the relative share of pedestrians for the quartiers Sorbonne, Clignancourt and Halles between the periods before and during the pandemic at a significance level of 0.05. We also found significant decreases in motorised transport for the quartiers Sorbonne, Halles and SGlA. We did not detect any significant change for cyclists, whose absolute counts were generally low, although mean values for all quartiers except SGlA were higher during the pandemic compared to before.
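For reference, a minimal sketch of such a test using scipy is given below; scipy.stats.mannwhitneyu is a standard implementation, while the example share values are hypothetical and not the values underlying Figure 8.

```python
from scipy.stats import mannwhitneyu

# Sketch of the per-quartier, per-mode significance test: the two samples are
# relative shares observed before (2015-2019) and during (2020-2021) the pandemic.
def compare_periods(shares_before, shares_during, alpha=0.05):
    stat, p_value = mannwhitneyu(shares_before, shares_during, alternative="two-sided")
    return p_value, p_value < alpha

# Hypothetical example values only, for illustration of the call.
p, significant = compare_periods([0.55, 0.60, 0.58, 0.52], [0.70, 0.68, 0.74])
print(p, significant)
```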
Lastly, we checked for weather-driven effects on the share of observed transport modes within our dataset. Figure 9 shows changes in behaviour indicating significant weather effects on pedestrians and cyclists, with a decline in activity during bad weather. Motorised transport shows no significant relationship to weather at these locations.

Discussion
In the following discussion we first return to our research questions, before exploring limitations of our approach and, finally, the broader implications of our work.
We were able to implement an automated workflow (RQ1) that can geolocate and detect transport mode counts from City Walking Tour Videos (CWTVs). We investigated the performance of the workflow's two main components: geolocation of video segments and transport mode detection.
Geolocation of video sections was achieved by extracting location information either from the user-generated timestamps or from the video frames through OCR. Our approach showed that by combining visual and textual information we were able to increase the overall count of geolocated sections and thus the number of detected locations. The use of OCR led to roughly 10% more unique locations than using timestamps alone. OCR-detected locations had lower precision (0.62) than the overall workflow (0.8), assuming manually assigned metadata to be perfect. Most false positives resulted from OCR-matched text which did not originate from street signs. Future possible improvements to our algorithm could include an additional step of street sign recognition, similar to Wojna et al. (2017). The overall quality of the geolocation step is strongly dependent on the quality of the used gazetteer as well as the video resolution. In our case, we compiled a gazetteer based on features extracted from OpenStreetMap, which has been found to provide timely and rich data for the City of Paris (Girres and Touya 2010). Removing points of interest from our gazetteer and using only street names would likely increase precision at some cost to recall during the OCR geolocation step (step 3.1 in Figure 1).
We validated transport mode volumes by manually annotating two video sections of 3.5 minutes (V1) and 5 minutes (V2) duration. V1 was taken from the Rue de Rivoli, a prominent and heavily frequented road in the heart of Paris which offers separate bus and bicycle lanes, pedestrian sidewalks and a two-way road. The walk trajectory in the video was straight, following the road, with the camera directed straight ahead. V2 captured a very different situation, impacted by crowd control measures which redirected large numbers of pedestrians. Generally speaking, the counts in both videos are in broad agreement with those derived from our workflow, except for the class 'car' (see Table 1), especially in V2. Importantly, the people-centric transport modes, including pedestrians and cyclists, which are important for active travel, show good performance in both test videos.
Besides workflow performance, our results are also influenced by subjective decisions we took while conceptualising our pipeline. One important parameter is the temporal buffer around detected locations within which we count transport modes. The results reported used a temporal window of 90 seconds, and it is clear that altering this parameter impacts overall counts. The value we chose was based on empirical sensitivity tests, and generating dynamic buffer values based on the location type could be investigated to improve the attribution of transport modes to locations.
We were able to create spatio-temporal small multiples summarising travel modes across Paris, which demonstrate the more general potential of CWTVs as a way of exploring sustainable cities. Data volume has steadily increased, and although coverage is biased towards more central areas, we built a very rich dataset with sufficient coverage to explore temporal variation across Paris quartiers. Furthermore, by grouping data we could also explore the effects of weather conditions and the Covid-19 pandemic. Information on the former was extracted from the user-generated video metadata, but could also be approximated through meteorological data for the given place and time. Additionally, the visual information within the videos could be explored in future work to extract physical conditions, e.g. wet roads or snow. Our data contrast with previous work using Google Street View, which has very high spatial resolution but is typically limited to a single temporal snapshot and offers only limited user-generated contextual information.
Our workflow successfully extracted spatial and temporal information capturing different travel modes across Paris (RQ3). We found significant changes in the numbers of pedestrians across Paris related to the Covid-19 pandemic, similar to behaviour observed in other cities (Doubleday et al. 2021), and identified weather-related effects on active travel (Miranda-Moreno and Nosal 2011). However, the proportion of cyclists in our data was much lower than that of both pedestrians and motorised transport. This is likely due to the way in which CWTVs are captured, as pedestrians move through the city, often favouring pedestrianised areas over, for example, cycle paths and in particular the coronapistes whose development was accelerated by the City of Paris during the pandemic.

Limitations
Our work and the analysis of the results have revealed important limitations of CWTVs as a data source which also impact our conceptualised workflow.
There are a number of limitations to our work, some of which are general to the use of VGI and CWTVs, and others relating to conceptual and analytical decisions we took. Generally speaking, social media-based VGI datasets are available through commercial platforms, in our case YouTube, and licensing and access to the videos could change quickly and without notice (MacFeely 2019). Furthermore, the available data volumes are dependent on the platform's popularity and the underlying user-base dynamics (Wu et al. 2016). We observed a typical location bias which suggests that, on top of the general urban-rural divide (van Zanten et al. 2016), CWTV data over-represent commercial and touristic places. This observation is in line with the assumed aim of the creators of CWTVs, who try to offer the opportunity to virtually experience a city through the scenic places that people would visit if they were there themselves. This bias in turn means that open public spaces with ongoing events and pedestrianised streets are likely favoured, which is an important limitation in the interpretation of our results. Furthermore, it is important to consider the implications of gaps in coverage, which may reinforce other forms of inequality, as has been shown with respect to sensors in so-called smart cities (Robinson and Franklin 2021).
More specific limitations with respect to the transport mode detection were mostly linked to poor performance in detecting the class 'car' in one setting. Closer inspection of the annotated video containing object ids and corresponding bounding boxes revealed that object tracking, rather than object detection, was responsible for most of the mismatch in observed cars. Heavy occlusion of cars by groups of people and other objects led to newly assigned object ids, and the same car being counted multiple times. Since the majority of cars were stationary, the occlusion effect persisted longer compared to moving objects, which pass the observer more quickly. This shows that the intrinsic characteristics of a video recording can strongly impact workflow performance. These include the fact that CWTVs are recorded from the view (and height) of a pedestrian moving along intended walkways, behind and past other pedestrians who obstruct the view of other transport modes that appear, for example, on the street or in bicycle lanes. We therefore recommend adapting the analysed videos accordingly, for example using videos recorded from the perspective of cyclists if they or the infrastructure they use are the target of interest. Such targeted data collection could also reduce bias introduced through the themes and interests of the individuals recording CWTVs.
Since the data we used are easily accessible on YouTube, they are also susceptible to unethical use or misappropriation. These high-resolution videos capture identifiable faces of individuals who did not actively consent to being analysed. We want to highlight the ethical implications of using these data, both in future work and with respect to our own use case. Boyd and Crawford (2012) argued that 'just because it is accessible does not make it ethical'. In our case, we have chosen, firstly, to publish our entire workflow, making it possible to repeat our work. Since our workflow uses videos found on YouTube, if these videos are deleted, then they cannot be reanalysed. Furthermore, our algorithmic approach only creates and reports aggregated values at locations as counts of different travel modes, and explicitly does not track individuals.
In closing, our work has broader implications for work using VGI to study cities. A recent review by Hu et al. (2021) investigated the value of various human mobility data to measure the impacts of the Covid-19 pandemic and proposed the use of social media derived mobility data. However, video data were not part of that study, and given our promising results we suggest considering CWTVs as a valuable VGI data source for future work on sustainable human mobility. More broadly, videos are composed of a range of modalities; we selected images and text for our analysis. We see untapped potential for further research in investigating a video's audio track as an additional sound modality. In an urban context, noise levels (dB) derived from video footage could function as an indicator of traffic congestion or even air pollution concentrations. The absence or reduced levels of noise could be valuable for identifying areas of tranquillity, which are important for human recreation (Aiello et al. 2016). Lastly, detecting and linking specific sound wave patterns (fingerprints) to sound emitters, similarly to object detection models on visual data, could function as an indicator of the presence of certain events or objects. Studies have shown the benefit in data volume and data quality of using multiple modalities in conjunction based on social media posts (Hartmann et al. 2022); we expect the same to be true for video data such as CWTVs.

Conclusion
In our study, we explored the potential of a novel VGI data source that we term City Walking Tour Videos (CWTVs), which contain street-level imagery similar to Google Street View, in an urban analytical context. We successfully implemented an automated workflow which yielded 1169 transport mode measurements at 580 unique locations within the City of Paris, based on over 66 hours of video footage. The heterogeneous content of CWTVs required an initial video quality assessment through human evaluation to filter out unsuitable recordings, a process which could potentially be automated in the future. Our results confirm the usefulness of CWTVs, supported by an increasing volume of available data. Additionally, the data show promising characteristics such as extensive spatial and temporal coverage, as well as effects of real-world events within the research area. We observed significant differences between transport mode volumes before and during the Covid-19 pandemic as evidence of an increase in active travel. Also, in line with existing literature, the impact of weather on active forms of mobility is significant within our dataset. Overall, we conclude that CWTVs are able to capture mobility within a city as a potential complementary data source to existing ones. Especially in the field of street-level imagery, the enormous increase in data offered by videos compared to approaches based on individual photographs (Knura et al. 2021) could be significant. A variety of other applications for CWTVs are imaginable, e.g. mapping built infrastructure (benches, trash cans, fire hydrants) or capturing changes in green space, which should encourage other researchers to test plausible applications for these data within their own fields.

Figure 1 .
Figure 1. Schematic visualisation of the transport mode detection workflow.

Figure 3 .
Figure 3. Example of timestamps found within a video description that link a video segment to a specific location. Source: https://www.youtube.com/watch?v=_dRjY9gMcxE

Figure 2 .
Figure 2. Example City Walking Tour Video shot in Zurich showing object detection rectangles, object tracking ids and a visible street name sign in the top left.

Figure 4 .
Figure 4. Contained within our dataset are 580 unique locations (of a total of 1169 records) within the City of Paris. 316 of these were found through OCR, 264 in video timestamps, and 80 overlap between the two methods.

Figure 5 .
Figure 5. Histogram showing the unique locations by frequency. The inset shows the eight locations with frequencies of ten or greater.

Figure 6 .
Figure 6. Temporal change in the number of CWTVs and their aggregated spatial coverage across all 80 quartiers of Paris.

Figure 7 .
Figure 7. Visualisation of the change in relative levels of active (pedestrians and cyclists) and motorised transport (cars, trucks and motorcyclists) between 2017 and 2021 based on the spatially aggregated quartiers in the City of Paris. © Mapbox.

Figure 9 .
Figure 9. Mann-Whitney U tests at a significance level of 0.05 show a significantly lower number of pedestrians and cyclists during bad compared to good weather conditions, with no significant effect for motorised transport.

Table 1 .
Validation of the transport mode counts of the workflow compared to the manual evaluation of two video sections V1 and V2. The mean average precision (mAP) values reflect approximate class-specific YOLOv5 performance using pretrained COCO weights.