Enhancing spatial accuracy of mobile phone data using multi-temporal dasymetric interpolation

ABSTRACT Novel digital data sources allow us to attain enhanced knowledge about locations and mobilities of people in space and time. Already a fast-growing body of literature demonstrates the applicability and feasibility of mobile phone-based data in social sciences for considering mobile devices as proxies for people. However, the implementation of such data imposes many theoretical and methodological challenges. One major issue is the uneven spatial resolution of mobile phone data due to the spatial configuration of mobile network base stations and its spatial interpolation. To date, different interpolation techniques are applied to transform mobile phone data into other spatial divisions. However, these do not consider the temporality and societal context that shapes the human presence and mobility in space and time. The paper aims, first, to contribute to mobile phone-based research by addressing the need to give more attention to the spatial interpolation of given data, and further by proposing a dasymetric interpolation approach to enhance the spatial accuracy of mobile phone data. Second, it contributes to population modelling research by combining spatial, temporal and volumetric dasymetric mapping and integrating it with mobile phone data. In doing so, the paper presents a generic conceptual framework of a multi-temporal function-based dasymetric (MFD) interpolation method for mobile phone data. Empirical results demonstrate how the proposed interpolation method can improve the spatial accuracy of both night-time and daytime population distributions derived from different mobile phone data sets by taking advantage of ancillary data sources. The proposed interpolation method can be applied for both location- and person-based research, and is a fruitful starting point for improving the spatial interpolation methods for mobile phone data. We share the implementation of our method in GitHub as open access Python code.


Introduction
To better understand how societies function and to improve social justice and environmental sustainability, we need to attain enhanced spatio-temporal knowledge about the locations and mobilities of people (Hägerstrand 1970, Sheller andUrry 2006, Cresswell CONTACT Olle Järv olle.jarv@helsinki.fi The supplemental material for this article can be accessed here. and Merriman 2011, Kwan 2013). This knowledge has become possible due to the global adoption of mobile information and communication devices in the age of the Big Data, while human interactions through mobile devices create digital footprints of people in space and time (Kitchin 2014). Thus, ubiquitous mobile phones are considered as proxies for people (Asakura and Hato 2004, González et al. 2008.
For example, passively collected time-stamped and location-aware information on mobile phone use, such as call detail records (CDRs), is rapidly enhancing our understanding about individual human mobility (Yuan et al. 2012, Järv et al. 2014. Such data is applied to understand the relation between human mobility and social networks (Phithakkitnukoon et al. 2012), ethnic differences in human activity spaces (Järv et al. 2015) and for estimating travel times to public facilities (Wesolowski et al. 2015). At the population level, flourishing literature demonstrates how CDR data is used for investigating spatial distributions and temporal dynamics of local and national population (Reades et al. 2009, Kang et al. 2012, Deville et al. 2014, seasonal migration patterns (Silm and Ahas 2010), spatial structures of cities (Louail et al. 2014), traffic flow (Wang et al. 2012), disease risks ) and socio-economic levels of the population (Soto et al. 2011, Blumenstock et al. 2015.
However, investigation of human presence and mobility based on CDR data imposes theoretical and methodological challenges. For example, one challenging issue is representing the whole population while not everyone is using mobile phones for different reasons (e.g. young children, elderly). Socio-economic biases may exist in a sample while network operators have different segmentation strategies and service costs. Phone users' call activity habits and patterns vary in space and time according to individuals' socio-economic characteristics, preferences, lifestyle, habits and work attributes (Castells et al. 2007). Also, authority (legislation) and environmental (physical extent of mobile network) domains may limit where and when mobile phones can be used. These aspects may influence research findings derived from CDR data while creating spatial, temporal and socio-economic biases at both the individual and aggregated levels (Yuan et al. 2012, Wesolowski et al. 2013, Järv et al. 2014, Zhao et al. 2016. One important methodological challenge is the spatial perspective of CDR data since the spatial accuracy of given data depends on the spatial distribution of mobile network base stations, which is not equally distributed in space and changes over time. Furthermore, while more advanced interpolation techniques are applied (e.g. Louail et al. 2014, Pei et al. 2014), they do not consider the environmental and societal contexts that shape the human presence and mobility in space and time. To obtain more precise spatio-temporal knowledge about human presence and mobility, we need to develop a spatial interpolation technique for CDR data that considers the social structures and human practices.
This paper seeks to contribute to mobile phone-based research in the social sciences by addressing the need to give more attention to the spatial perspective of passively collected mobile phone data (e.g. CDR data) and by overcoming the uneven spatial accuracy problem in base station coverage areas. We suggest applying a dasymetric interpolation approach that is well known in population mapping studies. With this paper we aim: (1) to put forward a generic conceptual framework for a multi-temporal function-based dasymetric (MFD) interpolation method for mobile phone data; and (2) to empirically investigate how and to what extent the proposed method improves a night-time and daytime population distribution derived from mobile phone data. In doing so, this paper also contributes to dasymetric population modelling by combining spatial, temporal and volumetric mapping techniques while incorporating a functionbased population allocation with mobile phone data. To our knowledge, this is the first attempt to comprehensively apply a dasymetric technique for interpolating mobile phone data.
Next, we provide a brief overview on how the spatial perspective is considered in passively collected mobile phone-based research. We then introduce a dasymetric interpolation approach and put forward the conceptual framework of the proposed dasymetric method and demonstrate the empirical results in the case of Tallinn, Estonia. Finally, we discuss the method, obtained outcomes and future steps.
2. Background and literature review 2.1. The spatial perspective of passively collected mobile phone data To a large extent, the spatial accuracy of passively collected mobile phone data, such as CDR data, is determined by the locations of base stationsthe geographical location of each mobile phone use is allocated to the base station that provided the network signal. Hence, the spatial accuracy of data corresponds to the coverage areas of a mobile network base station, which is spatially not fixed and varies according to the density of base stations (for some exceptions, see, e.g. Calabrese et al. 2013). The simplest way to examine given data from spatial perspective is to use (aggregated) geographic locations of base stations (antennae) as point-based units. Latter approach is widely applied in individual mobility studies (González et al. 2008, Yuan et al. 2012, Järv et al. 2014) and its relation to social phenomena and processes (Phithakkitnukoon et al. 2012, Demissie et al. 2013, Järv et al. 2015, Wesolowski et al. 2015. However, this type of data is difficult to integrate and validate with other data sources. Certainly, CDR data is further assigned to discrete areal polygons using different techniques. Most commonly, a spatial overlay technique is used to attribute given data to predefined spatial (e.g. administrative and statistical) units given the locations of a base station , Trasarti et al. 2015, Williams et al. 2015. Conventional choropleth mapping allows for integrating and validating the mobile phone data with other attribute data in a given spatial division. However, choropleth mapping ignores the fact that: (1) the spatial distribution of base stations does not depend on predefined spatial unitsan administrative unit may have none or many base stations; and (2) the spatial division of coverage areas and predefined spatial units does not utterly coincidea base station located at one predefined spatial unit may provide a mobile network signal partly (or entirely) to other predefined units and thus represent CDR data from an area that is larger than where the given base station is actually located.
Another widely applied technique is to interpolate the study area to theoretical coverage areas (polygons) of a base station using a Voronoi tessellation (Delaunay triangulation) technique in Euclidean space (e.g. Wang et al. 2012) with some exceptions (see Iqbal et al. 2014, Jacobs-Crisioni et al. 2014. This way, station-based mobile phone data can be attributed to spatial units indicating base stations' coverage areas. The use of Voronoi tessellation provides more accurate spatial distribution as compared to previous approaches; however, the resulting spatial division is not compatible with other existing spatial divisions of administration and, hence, does not allow data integration or validation with data in other spatial divisions. Moreover, the two overlay techniques described above depend on the locations of base stations, and any changes in the spatial configuration of a mobile network over time may create biases in both longitudinal and repeated research.
The latter weaknesses can be overcome by applying a straightforward areal weighting spatial interpolation technique; the spatial division of coverage areas and its attributes are simply interpolated by areal intersection to desired spatial units such as statistical grids (Candia et al. 2008, Louail et al. 2014 or administrative areas (Soto et al. 2011, Deville et al. 2014. Areal weighting is also applied the other way round; predefined spatial units and its attributes are interpolated into coverage areas of base stations (Kang et al. 2012, Doyle et al. 2014. Indeed, using statistical grids or administrative areas enables data integration and validation with other data sources in any spatial division. However, this method assumes that mobile phone data as a proxy for people is equally distributed on a plane surface without considering temporality, environmental context or societal structures that shape human presence and mobility in space and time. Several attempts are made to apply more sophisticated spatial interpolation techniques to enhance the spatial accuracy of mobile phone data within a coverage area of a base station. Ratti et al. (2006) applied a neighbour linear interpolation method based on centroids of a Voronoi polygon to represent the probable population in a raster surface. Csáji et al. (2013) refined the spatial locations of people within each Voronoi polygon based on the probability density calculation from the maximum likelihood estimation and the signal decay of a base station. Similarly, Pei et al. (2014) applied a probabilistic distribution approach to refine the spatial accuracy of people within each Voronoi polygon to grid cells using the inverse distance weighting. Nevertheless, the given sophisticated spatial interpolation techniques still assume space as a plane surface and people in space as a function of the distance from a base station. To the authors' knowledge, only in the supporting information in Deville et al. (2014) was there a preliminary attempt made to apply the dasymetric technique by ancillary land-use data to interpolating CDR data.

Dasymetric interpolation approach and the integration of time
In population research, several spatial interpolation methods are used to go beyond aggregated and predefined census tract and administrative units for finer-scale spatial resolution (Eicher and Brewer 2001, Mennis 2003, Langford 2006. Given the feasibility and reliability of widely used dasymetric mapping, the dasymetric interpolation technique is one of the best approaches for refining the spatial resolution of population data (Wu et al. 2005, Mennis andHultgren 2006). The dasymetric technique for refining population distribution is successfully applied, for example, for studying tourism (Vaz and Campos 2013), accessibility (Langford et al. 2008, Shannon 2014, risk exposures and disaster management (Linard andTatem 2012, Smith et al. 2014).
In general, the dasymetric spatial interpolation technique transforms population data from one set of spatial units (source zones) into another spatial division (target zones) using additional ancillary data sourcesdata that can be related directly or indirectly to the spatial presence of people for assisting the interpolation. Physical environmental (land use, land cover and zonal data) information is widely used to extract a populated area from a non-populated area and further conduct spatial weighting based on selected attributes (Dobson et al. 2000, Ruther et al. 2015. Additional ancillary variables considered in dasymetric mapping include the national mailing information (Langford et al. 2008), addresses of businesses (Greger 2015), road network proximity and traffic counts (Smith et al. 2014), points-of-interests (Bakillah et al. 2014), night-imagery (Dobson et al. 2000), detailed cadastral maps (Maantay et al. 2007) and building functions with vertical and volumetric information (Aubrecht et al. 2009, Greger 2015, Biljecki et al. 2016. In recent years, dasymetric interpolation techniques have witnessed fast development given new data sources, improved statistical assessment for estimating accuracy, and advancing multiple areal interpolation with relational data sources, 3D modelling and spatio-temporal interpolation (see, e.g. Nagle et al. 2014, Ruther et al. 2015, Biljecki et al. 2016, Mennis 2016. Interestingly, the concept of the dasymetric interpolation technique is well suited for integrating the temporal perspective (hourly, weekday and seasonal rhythms) for population mapping (e.g. Martin et al. 2015, Mennis 2016. Temporally sensitive multidimensional dasymetric interpolation modelling of a population based on census data has already been elaborated upon since 2000 (Dobson et al. 2000, Aubrecht et al. 2009). Since then, different conceptual frameworks have been proposed for dynamic population distribution modelling while also distinguishing population subgroups, such as local residents, visitors and transit (Bhaduri et al. 2007, Aubrecht et al. 2014, Martin et al. 2015, Mennis 2016. In these studies, the spatial distribution of the population is interpolated as a function of time, while the spatial layer is related to timedependent ancillary data sources about human presence and activities (e.g. time use survey, activity travel diary).
However, these advanced dasymetric interpolation methods do not consider the verticality of a built environment, and vice versa, spatially advanced building population mapping methods do not consider temporality (Ural et al. 2011, Biljecki et al. 2016, except a study by Greger (2015). In urbanised societies, the vertical dimension and volumetric data about a built environment are two of the essential environmental attributes that determine the spatial distribution of people (Ahola et al. 2007, Biljecki et al. 2016. Thus, we propose a conceptual framework of the MFD interpolation for mobile phone data that also takes into account the verticality of a built environment, similar to Greger (2015). This enables obtaining more accurate spatio-temporal information about human presence and mobility derived from mobile phone data such as CDR, which increases the applicability and reliability of given data.

The conceptual framework of an MFD interpolation for mobile phone data
Since mobile phone data as a proxy for people is predominantly estimated by coverage areas of a base station (or antennae) as source zones, we propose a generic MFD interpolation method that first disaggregates the distribution of mobile phone data within each coverage area, and then aggregates mobile phone data to the desired spatial division as target zones using different spatially and temporally sensitive ancillary data sources (Figure 1). At minimum, three ancillary data sources are neededa spatial layer with land-cover data, volume (height) of built environment and time-dependent human activity data.
In general, the MFD interpolation method for mobile phone data has five modelling steps ( Figure 2): (I) the preparation of a physical surface layer; (II) the spatial disaggregation of a physical surface layer by both source zones (i.e. coverage areas of a base Figure 1. The representation of MFD interpolation methodpoint-based mobile phone data at base station level (a)is usually assigned to theoretical coverage areas of a base station as source zones (b) and interpolated to target zones using simple areal weighting method (c). The proposed MFD method disaggregates mobile phone data within each source zone using ancillary data (d) to transform it to desired target zones (e).
Figure 2. The conceptual framework for the proposed MFD interpolation method for refining mobile phone data as a proxy for people in five modelling steps. station) and target zone layers; (III) the integration of time-dependent human activity data with a disaggregated physical surface layer; (IV) the integration of mobile phone data with a disaggregated physical surface layer; and (V) the spatial aggregation of disaggregated surface layer to desired target zones as the output of the MFD method. In the MFD method, any desired spatial division can be used as target zones according to any research needstatistical grid cells (as in this case study), census tracts, administrative units, transport analysis zones or any other spatial division. Next, each modelling step is described in the following subsections.

The preparation of physical surface layer
First, the physical surface layer is prepared, where (1) the vertical dimension is incorporated and (2) a functional attribute is assigned, which can be linked to human activity data. The vertical dimension is essential since the verticality of a built environment determines the population density whereas the best indicator is the total floor area of a building (e.g. Biljecki et al. 2016). The method is flexible regarding the vertical dimension, and input data can be either ready-made data or estimated (from building footprints and heights) for the buildings' total floor area (m 2 ) or volume (m 3 ).
The activity function attribute of each spatial unit enables linking the estimated human presence by activity type in time with the physical surface layer. The method is flexible regarding the detail level of activity function classification as long as it is possible to link with time usage of people by classified human activity data (Eurostat 2009, UNSD 2016) and corresponds to available input data and research needs. We propose to apply six activity function typesresidential, work and school, retail and service, transport, restricted area and other areas, since these types have distinguished temporal usage patterns.
The required physical surface layer may be already available in case of detail cadastral data, including the vertical dimension or 3D models on a built environment (Biljecki et al. 2016). However, there are several ways to prepare a given layer, for example, by integrating land-use and/or land-cover data with building information; building footprints may be derived from cadastral maps, satellite imagery or volunteered geographic information databases such as OpenStreetMap (OSM) or GoogleMaps. Building height can be derived from areal imagery and remote sensing such as LIDAR measurements (Alahmadi et al. 2013). While information about the vertical dimension at building level is not available, more rough estimates may be applied such as the average height of a city block or a neighbourhood.
Building functions related to human activity types in the spatial layer can be further enhanced with additional crowdsourced data such as OSM and online address (business) registers (Greger 2015), and even assign several functionalities to a building if such data is available. Finally, each spatial unit in the physical surface layer includes three attributes: activity function type a; surface unit type (building; non-building) u; and a vertical dimension (number of floors; height).

The spatial disaggregation by source and target zones
In the second step of the MFD method, two spatial layers are applied in addition to the physical surface layerthe spatial division of mobile phone data as source zones and the spatial division of desired spatial units as target zones (Figure 1). The spatial division of mobile phone data is usually represented by the theoretical coverage areas of a base station using a Voronoi tessellation method. The desired spatial division of target zones is applied according to a research need.
First, a geometric union technique is applied for these three spatial layers to disaggregate the physical surface layer into subunits in order to designate each subunit to a unique source zone j and target zone z. Second, the area for each disaggregated subunit polygon is calculated, which enables calculating an estimated total floor area FA for each spatial subunit s by multiplying the area with the vertical dimension regarding the number of floors or height, whereas in both cases the default value is set to 1 for nonbuilding units. This allows for a calculation of a relative floor area RFA for each subunit s within a base station j given the total sum of estimated total floor area FA of all subunits within a base station j as follows: At this point, each disaggregated subunit in the physical surface layer has three additional attributes: source zone ID, target zone ID and a relative floor area RFA within a coverage area of a base station to which a given unit belongs.

The integration of time-dependent human activity data
In the third step of the MFD method, the disaggregated physical surface layer is linked with time-dependent human activity data based on the activity function type of a spatial subunit. The link allows for a more accurate interpolation of mobile phone data between disaggregated spatial subunits within each base station coverage area, while taking into account the probabilities of human activity types. In general, the probability of people to conduct certain activities at certain spatial subunits at a certain time can be acquired from a time-use survey (Eurostat 2009, UNSD 2016 or activity travel diary data (see Schönfelder and Axhausen 2010). Given data sources allow for an extraction of the average distribution of people by activity type in time that can be associated with spatial subunits based on the activity location type. Any classification of human activity types can be applied, as long as the activity classification is congruent with activity function types of a spatial subunit, such as associating home activities with residential areas. Ideally, the data source is from the case study area. However, the activity classification can represent a more general (e.g. national) level or even from a neighbouring country or elsewhere, as long as the classification represents human activity patterns similar enough for the case study area given the similar social and environmental contexts.
The temporal resolution of human activity data in the MFD method depends on available data sources and research needs. Thus, the basis of the method is the distribution of people by activity type at given time intervals, for example, hourly intervals. The estimated distribution of people by activity type (e.g. being at home) in a given time unit can be further refined if input data contains more fine-grained information about weekly (working day vs. weekend) and seasonal (e.g. summer vs. winter; holiday weeks and months) variations in mobility and activity patterns of people.
In this modelling step, an estimated human presence EHP in time unit t is calculated for each disaggregated spatial subunit s within a base station j given its RFA and timedependent probability for human presence, which is a combination of a daily hour factor H, a weekday factor W and a seasonal (monthly) factor M, according to a human activity function type a and spatial unit type u (land or building). The total sum of EHP of disaggregated spatial subunits within each coverage area of a base station j in time unit t is 1. EHP is calculated as follows: MFD method can be used to calculate EHP for the whole population, whereas it can also be calculated separately for each subgroup of a population (e.g. youth, working-age people, and elderly; local vs. non-locals) if the given division is available for both human activity data and mobile phone data. Another possibility is to calculate EHP separately for local residents and non-residents (e.g. transit, visitors and tourists). This would provide a more accurate refinement of population distribution within a base station. It is possible to derive the hourly distribution of people by activity type from time-use survey data. Activity travel diaries can provide weekday and seasonal differences in the probable human presence by activity type (Schönfelder and Axhausen 2010). Seasonal differences can also be derived from longitudinal time-activity studies where personal time use in indoors and outdoors (e.g. Hussein et al. 2012) can be linked to spatial unit type (building, land) as an indication for seasonal influence. This combines hourly, weekday and seasonal factors for calculating the distribution of people by activity type as a function of time.

The integration of mobile phone data
In the fourth step of the MFD method, mobile phone data is linked to spatial subunits in the physical spatial layer by base station j (source zones) and disaggregated into each spatial subunits according to its EHP by given time unit t (e.g. hourly intervals as in this study). While mobile phone data is considered as a proxy for population mapping, it is recommended to normalise mobile phone data to represent the relative population distribution within the study area. Thus, the total sum of the relative share of mobile phone data RMP in a study area S in time unit t is always 1. RMP for each base station j is calculated from all mobile phone data MP conducted within a study area S in given time unit t as follows: Next, normalised mobile phone data of each base station j in time unit t is interpolated into disaggregated spatial subunits s within a coverage area of a given base station j regarding EHP coefficients of each spatial subunit s. A relative observed population ROP for each disaggregated subunit s for given time unit t is calculated based on an estimated human presence EHP t and a relative share of mobile phone data RMP t in the given time unit t as follows: 3.5. The spatial aggregation to desired target zones In the final step, disaggregated spatial subunits in the physical surface layer are dissolved by desired target zones z (each subunit is already assigned to a unique target zone in the second step of the method). Thus, the share of relative observed population ROP t of each subunit s within each target zone z in time t is summarised to ZROP as follows: As an outcome, the distribution of a relative observed population derived from mobile phone data, which was originally assigned to the coverage areas of a base station, is now more realistically transformed to the desired spatial division of target zones. In the next section, an empirical implementation of the proposed MFD method is described.

Implementation of the proposed MFD method
We selected Tallinn, the capital city of Estonia, as our case study area to empirically test the proposed MFD method given the availability of mobile phone data. Tallinn is a medium-sized European city and the biggest city in Estonia, located within a total area of 145 km 2 and with over 400,000 inhabitants ( Figure 3). Based on the conceptual framework of the proposed MFD interpolation method for mobile phone data, we developed an open access tool for the Python programming language, which is freely available on GitHub (http://doi.org/10.5281/zenodo.252612).
We investigated mobile phone data from a 1-month study period in March 2015, linked to the 290 base stations that provide the coverage area for a mobile network signal for the case study area. Three different random mobile phone data sets at the base station level are applied: (1) raw CDR data from working days (Monday-Friday) at night (2 AM-6 AM); (2) raw CDR data during working days at daytime (4 PM-5 PM); and (3) the most probable home locations of mobile phone users derived from CDR data using the anchor point method . Comprehensive details about applied mobile phone data sets are provided in Section S1 in the Supplemental material.
In addition to mobile phone data (source zones), we applied five different ancillary data sets derived from six data sources (Table 1) for implementing the proposed MFD method for mobile phone data for Tallinn, Estonia. We applied two target zone layers to investigate the method for 500 m (0.25 km 2 , n =664) and 100 m (0.01 km 2 , n =16,280) statistical grid cell layers from Statistics Estonia.
Since ready-made data is not available, we prepared the physical surface layer using four ancillary data sets: (1) land-cover data to classify land-cover parcels by six activity function types (residential, work, retail and service, other, transport, restricted); (2) building footprint data to integrate buildings with the main usage function (residential and public, non-residential); (3) normalised Digital Surface Model (nDSM) data from LIDAR to calculate building heights for estimating the number of floors; and (4) the Open Street Map database to provide additional information to extract buildings with public, service and retail as the main functions. An overlay method was applied to create a physical surface layer where each disaggregated spatial subunit in the physical surface layer has three attributes: the number of floors, activity function type and spatial unit type of either building or land.
In the second step, a geometric union between the physical surface layer, the Voronoi polygon layer representing theoretical coverage areas and official statistical grid layer for target zones is applied. We received two different disaggregated physical surface layers since we are investigating the results of MFD method at two target zone levels. In the third step, we applied a national time-use survey data and calculated the average hourly distribution of people by activity type to estimate the human presence for spatial units  Residential population data at building level, which is aggregated to both target zones (100 m and 500 m grid cells) The Population registry data, Tallinn municipality 2015 in the physical surface layer. We also applied a seasonal factor to refine the estimated human presence for each disaggregated spatial subunit. In the fourth step, we applied all three mobile phone data sets and calculated the relative observed population according to MFD. Finally, we created six spatial output layers in the two target zones for three data sets. A detailed description of the implementation of the MFD method in Tallinn is provided in Section S2 in Supplemental material.
To compare the outcome of the MFD method, we created another six spatial output layers where population distribution derived from mobile phone data is interpolated to target zones using the simple areal weighting (AW) interpolation method. We applied population register data at the building level that is aggregated to both target zone layers as the best available baseline to empirically demonstrate how the proposed MFD method improves a night-time (sleeping) population distribution compared to the AW method. To evaluate the accuracy and validity of the proposed MFD method against AW method for refining population distribution derived from mobile phone data, we performed three statistical analysis: (1) linear regression for comparing correlation coefficients and standard error of the estimate (Maantay et al. 2007); (2) the mean absolute error; and (3) the coefficient of variation (CV) based on root mean square error of each target zone normalised by the baseline target zone value (Mennis 2016).

Population distribution by activity function type
In general, the proposed MFD interpolation method refines a night-time population distribution derived from mobile phone data by activity function type significantly better than areal weighting (AW) when compared to reference data (Figure 4). For the case of a night-time population distribution, some 94% of people are at home according to a national time-use survey. In comparison, the proposed MFD method relocates 86% Figure 4. The population distribution by activity function type by MFD and AW interpolation methods regarding three mobile phone data sets in comparison to population register and national time-use survey data. of a night-time population derived from night-time call activities to residential areas, whereas with AW the given share is only 22%. In the case of a sleeping population distribution at night, we can consider that 100% of population register data indicates a sleeping population in residential areas. In comparison, we applied home anchor points of mobile phone users derived from call activities as an indicator for the sleeping population. The MFD method relocates 100% of home anchor points to residential areas, whereas the given share is only 24% in the case of the AW method.
During the day (as an example of 4 PM to 5 PM), population distribution by activity function type between the two interpolation approaches is less obvious, although population division with the MFD method coincides more with a national time-use survey.

Night-time population distribution in target zones
A visual comparison between two interpolation methods in interpolating the spatial distribution of a night-time population derived from home anchor points ( Figure 5) and call activities conducted between 2 AM and 6 AM ( Figure S3 in Supplemental material) reveals distinct spatial differences regarding both target zones; the MFD method relocates the population to residential areas compared to the AW method, which distributes people equally in space.
The comparison of population register data as a baseline distribution for a night-time population distribution and population distribution in the case of home anchor points ( Figure 6) and call activities ( Figure S4 in Supplemental material) demonstrates the differences with the baseline distribution between two interpolation approaches, especially in case of 100 m grid cells. In general, the population distribution using the MFD method has fewer differences with population register data, whereas the influence of the method is more significant in areas where larger coverage areas of a base station occur.
Interestingly, the given difference maps reveal an inherent weakness of mobile phone data. There is a tendency to systematically overestimate the population in the city centre (blue colour) where human communication is significantly more active. However, the aim of the proposed MFD method is not to resolve the given bias, but to refine mobile phone data (e.g. population) within the coverage areas of a base station. Thus, the proposed method relocates the population into more populated target zones and increases the number of target zones with no population as compared to a simple AW method ( Figure S5 and Figure S6 in Supplemental material).  Table 2 summarises the statistical comparison of the MFD method and simple AW method for refining mobile phone-based population distribution in comparison to population register data as a baseline distribution for a night-time population. Regardless of the evaluation measure, the MFD method outperforms the AW method; the former has higher correlation coefficients with a lower standard error ( Figure S7 and Figure S8 in Supplemental material), lower mean absolute error and coefficient of variation of RMSE, especially for the 500 m × 500 m target zone resolution. The MFD method improves population distribution more for home anchor points since it corresponds better with population register data, as both indicate a sleeping population. The finding supports the visual evidence presented in the previous subsections suggesting that incorporating spatio-temporally dependent ancillary data in interpolating mobile phone data by coverage areas of base station clearly improves the refinement of human presence to target zones as compared to a simple areal weighting interpolation method.

Daytime population distribution in target zones
A visual comparison between two interpolation methods in interpolating the spatial distribution of a daytime population derived from call activities conducted between 4 PM and 5 PM (Figure 7) reveals fewer distinct spatial differences regarding both target zones. During a given time period, call activities are clearly concentrated to the city centre, where a dense base station network already exists, and the proposed MFD method has a less significant effect on refining population distribution as compared to a night-time population.
Regardless of target zone resolution, MFD method has a less significant effect in relocating a daytime population to target zones within the study area (Figure 8) given the strong correlation in population distribution between two interpolation approaches ( Figure S9 in Supplemental material). On a smaller scale, however, some significant population relocations occur at grid level. In the case of 500 m target zones, for example, in almost 50% of all target zones, the MFD method relocates ±0.01-0.05% points of the relative observed population ( Figure S10 in Supplemental material). This indicates ±40-200 relocated people in a target zone as compared to the population estimate by the AW method. Furthermore, almost 10% of all target zones have more than ±200 relocated people in a target zone with the MFD method.

Discussion and conclusions
The implementation of passively collected mobile phone data such as CDR data to investigate human presence and mobility is mushrooming in social sciences. However, the given data has theoretical and methodological challenges that one needs to acknowledge. This paper gives attention to one of the important challenges that has not been widely discussed to datethe uneven spatial resolution of CDR data due to the uneven spatial configuration of mobile network base stations, and its spatial interpolation. We introduced a dasymetric interpolation approach as one promising way to overcome this. Thus, we proposed a generic conceptual framework of the multitemporal function-based dasymetric (MFD) interpolation to enhance the spatial resolution of mobile phone data, while taking into account the socio-spatial structure (land use, building volume) and time-dependent human activity data.
The empirical findings demonstrate that the applied MFD interpolation method transforms CDR data as an indication of population distribution to desired target zones (statistical grid cells) more accurately than a simple areal weighting method. The finding was confirmed by evaluating the results of both interpolation methods against population register data as a baseline for a night-time population. In general, the results were consistent with call activity and home anchor point data sets and two different target zone levels with 100 and 500 m resolution. The MFD method had less of an effect for a daytime population interpolation compared to the areal weighting method at the city level; however, it refines the population between target zones significantly on a smaller scale, which is essential, for example, in planning and risk assessment at the micro level.
As the results clearly show, the proposed interpolation method refines the population distribution estimates within a base station coverage area. Yet, it does not resolve the inherent bias of mobile phone datathe fact that it is influenced by the variance of human interactions with mobile devices in space and time, whereas one spatial outcome is the tendency to overestimate the city centre, given its higher probability for human interaction. However, the proposed MFD method can reduce the given biases to some extent when introducing population subgroups to the method, which is similar to Martin et al. (2015).
In general, the method can be improved by providing more accurate ancillary data about the socio-spatial structure and human activities or by incorporating additional data sources. For instance, traffic count data refines an estimated human presence for different sections of a road network. Also, more accurate knowledge about coverage areas of a base station would improve the spatial interpolation of CDR data since commonly (also in this case study), a straightforward Voronoi tessellation for theoretical coverage areas of a base station is applied, which does not coincide with the actual coverage areas. Certainly, more detailed research about the sensitive scale issue of EHP at different spatial resolutions and the comprehensive validation of accuracy estimates of the method need to be conducted in future.
We did not demonstrate how the proposed MFD method can be applied to examine societal issues. However, the method is generally applicable in the broad field of social sciences since the method: (1) can transform mobile phone data to any desired spatial division of target zones (e.g. grids and census tracts) given the research need; (2) enables reliable data comparison, integration and validation of mobile phone data with other data sources; (3) enables reliable mobile phone data interpolation for longitudinal and repeated research; and (4) can improve a systematic estimate of population distribution in time and space from a small scale (e.g. neighbourhood level) to a country level (e.g. global level), depending on mobile phone and ancillary data availability. Thus, the method can provide dynamic population distribution mapping to assess object-oriented risks (Smith et al. 2014) or to model more realistic spatio-temporal accessibility to services (Tenkanen et al. 2016).
We only applied the proposed interpolation method for population modelling at an aggregated level; however, it can also be used for person-based research for refining personal mobility patterns and activity spaces at the individual level. The method can be developed to refine the probable location of an individual within a base station depending on the individual's spatio-temporal activities derived from one's mobile phone usage. For example, individual meaningful locations (e.g. home and work) derived from one's CDR data at the base station level  allow the model to calculate personal probabilities for a given individual to be associated with a certain type of spatial subunit within certain coverage areas in a given time. Certainly, one must acknowledge and preserve the privacy of phone users regarding the use of mobile phone data (see, e.g. Yin et al. 2015).
We conclude that the proposed multi-temporal, function-based dasymetric interpolation method offers a promising approach to increase the usability and reliability of passively collected mobile phone data. Our empirical findings provide a solid base to improve the spatial interpolation methods for mobile phone-based research. Furthermore, this study contributes to developing dasymetric population modelling by incorporating mobile phone data. To facilitate the future development of the proposed approach, we share the implementation of our method in GitHub as open access Python code.