Construction of a refined population analysis unit based on urban forms and population aggregation patterns

ABSTRACT The population analysis unit (PAU) is the basic unit employed in studies of urban populations. The commonly used PAUs are mostly administrative divisions and regular geographic grids. However, these units do not follow urban forms and cannot capture the characteristics of population distributions and flow changes. In this study, we proposed a method for constructing a fine population analysis zone (FPAZ) based on the population aggregation pattern and urban form elements. First, considering the spatial structure of a city and the fine-grained demands of population analysis, the basic analysis unit was divided according to the functional heterogeneity of the population activity region at the micro-scale by combining urban form elements. Next, a population aggregation preference model was established by considering the spatial distribution characteristics of the local aggregation of the urban population flow and the long-term stability characteristics that depend on the dynamic changes at entrances and exits. Finally, we delineated the FPAZs by combining these results with the microstructural elements. Experimental results showed that, compared with other types of PAUs, the FPAZ was more consistent with the urban morphology and was an appropriate and general spatial unit for accurately expressing the characteristics of population distributions and changes at the micro-scale.


Introduction
Urban population analysis aims to obtain the spatiotemporal information, structure, and patterns of population distribution and population changes. We collectively refer to the basic spatial units and spatial objects used in spatiotemporal surveys, rule mining, and visualization of urban population analysis results as population analysis units (PAUs). Static population distributions or dynamic population changes can be represented by PAUs, and then used for spatiotemporal analysis. The shape, area, and spatial continuity of PAUs directly determine the precision of population representations and the accuracy of analytical results (Zhao P and Zhao S 2016). Therefore, an appropriate and general PAU division method is essential for effectively expressing urban population activity information, analyzing spatiotemporal population change patterns, and supporting population-related applications (Wong 2009).
Various units have been used or constructed in long-term urban population studies. At present, considering the conditions and application requirements for PAUs, the main division methods are based on research data, the requirements of the application scenario, and spatial scale. However, these methods mainly consider objective conditions, such as the data quality, spatial scale, and application requirements, and they do not address essential requirements, such as pattern analysis of population distributions and changes. Following the hierarchy of urban administrative management units, PAUs with an average area smaller than the township level are defined here as micro-scale. At the micro-scale in particular, the results obtained using these three methods are affected by various problems, such as differences from the urban form, low continuity of the spatial units, and insufficient spatial resolution, thereby leading to inaccurate representations of population information and poor universality. Thus, these methodological approaches cannot support detailed and accurate long-term analysis.
In the present study, based on a detailed analysis of related research, we described the main classifications and research results obtained using the current PAU division methods. We then proposed a method that considers the characteristics of the population distribution and flow changes at the micro-scale, reduces data dependence, and conforms to the urban spatial structure. Our method supports micro-scale investigations of urban population distributions and dynamic changes in their characteristics, as well as pattern analysis.

PAU based on the spatial scale
Area is a basic attribute of PAUs, so construction based on spatial scale requirements is the most generally applied method. According to previous research, the spatial scales used in urban population research mainly comprise the macro-, meso-, and micro-scales (Dong et al. 2019).
Macroscopic and mesoscopic population information results are the traditional information used to support overall urban planning and management, and thus PAUs have generally been constructed based on this scale to conduct population analyses. Administrative management units are traditional macroscopic and mesoscopic units. These units are provided by governments and they can directly match census data and other types of statistical data, and thus they are generally used as standard verification units. Patel et al. and Zhao et al. obtained population distribution data within administrative divisions and demonstrated the effectiveness of their models based on comparisons with statistical data (Patel et al. 2017; Zhao et al. 2019). Analysis using administrative divisions also provides intuitive feedback for government management and policy planning based on scientific data (Mennis 2009). Wilson et al. evaluated the range of impact for the 2015 Nepal earthquake and the population distribution based on the boundaries of district administrative units (Wilson et al. 2016). Zhu et al. analyzed the spatial and temporal changes in the urban population based on street level administrative units (Zhu et al. 2017). Chen et al. assessed the real-time exposure of the urban population to PM2.5 based on district level administrative units in China (Chen, Song et al. 2018b). In addition, to study the spatial patterns of urban macro-populations, some researchers treated the urban circular structure as the spatial unit (Ma 2006; Mao et al. 2019). Due to increases in populations and the diversification of urban development, fine-scale population analysis has become the key issue in urban management but it is restricted by the spatial scales of the traditional administrative units. Thus, some researchers have used buildings as PAUs because they are natural micro-elements in cities. Buildings can provide more refined spatial or attribute data to associate with population information.
Qiu et al. and Lwin et al. used building volume information to establish a static population distribution estimation model (Qiu, Sridharan, and Chun 2010; Lwin and Murayama 2011). In addition, building information can be combined with geographical names and addresses (Ural, Hussain, and Jie 2011), land use (Bakillah et al. 2014), and other types of spatial geographic data to improve the accuracy of analyses. Furthermore, the effective acquisition and modeling of building information are conducive to the mining of spatial and temporal population patterns (Yao et al. 2017).
In general, administrative divisions and urban circles are mainly used at the macro- and meso-scales to obtain overall population information for a city. Building units can provide high-precision spatial information to support population analysis, but their spatial continuity is low, so they represent dynamic population information poorly.

PAU based on the requirements of application scenarios
Different application scenarios have specific demands for the size, shape, and attributes of PAUs. Therefore, the scenario is an important consideration in the method used to divide PAUs. Examples include urban transportation, health care, functional structures, and residential life. In urban traffic analysis, previous studies usually employed traffic analysis zones (TAZs), which are divided by traffic management departments based on the main roads (Yuan, Zheng, and Xie 2012). For instance, Cheng et al. studied the travel patterns of an urban population based on TAZs and Global Positioning System data from taxis (Cheng, Liu, and Gao 2016). Further, some researchers studied regional transportation accessibility using related facility elements (Tahmasbi and Haghshenas 2019). Health care researchers usually consider the geographic elements of health scenarios and related events. Mayaud et al. evaluated the fairness of urban residents' access to health care from a community perspective (Mayaud, Tran, and Nuttall 2019). Liu et al. and Wesolowski et al. studied the spatiotemporal characteristics of population arrival at public health facilities based on buffer zones around medical facilities (Wesolowski et al. 2015; Liu et al. 2019). Functional structure researchers consider the unit of urban land planning or the area of population activity. Some studies treated TAZs as spatial units according to the restrictive effect of main roads on population activities (Yuan, Zheng, and Xie 2012; Yao et al. 2019). Other researchers analyzed functional structure based on the elements of urban facilities such as subway stations (Gong, Lin, and Duan 2017). Residential life researchers consider representative residential regions to match population attributes. To estimate the number of people living in the city, Zoraghein et al. chose residential land as the spatial unit (Zoraghein et al. 2016). Gao et al. selected residential areas for the elderly to study the characteristics of a specific population (Gao, Wu, and Yan 2020). Nowak Da Costa et al. selected urban building areas to evaluate the impact of floods on the urban population, and constructed factors such as the building's function and the number of residents (Nowak Da Costa, Calka, and Bielecka 2021).
In general, the methods described above were based on the spatiotemporal characteristics of the scenario and its demand for population attribute information. However, these units are highly context-dependent, and their universality is weak.

PAU based on research data
Research data are important sources of information for urban population analysis and are used to carry out statistics and model input feature extraction. Therefore, to match the recording form of research data or to address limitations due to their quality, some researchers constructed PAUs with data as the core to accurately obtain the spatial location and population-related attribute information they contain.
Raster data are among the research data most commonly used in urban population analysis. To match the resolution, previous researchers used geographic grids comprising regular polygons to obtain effective representations in pixels. In particular, some studies used night light data at different resolutions to investigate the population distribution characteristics based on grids (Wang, Fan, and Wang 2019). Other data with similar characteristics include remote sensing spectral data (Webster 1996) and land use types (Langford 2006). Furthermore, because they are divided by regular rules, geographic grids can also support the rapid statistical analysis of other types of information in the units and the integration of multi-source data to perform population analysis (Calka, Nowak Da Costa, and Bielecka 2017; Zhang et al. 2018; Sinha et al. 2019). As multiple types of sensor big data have been applied to population analysis (Li, Huang, and Emrich 2019), and given the limitations of the recording methods employed, some researchers divided PAUs based on the data acquisition characteristics and mathematical theories to conduct population analysis in a scientific manner (Řezník, Horáková, and Szturc 2015). According to the mobile signaling data acquisition method, Deville et al. and Kubíček et al. established Voronoi polygons centered on the locations of mobile base stations to perform statistical analysis of the data (Deville et al. 2014; Kubíček et al. 2019). To study population movement patterns, Ma et al. constructed Thiessen polygons around 173 subway stations based on the recording mode of smart card data (Ma et al. 2017).
In general, the methods described above are mainly based on the form or quality of research data to accurately extract spatiotemporal and attribute information such as location, function, or density, thereby facilitating data fusion and effectively improving the accuracy of population spatiotemporal analysis.
In summary, population analysis is strongly supported by these methods, but they also have some drawbacks, especially at the micro-scale. Stable patterns of spatiotemporal population change only appear within a certain range of spatial scales, yet methods are lacking that consider both the representation of population distributions and changes and the long-term regularity of the analysis results.
Thus, we designed a method for constructing a micro-scale spatial unit for population analysis, called a fine population analysis zone (FPAZ), based on the urban population aggregation pattern and urban form elements. In the first step, we combined the elements that delimit urban spatial regions, such as urban main roads and water systems. The basic analysis unit (BAU), which is an intermediate unit, was divided according to the functional heterogeneity of the population activity region at the micro-scale and the urban form of the city by considering the urban spatial structure and the fine-scale needs of population analysis. The second step involved building a population aggregation preference model by considering the spatial distribution characteristics of local aggregation in urban population flows and the long-term stability characteristics that depend on the dynamic changes at entrances and exits. In the last step, the microstructure elements in the BAUs were combined with local entrances and exits to divide the BAUs into FPAZs suitable for representing and analyzing the characteristics of the population distribution and dynamic changes. Finally, we used the FPAZ construction method to divide the spaces in Shanghai. Based on three applications comprising functional zone identification, static population distribution estimation, and dynamic population prediction, we verified the adaptability of the FPAZs to the urban form and their capacity to represent the population distribution characteristics and population changes, and we compared them with other spatial units.

Methodology
In the present study, we aimed to construct a spatial unit that effectively represents the characteristics of population distributions and changes and that provides basic spatial objects for analyzing and obtaining stable spatiotemporal population patterns at the micro-scale, by considering the urban form and the population aggregation pattern. Considering the diversity of population distributions and spatiotemporal differentiation, we constructed FPAZs based on the population flow aggregation model and urban form elements, so that the method combines the static structure of a city with the dynamic changes in its population. This method had three main components: divide the study area into BAUs, define the population aggregation preference model, and subdivide the BAUs into FPAZs.

Population aggregation pattern
The idea of FPAZ division in our method is based on the characteristics of water system flows and natural lake formation. According to the flow process in natural water systems, water flows from its source through rivers, tributaries, streams, and other multi-level water networks, before eventually forming multiple lakes with irregular shapes. The shape, size, and water sources are relatively stable for these lakes (Figure 1) after their long-term natural evolution.
Based on the urban population distribution and change process, we suggest that the population flow and aggregation patterns are similar to the characteristics of the process involving the flow of water into lakes. As shown in Figure 2, the population moves through a multi-level traffic network and finally aggregates in local spatial regions through entrances and exits. These aggregation areas form irregular spatial units restricted by urban form elements. Therefore, these units should have the following characteristics to simulate a lake formation process.
(a) The boundary and shape of the unit conform to the natural urban form, and the spatial granularity is smaller because the population is concentrated in local areas. (b) Due to the natural urban ecology process, the characteristics of the population distribution and changes in the unit exhibit long-term stability. (c) The unit has one or more fixed entrances and exits to allow population flows, and the aggregation and changes in the population are associated with them.
These characteristics mean that the units are spatially small but also suitable for analyzing the characteristics of population distributions and changes, which are the results of the interactions between population activities and urban forms. Therefore, the population aggregation model and urban form elements are combined to construct the FPAZs.

Construction of BAUs based on urban forms
Form elements have direct limiting effects on spatial urban divisions, the formation of fine spatial units, and changes in population flows and aggregation patterns. Previous studies have shown that urban roads provide transportation facilities for urban population movement and that they are the most basic urban form element, with important effects on the understanding of urban functions, spatial structure, and planning management (Gong, Lin, and Duan 2017). Therefore, we used urban form elements to define the spatial scope and scale, and divided the study area into BAUs as intermediate units.
The traffic analysis zone (TAZ) is a commonly used PAU based on urban form elements. However, TAZs do not consider the functional heterogeneity of the internal and external spaces formed by the main road (Figure 3), which increases randomness and impairs the accurate analysis and representation of fine-grained population distributions and changes.
To address this problem, based on the topological processing of urban form elements, such as main roads, main water systems, and green belts, we integrated the urban forms toward the population aggregation areas at the micro-scale and filtered out the polygons containing multi-level roads, water systems, and green areas (Figure 3(d)). One difficulty is that urban roads have many levels and uneven spatial distributions, so the road polygons are sparse and difficult to extract. Considering their geometric properties, the areas of road polygons are generally smaller than those of TAZs, so a minimum-area threshold can be set with reference to the minimum area of a TAZ. According to the spatial topological characteristics, the road polygons only contain the spatial elements related to road facilities. Therefore, the steps required to extract the road polygons are as follows.
(a) In the partition region R, select the trunk road set with the highest level L_i (i = 1, 2, 3, …, r, where i = 1 is the highest level), merge it with the regional boundary B, and construct the spatial unit set U_i = {ut_i1, ut_i2, ut_i3, …, ut_in} by spatial topological processing that transforms the line elements into surface elements. Next, the area uta of every unit in U_i is obtained to construct a histogram of the area distribution, and the minimum area threshold minArea is determined from the mutation point (abrupt change) in the histogram bin counts. For any ut_ip (p = 1, 2, …, n), if the area uta_ip is less than minArea and the unit contains no elements other than the road facility type, the unit is marked as a road polygon.
(b) In addition, if uta_ip is greater than minArea and the unit contains only the spatial elements associated with road facilities, then the unit is also marked as a road polygon.
(c) Based on U_i, the road polygons are filtered out and the secondary main road set L_i+1 is selected and merged to form the spatial unit set U_i+1. Steps (a) and (b) are then repeated based on U_i+1 until all the levels of the main roads have been merged to complete the extraction of the road polygons.
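The marking rules in steps (a) and (b) can be illustrated with a minimal Python sketch. It assumes the polygons and the elements they contain have already been derived by a GIS overlay, and it takes minArea as a given parameter rather than deriving it from the histogram. The `Unit` class, the `ROAD` label, and the sample values are hypothetical and are not part of the original method description.

```python
from dataclasses import dataclass, field

ROAD = "road_facility"  # hypothetical label for road-facility elements


@dataclass
class Unit:
    """One polygon produced by turning road lines into surface elements."""
    uid: str
    area: float                                       # area uta of the unit
    element_types: set = field(default_factory=set)   # element types inside it


def is_road_polygon(unit, min_area):
    """Apply rules (a) and (b) to a single unit."""
    if unit.area < min_area:
        # Rule (a): small unit with nothing other than road facilities
        # (an empty unit also qualifies under this reading).
        return unit.element_types <= {ROAD}
    # Rule (b): larger unit that contains road facilities and nothing else.
    # Units with area exactly equal to min_area are handled here (assumption).
    return unit.element_types == {ROAD}


def filter_road_polygons(units, min_area):
    """Split the units of one road level into (kept_units, road_polygons)."""
    kept, roads = [], []
    for u in units:
        (roads if is_road_polygon(u, min_area) else kept).append(u)
    return kept, roads


units = [
    Unit("u1", 120.0),                            # sliver between carriageways
    Unit("u2", 90.0, {ROAD}),                     # small, road facilities only
    Unit("u3", 80.0, {ROAD, "shop"}),             # small but has other elements
    Unit("u4", 5000.0, {ROAD}),                   # large interchange
    Unit("u5", 8000.0, {"residential", "shop"}),  # ordinary block
]
kept, roads = filter_road_polygons(units, min_area=500.0)
print([u.uid for u in roads])  # ['u1', 'u2', 'u4']
print([u.uid for u in kept])   # ['u3', 'u5']
```

In step (c), the same filter would be reapplied after each lower road level is merged into the remaining units.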
Finally, according to the analogy described in Section 3.1, we treated a river as a natural factor that affects traffic where the randomness of the population change is similar to that on urban roads, so the water polygons are filtered and the relatively stable lakes are retained as parts of the BAUs. Based on these methods, by acquiring and merging the urban form elements as well as considering the functional heterogeneity of units at the micro-scale, we filtered the multilevel road polygons and main river polygons by combining their attributes and spatial characteristics to finally obtain BAUs that define the spatial scope.

Population aggregation preference model
At the micro-scale, the stability of population changes is affected by the urban forms but it is also related to the behavior of the population. According to the characteristics of the lake formation process described in Section 3.1, water flows into a lake through fixed water intakes. Similarly, the distributions and changes in populations in local areas are mainly dependent on entrances and exits that allow flows under the restrictions of urban form elements. Thus, people can choose their preferred entrances and exits according to the aims of their activity, distance, and other factors that affect movement behavior. Long-term stable spatial aggregation is generated by behavioral similarity, and a spatial unit with a stable population distribution and stable changes is formed around the entrances and exits. Therefore, obtaining the main entrances and exits that affect population movements is essential for constructing FPAZs.

Definition and extraction of entrances and exits
In this study, entrances and exits are defined as fixed channels that allow the inflow and outflow for the population of the local area, which is generally visualized in the form of point elements in space. Entrances and exits are not two kinds of different point elements, but a general term with elements that have the same function. Therefore, entrances and exits have the following characteristics.
(a) Attribute characteristics: they represent the entrances to functional areas for population activities and they carry grade or orientation information. (b) Spatial characteristics: they are distributed near the boundaries of local areas.
According to the attribute characteristics, we obtained the entry and exit elements that were consistent with the attribute descriptions by constructing a semantic dictionary, with main categories comprising the orientation, class, number, and direct description (Table 1).
Furthermore, we spatially overlaid the internal roads that were directly connected with the boundary of a BAU to obtain the entrance and exit elements that were consistent with the spatial characteristics (Figure 4).

Construction of the population aggregation preference model
Considering the preference characteristics of the population flow, the entrance and exit elements with different grades and spatial locations have diverse effects on the flow and aggregation of the population. According to the description in Section 3.1, the main entrances and exits with large population flows and prominent positions significantly affect the aggregation of populations in local areas. Therefore, in our proposed method, we considered the behavioral pattern of the population, constructed a population aggregation preference model, and obtained the main entrance and exit elements according to their characteristic semantic attributes and spatial locations.
Based on the two characteristics described above, the population aggregation preference model marks the main entrance and exit elements according to the semantic dictionary and population flow simulation.
First, according to the definition and method described in Section 3.3.1, some of the entrance and exit elements were obtained from the matching results using the semantic dictionary. The steps in this process are as follows.
(a) Pedestrian entrances and exits with low flow characteristics are not the main population flow nodes. Filter out these elements according to the word 'Pedestrian' in the 'Class' category in the semantic dictionary.
(b) Based on the location, the entrance and exit elements with major positions in space are extracted according to similar semantic words such as 'Main' and 'Front' in the 'Orientation' category in the dictionary and then marked as the main entrance and exit elements.
(c) According to the hierarchical similarity characteristics, the entrance and exit elements with the same name but different serial numbers have similar effects. Therefore, the main entrances and exits were marked based on the specific serial number in the 'Number' category in the dictionary.
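The three semantic rules above can be sketched in Python as a keyword lookup. The words 'Pedestrian', 'Main', and 'Front' come from the text; the dictionary layout, the choice of serial number 1 as the marker of a main gate, and the sample names are assumptions made only for illustration, since Table 1 is not reproduced here.

```python
# Hypothetical excerpt of the semantic dictionary; Table 1 holds the real one.
SEMANTIC_DICT = {
    "class_low_flow": {"pedestrian"},       # 'Class' words used for filtering
    "orientation_main": {"main", "front"},  # 'Orientation' words for main gates
    "number_main": {"1"},                   # assumed: serial number 1 is the main gate
}


def label_by_semantics(name):
    """Return 'filtered', 'main', or 'undecided' for an entrance/exit name."""
    tokens = name.lower().replace("no.", " ").split()
    if any(t in SEMANTIC_DICT["class_low_flow"] for t in tokens):
        return "filtered"   # step (a): low-flow pedestrian gates are dropped
    if any(t in SEMANTIC_DICT["orientation_main"] for t in tokens):
        return "main"       # step (b): orientation words mark a main gate
    if any(t in SEMANTIC_DICT["number_main"] for t in tokens):
        return "main"       # step (c): specific serial number marks a main gate
    return "undecided"      # left for the population flow simulation


for gate in ["Campus Main Gate", "Park Pedestrian Entrance",
             "Hospital Gate No.1", "Hospital Gate No.3"]:
    print(gate, "->", label_by_semantics(gate))
```

Elements labeled 'undecided' in this sketch correspond to those that still need the population flow simulation described next.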
Next, considering the spatial location characteristics, the effects of the entrance and exit elements in different locations on population aggregation in the BAUs were different. We then labeled the main entrance and exit elements using the population flow simulation method based on the shape, area, and spatial distribution characteristics of the BAUs. Population flow simulation is an idealized population movement fitting method combined with an urban vehicle road network. The aim is to compare the differences in the geographical accessibility distances of multiple entrances and exits within the same BAU. Therefore, the steps followed for marking based on the population simulation method are as follows.
(a) First, consider the population flow features for geographical areas inside or outside the division region R, and then select a geographic location as a virtual starting point Op_j (j = 1, 2, 3, 4) for population movements. All of the entrance and exit elements are set as the target locations (Figure 5), and the main roads (Figure 6(a)) are treated as the network for population movements. In addition, to consider only the effects of geographical distance on population movements, the Dijkstra shortest path algorithm (Bertsekas 1993) is used to construct the population flow path network set EP_j = {ep_j1, ep_j2, …, ep_jk} (Figure 6(b)).
(b) Considering the spatial topological characteristics of road intersections, each population flow path is expressed as the ID sequence of the intersection elements it passes (Figure 7(a)). For example, the path shown in Figure 7(b) is represented as the number sequence '0607030405'. The difflib algorithm (Myers 1986) is then used to calculate the path similarity rs for each simulation in the same BAU. The difflib algorithm is based on the longest common subsequence (LCS) problem (Formula (1)) combined with the idea of dynamic programming (Formula (2)) and the gestalt pattern matching algorithm (Figure 8). The algorithm is implemented by using Python's open difflib library. The exits and entrances are then marked based on the similarity threshold tv. In the present study, tv was set according to the sensitive walking distance of the urban population to infrastructure points, i.e. 400 m (Teh et al. 2019).
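Steps (a) and (b) can be illustrated with a short, self-contained Python sketch: a textbook Dijkstra search over a toy road graph, followed by `difflib.SequenceMatcher` applied to the resulting sequences of intersection IDs. The graph, its edge lengths, and the helper names are hypothetical; only the use of Dijkstra and of Python's difflib comes from the text.

```python
import heapq
from difflib import SequenceMatcher


def shortest_path(graph, source, target):
    """Dijkstra on a dict-of-dicts graph; returns a list of intersection IDs."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt, length in graph[node].items():
            nd = d + length
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return []               # target unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def path_similarity(path_a, path_b):
    """Similarity rs of two ID sequences (difflib ratio, between 0 and 1)."""
    return SequenceMatcher(None, path_a, path_b).ratio()


def add_edge(graph, a, b, length):
    """Insert an undirected road segment of the given length (metres)."""
    graph.setdefault(a, {})[b] = length
    graph.setdefault(b, {})[a] = length


# Toy network: intersection "06" plays the role of the virtual origin Op_j,
# intersections "05" and "08" stand for two entrances of the same BAU.
road_net = {}
for a, b, length in [("06", "07", 100), ("07", "03", 100), ("03", "04", 100),
                     ("04", "05", 100), ("03", "08", 120)]:
    add_edge(road_net, a, b, length)

path_v = shortest_path(road_net, "06", "05")
path_w = shortest_path(road_net, "06", "08")
rs = path_similarity(path_v, path_w)
print("".join(path_v), "".join(path_w), round(rs, 3))  # 0607030405 06070308 0.667
```

Comparing lists of IDs rather than the concatenated string avoids accidental matches across ID boundaries; this is a design choice of the sketch, not something the text prescribes.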
(c) For any two entrances and exits in a BAU, calculate rs for the corresponding flow paths ep_jv and ep_jw (v = 1, 2, …, k; w = 1, 2, …, k; v ≠ w). If rs < tv, the two entrance and exit elements are each marked as main entrance and exit elements. If rs > tv, the two entrance and exit elements are marked as belonging to the same cluster. These operations are performed on each entrance and exit element in turn to obtain the clustered set CTR = {ct_1, ct_2, ct_3, …, ct_x}. Next, for the entrance and exit elements in any cluster ct_g (g = 1, 2, …, x), by considering the representative characteristics of the center of the spatial cluster, calculate the geographic distance dis of each corresponding path to obtain the distance set DS = {dis_g1, dis_g2, dis_g3, …, dis_gy}. We obtain the new sequence SQ = {s_g1, s_g2, s_g3, …, s_gy} by sorting DS and then select the median element of SQ to mark the main entrance and exit.
[…] the path similarity, and selection in only one direction may result in the main entrance not being marked. For example, the two paths determined based on the geographical orientation in Figure 9(a) have high similarity, so there is only one main entrance and exit. However, the paths determined based on another orientation in Figure 9(b) have low similarity, so they will be marked as two main entrances and exits.
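The clustering and median selection in step (c) can be sketched as follows. This is one possible reading, not the authors' implementation: entrances are linked transitively whenever rs > tv (a union-find grouping), the lower median is used for clusters of even size, and tv is treated as a free similarity threshold between 0 and 1, because the mapping from the 400 m walking distance to a similarity value is not given here. All names and sample values are hypothetical.

```python
from difflib import SequenceMatcher


def cluster_entrances(paths, tv):
    """Group entrances whose simulated paths have similarity rs > tv.

    paths maps an entrance ID to its path (list of intersection IDs).
    """
    ids = list(paths)
    parent = {e: e for e in ids}

    def find(e):
        # Follow parent links to the cluster representative.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for i, v in enumerate(ids):
        for w in ids[i + 1:]:
            rs = SequenceMatcher(None, paths[v], paths[w]).ratio()
            if rs > tv:
                parent[find(w)] = find(v)   # same cluster

    clusters = {}
    for e in ids:
        clusters.setdefault(find(e), []).append(e)
    return list(clusters.values())


def pick_main_entrances(paths, distances, tv):
    """Return one main entrance per cluster: the median by path distance."""
    main = []
    for cluster in cluster_entrances(paths, tv):
        ordered = sorted(cluster, key=lambda e: distances[e])  # sequence SQ
        main.append(ordered[(len(ordered) - 1) // 2])          # lower median
    return main


paths = {
    "e1": ["06", "07", "03", "04", "05"],
    "e2": ["06", "07", "03", "04", "09"],
    "e3": ["06", "07", "03", "04", "10"],
    "e4": ["06", "11", "12"],
}
distances = {"e1": 400.0, "e2": 450.0, "e3": 380.0, "e4": 300.0}  # metres
print(cluster_entrances(paths, tv=0.7))               # [['e1', 'e2', 'e3'], ['e4']]
print(pick_main_entrances(paths, distances, tv=0.7))  # ['e1', 'e4']
```

An entrance that is dissimilar to every other entrance forms a singleton cluster and is therefore marked as a main entrance, which matches the rs < tv rule.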
Finally, the semantic dictionary and population flow simulation are combined to obtain the main entrance and exit elements (Figure 10). The population aggregation preference model constructed in this study considers the semantic characteristics of the entrance and exit elements to distinguish their different attributes, and it also considers the shape, area, perimeter, and other factors related to the BAUs to distinguish their different spatial characteristics. The main entrance and exit elements with important effects on the local population concentration are then obtained.

Construction of FPAZs based on microstructural elements
Given the spatial population aggregation characteristics, we obtained the FPAZs by further dividing the BAUs based on the results determined using the method in Section 3.3. Considering the effects of the microstructural elements on the spatial scope of the population, the steps required to divide the FPAZs are as follows.
(a) Extract the microstructural elements in the BAU, such as internal roads and artificial lakes. The unit set M (Figure 11) is obtained based on the spatial topology process and the boundaries of the BAUs.
(b) According to the set of the main entrance and exit elements, E = {e_1, e_2, e_3, …, e_h} (Figure 11), we set an attribute named EntryC for each unit of M. If a unit contains exactly one main entrance and exit element, set EntryC to the ID number of that element; otherwise, if a unit contains more than one or no main entrance and exit elements, calculate the Euclidean distance between the centroid of the unit and each main entrance, and set EntryC to the ID number with the shortest distance.
(c) Based on the classification results, the units in M with the same EntryC are spatially fused, and all of the fused units together form the FPAZs in the BAU (Figure 12).
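The EntryC assignment in step (b) and the grouping in step (c) can be sketched in Python as below. The sketch works on centroids and entrance coordinates only and returns groups of unit IDs; the actual geometric fusion would be done with a GIS library. The data layout, function names, and coordinates are hypothetical.

```python
from collections import defaultdict
from math import dist


def assign_entry_c(units, entrances):
    """Assign EntryC to every micro unit.

    units maps a unit ID to {"centroid": (x, y), "entrances": [IDs inside]};
    entrances maps a main entrance/exit ID to its (x, y) position.
    """
    entry_c = {}
    for uid, unit in units.items():
        inside = unit["entrances"]
        if len(inside) == 1:
            # Exactly one main entrance inside the unit: use its ID directly.
            entry_c[uid] = inside[0]
        else:
            # None or several: fall back to the entrance nearest the centroid.
            entry_c[uid] = min(
                entrances, key=lambda e: dist(unit["centroid"], entrances[e])
            )
    return entry_c


def group_units(entry_c):
    """Collect units sharing an EntryC; each group would be fused into one FPAZ."""
    groups = defaultdict(list)
    for uid, entrance_id in entry_c.items():
        groups[entrance_id].append(uid)
    return dict(groups)


entrances = {"e1": (0.0, 0.0), "e2": (10.0, 0.0)}
units = {
    "m1": {"centroid": (1.0, 1.0), "entrances": ["e1"]},
    "m2": {"centroid": (4.0, 2.0), "entrances": []},
    "m3": {"centroid": (8.0, 1.0), "entrances": []},
    "m4": {"centroid": (6.0, 0.0), "entrances": ["e1", "e2"]},
}
entry_c = assign_entry_c(units, entrances)
print(group_units(entry_c))  # {'e1': ['m1', 'm2'], 'e2': ['m3', 'm4']}
```

Note that grouping by EntryC alone does not guarantee spatial contiguity of the fused units; whether contiguity is enforced is not stated in the text.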
In summary, the FPAZ division method first defined the spatial scale and basic division range according to the large and medium scale urban form elements, before further segmentation based on the population aggregation preference model and microstructural elements. This method considered the effects of urban forms as well as the changing distribution characteristics and long-term stability characteristics of the population aggregation pattern. Moreover, due to the limitations of urban form elements, FPAZs have different area values and irregular shapes, which can better reflect the heterogeneity of urban space. The method is also adaptable to data changes. If the number of entrances and exits changes, only the affected BAU needs to be re-divided into new FPAZs by following the method from Section 3.2 onward, so changes in data do not lead to large-scale re-division work. In addition, if more accurate urban form elements are supplied to the model, the BAU division results can be optimized and updated: the updated urban form elements can be superimposed on the existing BAUs to identify the intersecting BAUs and re-divide them.

Evaluation
The adaptability of urban forms and the capacity to represent population distributions and changes were considered to construct FPAZs. In this study, urban area function identification, static population distribution estimation, and dynamic population prediction were conducted to compare FPAZs and other types of spatial units.

Urban area function identification
Urban area function identification classifies land use attributes. Functional areas are planned by the government based on the urban structure and are jointly determined by the actual land use conditions, so they are an important reflection of urban forms (Chi et al. 2016). According to the urban land classification standards for China, the functional areas can be divided into the following six categories: residential area, public management and service area, commercial service area, industrial area, road and traffic facility area, and green and park area (Li, Zheng, and Chao 2020). Following a mature method used in China, we calculated the proportion of POIs weighted by public cognition (Table 2) and assigned each area to the function category whose POI types have the highest proportion (Li, …). In formula (4), Fsc is the statistical frequency of a functional category, i is an index of the POI types belonging to this functional category, P_i is the number of POIs of type i, Aw_i is the public recognition value corresponding to POI type i, j is an index of functional categories, and Fsr is the proportion of the statistical frequency of a single functional category. If Fsr exceeds 0.5, the area is identified as a single-function area; otherwise, it is identified as a mixed functional area.
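Because the formulas themselves are not reproduced here, the following Python sketch shows only one plausible reading of the variable definitions: Fsc for category j is taken as the sum of P_i × Aw_i over the POI types in that category, and Fsr as Fsc divided by the sum of Fsc over all categories. The POI types, the weights, and the function name are hypothetical; the 0.5 threshold comes from the text.

```python
def identify_function(poi_counts, recognition, category_of, threshold=0.5):
    """Classify one spatial unit as a single-function or mixed area.

    poi_counts: POI type -> number of POIs of that type in the unit (P_i)
    recognition: POI type -> public recognition value (Aw_i)
    category_of: POI type -> functional category
    """
    fsc = {}
    for poi_type, count in poi_counts.items():
        category = category_of[poi_type]
        # Assumed form: Fsc_j = sum over i of P_i * Aw_i
        fsc[category] = fsc.get(category, 0.0) + count * recognition[poi_type]

    total = sum(fsc.values())
    if total == 0:
        return "unknown", {}    # no POIs in the unit

    # Assumed form: Fsr_j = Fsc_j / sum over j of Fsc_j
    fsr = {category: value / total for category, value in fsc.items()}
    top = max(fsr, key=fsr.get)
    label = top if fsr[top] > threshold else "mixed"
    return label, fsr


# Hypothetical POI types and recognition values (Table 2 holds the real ones).
category_of = {"apartment": "residential", "mall": "commercial",
               "restaurant": "commercial", "factory": "industrial"}
recognition = {"apartment": 1.0, "mall": 2.0, "restaurant": 0.5, "factory": 1.0}

label_a, fsr_a = identify_function(
    {"apartment": 6, "mall": 1, "restaurant": 4}, recognition, category_of)
label_b, fsr_b = identify_function(
    {"apartment": 4, "mall": 1, "restaurant": 4, "factory": 2},
    recognition, category_of)
print(label_a, label_b)  # residential mixed
```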

Static population distribution estimation and dynamic population prediction
Static population distribution is important data for government management and decision-making (Ye et al. 2019). Previous studies have shown that population distribution information can represent urban spatial forms as well as reflect the correlations between various types of geographical elements and populations. Random forest is the most common model used for population distribution estimation and it has been implemented in WorldPop, a representative population dataset (Bai et al. 2018). In this study, the random forest model and POI data were used in evaluation experiments to compare the capacities of different PAUs for representing population distribution characteristics with the same modeling data. We trained the model using features constructed at the township level together with township-level census data. Because of the change in spatial scale, the model outputs are population weights for the different PAUs, and the district census data are then allocated according to these weight values with formula (5).

\[ \mathrm{Pop}_m = \mathrm{DisPop}_n \times \mathrm{Weight}_m \qquad (5) \]

In formula (5), $\mathrm{Pop}_m$ is the estimated population in a PAU, m is the index of the PAU, $\mathrm{DisPop}_n$ is the census data of the district corresponding to this PAU, n is the index of the district, and $\mathrm{Weight}_m$ is the PAU weight value calculated by the model. The estimated results were verified by using the mean Average Precision (mAP) and mean absolute error (MAE) as indicators.

Dynamic population prediction is currently a difficult problem in urban population research, and stable population changes are highly significant for pattern analysis (Chen, Pei et al. 2018a). Therefore, we performed dynamic population prediction to verify the stationarity of different PAUs with changes in the population time series. The Autoregressive Integrated Moving Average (ARIMA) model is a suitable time series prediction model for stationary series. In this study, a series constructed using mobile phone location data was predicted with the ARIMA model, and comparisons were conducted using the root mean square error (RMSE) and mean absolute percentage error (MAPE) as performance indicators.
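The allocation step in formula (5) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the model weights are normalized within each district so that the allocated populations sum to the district census total, which the text does not state explicitly. All identifiers and numbers are hypothetical.

```python
from typing import Dict, List


def allocate_population(
    district_pop: Dict[str, float],
    unit_district: Dict[str, str],
    unit_weight: Dict[str, float],
) -> Dict[str, float]:
    """Distribute district census totals to PAUs in proportion to model weights."""
    # Sum of raw weights per district (assumed normalization step).
    weight_sum: Dict[str, float] = {}
    for unit, weight in unit_weight.items():
        district = unit_district[unit]
        weight_sum[district] = weight_sum.get(district, 0.0) + weight

    estimates: Dict[str, float] = {}
    for unit, weight in unit_weight.items():
        district = unit_district[unit]
        share = weight / weight_sum[district] if weight_sum[district] > 0 else 0.0
        estimates[unit] = district_pop[district] * share  # Pop_m = DisPop_n * Weight_m
    return estimates


def mean_absolute_error(true: List[float], pred: List[float]) -> float:
    """MAE between reference and estimated populations."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


# Hypothetical district with three PAUs.
district_pop = {"D1": 1000.0}
unit_district = {"u1": "D1", "u2": "D1", "u3": "D1"}
unit_weight = {"u1": 2.0, "u2": 1.0, "u3": 1.0}

est = allocate_population(district_pop, unit_district, unit_weight)
print(est)
print(mean_absolute_error([480.0, 260.0, 260.0], [est["u1"], est["u2"], est["u3"]]))
```

Normalizing within each district preserves the census total of that district, so the estimates can be re-aggregated to any coarser unit, such as the township scale, for verification.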

Study area
We selected Shanghai in China as the experimental area. Shanghai is a municipality and a megacity in China, and a center of economic, financial, trade, shipping, and technological innovation. According to the results of the Sixth National Census completed in November 2010, Shanghai's permanent resident population was 23.0191 million in 16 districts (Figure 13).
Shanghai has a complex urban form and many functional types (Sun and Wei 2014). As a megalopolis, the spatial structure of Shanghai has a multi-center, multi-regional, and diversified development pattern, with a complex spatial and temporal distribution, and a changing population pattern (Zheng 2015). Therefore, it is necessary to conduct detailed, accurate, and reasonable information acquisition and pattern analysis for the population of Shanghai.

Data
Geospatial data were used in this study, including POI data, road network data, and water data provided by Gaode, which is the leading navigation, digital map, and location service solution provider in China (Table 3).
POI data were important for the division and verification of FPAZs in this study. In total, 678,433 POI records were extracted by spatially overlaying the POI data with the administrative region vector data of the study area. The POI data are divided into 14 categories by Gaode (Table 4).
Road network and water data were also employed for dividing FPAZs. In total, 777,740 road network records were extracted from the study area, which belonged to eight road grade categories and 10 road attribute categories (Table 5). Each road type is defined by its grade and attribute. The spatial resolution of the road data is finer than the township level, so the data have good spatial fineness and timeliness. In addition, about 4985 water records were extracted.
Mobile positioning data (MPD) were also important data sources used in this study. The MPD data were provided by a big data advertising and operation company in Shanghai. MPD records comprised the user's real-time latitude and longitude positions obtained using a smartphone app, and the records were collected and aggregated from advertisement modules inserted into mobile apps.
We obtained about 60 million mobile phone location data records from the central area of Shanghai from December 1, 2019, to December 31, 2019, with four fields comprising the user ID, longitude, latitude, and timestamp, where the positioning accuracy was 10 m. We calculated the number of daily active users based on the MPD records (Figure 14). The average number of daily active users was about 700,000, i.e. about 15% of the permanent population.
Figure 13. Districts of Shanghai. The red area is the center of Shanghai.

BAU construction
In this study, about 250,000 elements of the main roads above the county level and 800 polygons of the main water system were extracted in Shanghai (Figure 15(a) and (b)), and about 50,000 spatial units were constructed. According to the method described in Section 3.1, we extracted relevant elements from POI data for traffic and storage types together with the catalog of urban road facilities (Table 6).
Together with the extracted point elements, about 30,800 polygons containing roads, water systems, and greenery were filtered, and 96.4% of the overall research area was retained. Finally, about 19,000 BAUs were obtained (Figure 15(c)).

FPAZ division results
Using the method described in Section 3.3, about 45,000 exit and entrance elements were matched with POI data for Shanghai according to the semantic dictionary, where about 8,000 elements were filtered out using the word 'Pedestrian' and repeated positional information. In addition, combined with the attributes of 'parking lot connection' and 'POI connection' in the road network attribute category, the boundary intersection points of internal roads and BAUs were extracted, and about 39,000 entrance and exit elements were finally obtained (Figure 16). The population preference model was constructed using these data. First, 159 main entry and exit elements were labeled by combining the direction and classification categories in the semantic dictionary.
Next, according to the method described in Section 3.3, two traffic stations located in the east and west directions comprising Shanghai Hongqiao Railway Station and Shanghai Pudong Airport, respectively, were selected as the starting points for the population flow simulation. All of the entrance and exit elements were treated as terminal points, and roads above the county level were used as the network for the population flow simulation. In total, about 78,000 population flow paths were constructed by using the shortest path search algorithm. The flow path similarity was calculated using a perceived difference distance of 400 m, based on the distance that urban residents walk to public facilities, and the similarity threshold was set at 0.95. This yielded a total of about 6,000 BAUs containing more than two main entrances and exits. Using the FPAZ division method described in Section 3.4, about 38,000 FPAZs were obtained (Figure 16). In general, the median FPAZ area is approximately equal to that of a 260 m grid cell. Furthermore, the average area of FPAZs located in the central districts of Shanghai corresponds to grid cells of approximately 150 to 300 m, and that of other FPAZs to grid cells of approximately 350 to 500 m.
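The comparison between irregular FPAZ areas and grid sizes can be made concrete with a short sketch. It assumes that a size expressed in "m grid cells" means the side length of a square with the same area as the polygon, which is an interpretation rather than something the text states. The areas below are made-up values.

```python
import math
import statistics


def equivalent_grid_side(area_m2: float) -> float:
    """Side length (m) of a square grid cell with the same area as the polygon."""
    return math.sqrt(area_m2)


# Hypothetical polygon areas in square metres.
areas = [22500.0, 67600.0, 160000.0]
sides = [equivalent_grid_side(a) for a in areas]
median_side = statistics.median(sides)
print(sides, median_side)
```

Expressing polygon sizes this way makes it possible to pick grids of a similar scale, such as the 400-m and 600-m grids used in the comparisons.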

Functional areas identification
The method described in Section 4.1 was used to identify the functional areas in FPAZs, BAUs, all road units (ALRs), and similar scale geographic grids. The identification results for each type of PAU are shown in Table 7. Given the regular characteristics of population distributions and changes, a clear area function is helpful for accurately determining patterns in the population distribution and changes. According to the proportion of single functional areas in index 4, FPAZs contained 57.35% of the total, which was higher than those for the grids and ALRs, but slightly lower than that for the BAUs, thereby indicating that the FPAZs contained a relatively high proportion of single functional areas and  the spatial accuracy was significantly improved. Considering the fit to the urban form, compared with a similar scale and according to the area of POIs contained in index 1, the FPAZs covered more than 74% of the urban area, whereas the 400-m grids only contained 59%. In addition, according to indexes 2 and 4, the FPAZs accounted for 77.5% of the single functional areas in the regions containing POIs, which was slightly lower than that using the 400-m grids, but the single functional areas accounted for about 10% more of the overall region. Thus, the FPAZs had a higher coverage rate for the functional areas identified based on POI data and they were also influenced by the urban forms. Compared with the large-scale units, the proportion of BAUs containing index 1 was   relatively high at 93%, whereas that with the 600-m grids was slightly lower at 71.76%. However, in the region containing POIs with index 2, compared with the BAUs, the FPAZs significantly increased the proportion to 77.5%, which was slightly higher than that with the 600-m grids, thereby indicating that the FPAZs also had a good fit to the urban form at the micro-scale, and this reflected the heterogeneity of the urban functions. 
Compared with the small-scale units, ALRs accounted for 79.2% of index 2, which was slightly higher than that for the FPAZs. However, ALRs accounted for 56% of index 4, which was less than that for the FPAZs, thereby indicating that ALRs did not fit the urban forms better, but instead the identification coverage rate was reduced in functional areas. In addition, the overall proportion of single functional areas with index 4 in the grid at two scales was lower than those for the other three types of units. Moreover, as the grid size decreased, the coverage by index 4 decreased by about 8%, and thus the regular geographic grid was not suitable for representing urban micro-scale area functions.
To further explore the fits of the FPAZs to urban forms, the spatial distributions of the mixed functional areas were represented as thermal maps of the polygonal centroids. The hot spot search radius was set to 800 m. As shown in Figures 17(a), (c), and (d), the mixed functional areas in the FPAZs, BAUs, and ALRs were concentrated in the central area of Shanghai, including the districts of Jing'an, Huangpu, and Hongkou, and some areas of Changning, Xuhui, Putuo, and Yangpu. These districts are the core areas of Shanghai with diversified urban functions, and these results are consistent with the planning, design, and actual construction of Shanghai. However, the results obtained with the grids indicated a random distribution state in the center of Shanghai (Figure 17(b)), where the phenomenon of area functional fragmentation was caused by the regular segmentation grid. Compared with the BAUs, the FPAZ results contained more high-value hot spots in the spatial distribution (Figure 17(a) and (c)), with obvious spatial heterogeneity. Furthermore, the distribution of hot spots obtained with the ALRs did not differ significantly from that obtained with the FPAZs (Figure 17(d)).
Thus, our results confirmed that the FPAZs not only conformed to the urban spatial forms and provided fine-grained spatial objects, but were also better at distinguishing urban functions, thereby facilitating the mining and analysis of population patterns at the micro-scale, and achieving a good balance between the spatial scale and area functional representation.

Static population distribution estimation
According to the method described in Section 4.2, we used the kernel density values of POIs in Shanghai as the features and combined them with the census data for all towns as inputs for training the random forest model. The radius used for the kernel density analysis was 600 m. Then the population distribution data were obtained according to the weight values and the census data of 16 districts. Finally, the results were summarized at the township scale for verification. We considered BAUs, ALRs, and multi-scale grids as the units for comparison, and the estimated results are shown in Table 8 and Figure 18.
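As an illustration of how POI kernel density features with a 600 m radius might be built, the sketch below evaluates a quartic kernel at a unit centroid for each POI category. The quartic kernel, the use of centroids as evaluation points, and the projected coordinates in metres are all assumptions, since the text does not specify the kernel function or the sampling scheme. The category names and coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def quartic_kernel_density(x: float, y: float, points: List[Point], radius: float = 600.0) -> float:
    """Quartic kernel density at (x, y), in points per square metre.

    Each point contributes 3 / (pi * r^2) * (1 - d^2 / r^2)^2 when d < r,
    so one point integrates to 1 over its search disk.
    """
    r2 = radius * radius
    peak = 3.0 / (math.pi * r2)
    density = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 < r2:
            density += peak * (1.0 - d2 / r2) ** 2
    return density


def unit_features(centroid: Point, pois_by_category: Dict[str, List[Point]], radius: float = 600.0) -> Dict[str, float]:
    """One density feature per POI category, evaluated at the unit centroid."""
    x, y = centroid
    return {cat: quartic_kernel_density(x, y, pts, radius) for cat, pts in pois_by_category.items()}


# Hypothetical POIs in projected coordinates (metres).
pois = {
    "catering": [(0.0, 0.0), (300.0, 0.0)],
    "education": [(700.0, 0.0)],  # outside the 600 m search radius
}
features = unit_features((0.0, 0.0), pois)
print(features)
```

Feature vectors of this kind, one per spatial unit, could then be passed to any random forest implementation together with the census counts used for training.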
Given the representation of the regularity of the population distribution, the results showed that the accuracy of the FPAZs was highest at 70%. In the same area, we compared the MAE at the township level with those of representative population datasets such as GPW, LandScan, CNPOP, and WorldPop. The accuracy obtained with FPAZs was much better, even though only POI data were used for modeling (Table 9). The accuracy and spatial refinement were significantly improved compared with those using BAUs. The ALRs had a finer spatial resolution but the accuracy of the estimations was lower than that with the FPAZs (Table 8). Figure 18(c) shows that the results estimated with ALRs were skewed to the left of the trend line. In particular, the values estimated with ALRs were lower than the true values for 135 streets, compared with 120 streets for the FPAZs. This difference occurred because every spatial unit in the ALRs only contained a few categories of the POI density characteristics, and the nonlinear fitting characteristics of the random forest model resulted in the allocation of higher weights to areas with only small populations. The spatial continuity of the population distribution was thereby lost, and thus the populations were underestimated in more streets. For the grids at various scales, the best estimation accuracy was 66.8% with 600-m grids, whose spatial scale was larger than that of the FPAZs. Therefore, for the same modeling data, FPAZs were better at representing the population distribution, in a similar manner to the functional area identification results. These differences were closely related to the spatial distribution of POIs.
To further explore the effects of the urban forms and spatial distributions of POIs, the experimental results obtained using FPAZs, ALRs, and 400-m grids were combined with the mAP indexes. We also converted the POI data into a thermal map to allow their superimposition (Figure 19). Figure 19(a) shows that the results estimated with FPAZs had a circular distribution, and the streets with poor estimation accuracy were mainly distributed in the city center and suburbs. The grids yielded streets with similar accuracy but there was an obvious clustering phenomenon (Figure 19(b)). Compared with FPAZs, the accuracy of ALRs was higher in the central urban area but significantly lower in the suburban areas of the city (Figure 19(c)). According to the statistics for the Jing'an, Hongkou, and Huangpu districts in the city center, the accuracy of FPAZs is about 63% and that of ALRs is 72%. The thermal map of the spatial distribution of POIs showed that the static population distribution was related not only to the properties of the urban geographic big data but also to their spatial characteristics. The spatial distributions of FPAZs and POIs were highly consistent, and regions with excessively high or low POI densities led to deviations in the population estimations (Figure 19(a)). Geographic grids are regularized units that do not conform to the actual spatial distributions of POIs and they can only be applied for pseudo-fine scale data fitting, and thus the accuracy was similar in the urban center and suburbs (Figure 19(b)). The over-refinement of the ALRs readily led to population underestimates. Therefore, the estimation accuracy was improved in the central area of the city where the POI density was high, but the estimation error increased in the suburbs of the city where the POI density was low (Figure 19(c)).
In summary, the results presented above confirm that the FPAZs and POI data had strong spatial correlations, and thus it was better to combine urban geographic big data related to the spatial distribution and urban form to effectively represent the characteristics of the population distribution, thereby supporting the acquisition of accurate population distribution information.

Dynamic population prediction
The experimental area considered for dynamic population prediction included the districts of Changning, Putuo, Jing'an, Hongkou, and Huangpu. The permanent resident population is about 4.6 million, which accounts for one-fifth of the permanent resident population of Shanghai. We applied the method described in Section 4.2, where the ARIMA training data comprised 360   Table 10. Given the regularity of the population changes, the RMSE values showed that compared with the PAUs at a similar scale, the RMSE with FPAZs on weekdays was 9.57 and 7.44, and the RMSE at weekends was 7.15. These results were better than those with the two types of geographic grids. When FPAZs were compared with using BAUs, the results obtained with FPAZs were better on weekdays and on the weekend, where they improved the spatial resolution and the accuracy of the predictions. Considering the effects of the urban forms, the RMSE results predicted with BAUs were 10.59 and 8.08 on workdays and 8.03 on the weekend, which was better than those with the two types of grids, thereby indicating the important effects of the urban forms on the stability of the population changes. According to the MAPE indicator, the FPAZs generally performed worse than the two types of geographic grids and BAUs, except for the working day on December 16. Based on these results, we extracted the sample number distribution for each type of PAU ( Figure 20).
Figure 20(c) shows that the sample counts for FPAZs tended to be in the range of 10-30. MAPE is more sensitive when the true value is smaller, and thus the performance of FPAZs was poor according to the MAPE results. In addition, the time steps at which the true value was 0 were masked in the MAPE calculation, so the MAPE did not cover the same time steps as the other indicator. Therefore, our results indicated that the RMSE index was more suitable for representing differences in the population change.
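The sensitivity of MAPE to small counts can be seen in a small numerical sketch. The two series below are invented for illustration; both have the same absolute error of 5 at every time step, and the MAPE implementation skips time steps with a true value of 0, which is an assumed way of handling the division by zero.

```python
import math
from typing import List


def rmse(true: List[float], pred: List[float]) -> float:
    """Root mean square error over all time steps."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mape(true: List[float], pred: List[float]) -> float:
    """Mean absolute percentage error (%), skipping steps where the true value is 0."""
    pairs = [(t, p) for t, p in zip(true, pred) if t != 0]
    if not pairs:
        return float("nan")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


# Small-count unit (similar in magnitude to the 10-30 sample range) and a large-count unit.
true_small, pred_small = [10.0, 20.0, 0.0, 30.0], [15.0, 25.0, 5.0, 35.0]
true_large, pred_large = [100.0, 200.0, 300.0, 400.0], [105.0, 205.0, 305.0, 405.0]

print(rmse(true_small, pred_small), mape(true_small, pred_small))
print(rmse(true_large, pred_large), mape(true_large, pred_large))
```

Both series have the same RMSE, but the MAPE of the small-count series is more than ten times larger, and the zero-valued time step contributes to the RMSE while being ignored by the MAPE.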
We also considered the effects of urban forms on the formation of functional areas and the subsequent influence on the stability of the population changes. The results predicted using the FPAZs and two types of geographic grids were characterized using the RMSE index ( Figure 21).
Considering the overall error distribution, Figure 21(a) shows that the error distribution was relatively even with FPAZs, where the low and high error values were uniformly distributed in each area. The error distribution obtained using the grids exhibited obvious spatial continuity (Figure 21(c) and (d)), where the low values were mainly concentrated in Putuo District and the high values in Huangpu District. To further consider the error distribution at the micro-scale, we extracted the typical units with high error values on December 16 and used OpenStreetMap to obtain the functional attributes by manually inspecting the visual maps (Figure 22).
The high prediction errors with FPAZs were mainly distributed in the areas with highly random population changes, and the typical regional functions included parks, squares, and universities (Figure 22(a)). Thus, the FPAZs could effectively distinguish between the units with regular and random population changes. The high-error units in the geographic grids were characterized by spatial aggregation (Figure 22(b) and (c)). Because the grids divided the urban spatial regions in a regular manner, the functions of these units were difficult to determine. These units were characterized by highly random population changes, and they also affected the population changes in the surrounding units (Figure 22(b) and (c)).
Overall, from the perspective of unit division and expressing the characteristics of population changes, the RMSE results described above demonstrated that the temporal population changes determined within FPAZs were more stable, which can support the mining of more accurate rules. Meanwhile, FPAZs are more in line with the form of the city and can be updated according to changes in urban form elements. Moreover, they could be naturally combined with the regional functional attributes to support analyses of population change and patterns, thereby facilitating research into population dynamics at the micro-scale.

Conclusion
Previous studies have paid little attention to the construction of PAUs, especially in the context of micro-scale population management, and few studies have addressed unit construction from the perspective of population change characteristics. Thus, in this study, we constructed a method based on the aggregation patterns present in population flows to address this problem, thereby allowing us to successfully obtain suitable PAUs for representing the changes in population distributions and flows at the micro-scale. By integrating urban form elements and constructing a population aggregation preference model, we divided FPAZs. Compared with other types of PAUs, we showed that FPAZs obtained better performance in the following three applications: urban functional area identification (single functional area ratio: 57.35%), static population distribution estimation (mAP: 70%), and dynamic population time series prediction (RMSE: 7.44). Further analyses of the results showed that the spatial distributions of the identified urban functional areas, population distribution estimation and population dynamic prediction results obtained with FPAZs were more consistent with the actual spatial distributions of the urban planning and research data.
However, this study has some limitations and unconsidered factors. For example, although our model has data adaptability, it does not fully consider the impact of the spatial distribution characteristics of urban form elements on population distribution. The population distribution may not follow the junction point theory, and there may be more spatial heterogeneity and unevenness in developing countries because their cities have not formed a stable spatial structure. Another factor is that the road networks may differ from the boundaries of some functional areas, although the fine granularity of the road networks largely compensates for this. Future work will focus on these issues. For densely populated and built-up urban areas, more kinds of urban form elements will be obtained to improve the model, such as the boundaries of functional areas, areas of interest, and building groups. We will also consider combining population mobility data such as floating car trajectory data or social media positioning data to optimize the acquisition of entry and exit elements, which can help distinguish more heterogeneous regions and partially solve the unevenness problem. Moreover, we will apply the FPAZs to practical problems, thereby facilitating analyses and explorations of the spatiotemporal patterns in population changes to help policymakers to manage populations and conduct urban planning more scientifically.

Glossary
PM2.5: fine particulate matter
PAU: Population Analysis Unit
TAZ: Traffic Analysis Zone
BAU: Basic Analysis Unit
ALR: All Roads Unit
FPAZ: Fine Population Analysis Zone

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This research was supported by the National Natural Science Foundation of China (U20A2091, 41771426).

Data availability statement
Data not available due to legal restrictions.
Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.