Reconstruction of incomplete public transportation check-out records by heuristic approaching

Abstract Smart cards, i.e. integrated circuit cards, are frequently used to pay transportation fares in many regions, including the United States, the European Union and Korea. Smart card transaction histories in public transportation have been used to shape diverse traffic policies. Unfortunately, the raw data contain artifacts and incomplete information, so preprocessing is required before useful information can be extracted. Paired data, i.e. a boarding and its associated alighting, are one of the key pieces of information required to estimate city traffic demand. However, paired data are difficult to gather, mainly because card tagging policies and data collection methods vary from city to city and from institution to institution. Broken trip links between a boarding and its associated alighting are the most frequent form of incomplete data in smart card histories. Although huge budgets are spent on annual traffic surveys in many countries, no accurate smart card data analysis method covers all city cases, owing to high data complexity and limited base information. Of these two restrictions, limited information is the more difficult to resolve because inference must cover a large codomain, thousands of stops, from tiny samples. This hampers the application of well-known high-performance approaches such as deep learning and probability analysis. We propose a reconstruction method for incomplete trip pairs that uses a customized clustering algorithm based on the coordinates of the stops where transactions occurred, user trip histories and actual vehicle movement trajectories. The proposed method is derived from evaluating each passenger's historical clusters and trip patterns. In a blind test, it achieves 79.1% accuracy in predicting missing alighting stops, whereas conventional methods reach only 42.3%.


Introduction
Advances in science and technology make it possible to collect massive amounts of data related to people's daily lives. With the emergence of compact data storage and high-speed internet services, logging systems have developed dramatically. Public transportation systems, card key access systems, credit card payment systems and other systems record logs for security, data analysis and many other purposes. Systematically collected log data contain hidden information, but the logs must be well organized and complete before that information can be extracted. In reality, data always contain artifacts, biased data and partially broken data. GPS data, for instance, show high variance depending on the environment. The measured signal often deviates considerably from the actual position due to the multipath effect, especially when measured on open land next to a river or a lake. When building trajectories from GPS data, such inaccuracies are often corrected by a map matching algorithm that adjusts erroneous points by referencing map data. Similarly, the reconstruction of incomplete data is a necessary step for any data mining.
The introduction of electronic fare media is a revolution of the 21st century in the public transportation field (Gallaire, 1998). The Radio-Frequency Identification (RFID) credit card was a pioneer of the contactless payment market, and bus, train and metro fare payment systems rapidly replaced tokens and paper tickets with electronic media. Each payment is logged with additional information such as the fare amount, location and time. Analysis of these log data has advanced policy making (Bagchi & White, 2005). In addition, an extensive literature review has examined how big data relates to the new green revolution (Wu et al., 2016), and a big data-based orchestration architecture has been proposed for promoting sustainable environmental development goals (Wu et al., 2018). It is also possible to forecast expected traffic based on such analysis. For example, many countries run websites for sharing transportation knowledge with the public. Transport for London offers resources such as the Public Transport Access Level (PTAL) and forecasting. The Regional Transportation Authority (RTAMS) of the USA also provides an index analogous to PTAL, the Transit Access Score, along with processed transportation data. However, the data obtained are normally biased toward boarding, or check-in, information due to both expected and unexpected situations. For instance, a boarding-tag-only policy leaves no logs about alighting stations. Further, there can be errors related to data storage, timestamps and communication. For such diverse reasons, it is normal to encounter broken links, shown in Figure 1, in the raw data. However, paired boarding and alighting information is valuable because it yields more accurate information about passengers' trips. A collection of unbroken, or linked, trips reveals the exact amount of traffic. Thus, many countries acquire such data through online and offline surveys, spending enormous budgets and time every year. Hence, there have been extensive studies to overcome the imbalance between check-in and check-out information.
Reconstruction of broken or unlinked trips is recognized as a challenging task in transportation analysis because it relies on assumptions about passengers' usual behaviors (Trepanier et al., 2007). The difficulty stems from various properties of transportation data, such as too few records per passenger for machine learning, unexpected sporadic trips, thousands of target stops, and partial route or stop changes. Work on this problem has continued actively since Barry et al. (2009) proposed trip-chain analysis, inferring bus boarding locations from timestamps of the Automated Fare Collection (AFC) system combined with bus operating schedules. A trip chain is a concept that sequentially links all of a user's trips in a day, and it was defined in the National Household Travel Survey (NHTS) in 2001. They applied two assumptions about people's movement behaviors, and around 90% of the survey results matched those hypotheses. Thereafter, trip-chain analysis was tested again for peak-hour traffic analysis (L. Zhang, 2007), and its accuracy was later verified via a household survey (Farzin, 2008). M. A. Munizaga and Palma (2012) constructed trip chains by checking, in priority order, alighting candidates selected on the basis of passengers' transit times. Those candidates are extracted from feasible scenarios, and subsequent studies followed this approach and validated its accuracy against complete records (Alsger et al., 2016; Nunes et al., 2015; Trepanier et al., 2007). Efforts toward similar goals are ongoing. Xiao Lei et al. (2012) tried to determine trip origins using a Markov chain-based Bayesian decision tree. Subsequently, Jung and Sohn (2017) tried to estimate transfer stops with a deep learning-based classification model in the mid-2010s, and they selected the correct stops with high accuracy among a limited number of distractors.
Generally, trip information reconstruction is done at the initial stage of trip data utilization. To identify the key roles of stations and commute ranges, trip chains were reconstructed using conventional assumptions and categorized by passengers' mobility patterns (Mohamed et al., 2016). Prediction based on long short-term memory (P-LSTM) helped to anticipate bus arrival times for developing a bus-dispatching model (Huang et al., 2019). The authors encoded sub-features such as weather, day of week and time segment, and fed them into the P-LSTM. In these studies too, broken trip chains were recovered in the earliest phases using the same conventional assumptions. Travel patterns are another major concern in transportation analysis, and some studies identify passengers' regular patterns by tensor decomposition (D. Zhang et al., 2015; Maeda et al., 2019). Passengers' social patterns can also be an interesting subject. Lu et al. (2021) derived a relationship between passengers' travel patterns and their social patterns using a new method for measuring the similarity of a pair of subway passengers.
In this study, we used smart card fare logs from Jeonju city in South Korea. The data contain three times more boarding histories than alighting ones, and sporadic timestamps make up more than half of the whole dataset. Most passengers in the data are not everyday commuters such as students and office staff. After several experiments, we concluded that conventional approaches perform poorly on data collected under such conditions. It is therefore better to rebuild trips with a distance-oriented clustering method that does not consider time patterns. Passengers' trip patterns are also analyzed and used to decide origin coordinates. Reconstruction in this paper does not mean perfect recovery of an itinerary but supplementation within a given tolerance.
The main contributions of this paper are summarized as follows: we (1) propose a reconstruction algorithm for incomplete trip data using real sparse datasets, (2) suggest a widely applicable distance-focused clustering method for the transportation data analysis field and (3) show a practical approach to large-scale data analysis from a new viewpoint.
The rest of this paper is organized as follows. Section 2 presents previous work on alighting stop estimation by spatial and temporal approaches and on clustering-based research, together with brief background theory. Section 3 describes the statistics of the dataset used in the experiment and how we reconstruct incomplete data. Section 4 reports accuracy from diverse viewpoints in comparison with existing methods. Finally, Section 5 discusses advantages and disadvantages, including future work.

Related work
Space and time are useful tools for solving biased trip data problems. He and Trepanier (2015) considered these two variables to predict the missing parts of broken trip chains. The implicit major premise is that people habitually use public transportation at nearby locations and at similar times on each trip. They combined spatial and temporal probabilities using Gaussian kernel density to estimate potential stops and recovered 11% more broken data than the classical algorithms. J. Zhao et al. (2017) also analyzed passengers' travel habits using smart card data from a spatio-temporal point of view. Such approaches achieve higher accuracy when sufficient records are available. Nevertheless, they do not perform well on extremely unbalanced datasets in which most check-in and check-out pairs are incomplete, which are the focus of this paper. Nunes et al. (2016) proposed an Origin-Destination (OD) estimation method using AFC, Automated Vehicle Location (AVL) and Google's General Transit Feed Specification (GTFS) data. GTFS is a set of text files containing typical transit information such as routes, operation schedules, stops and waiting times at each stop. Their algorithm is mainly based on the trip chain: it selects the most probable routes from GTFS and then matches stops by the average difference from the actual time. The algorithm achieved over 98% matching accuracy in their evaluation, but the preprocessed data had been filtered to records with more than two trips a day. Despite its high performance, its practical use is limited because single-line data are excluded. Meanwhile, clustering has been used to analyze public transportation data, especially to find user trip patterns and trip histories. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is typically chosen because it is robust to the geometric shapes of clusters and to noise. In addition, DBSCAN does not require a pre-defined cluster count as k-Means clustering does. Despite these merits, its results are strongly influenced by two parameters: a neighborhood distance (epsilon) and a minimum number of points. Parameter tuning requires trial and error to find the best values, and these values depend on city type and properties. This is because public transportation usage patterns are heavily affected by diverse factors such as population distribution, the core business of a neighborhood and the transportation system. Ingvardson and Nielsen (2019) found, through a large survey covering 12 cities over 15 years, that passengers' behaviors are influenced by trip time, accessibility, vehicle interval and fares. They also showed that younger and older passengers have different patterns stemming from habits formed in early life. Allen et al. (2019) analyzed the reasons for public transport satisfaction gaps from a psychological viewpoint. They derived three factors (functional, safety and hedonic) that determine customer satisfaction. They argued that there is a hierarchy of transit needs that can be explained by Maslow's Hierarchy of Needs, and presented case studies and survey results from Bus Rapid Transit (BRT) systems in South America as evidence. For these reasons, clustering methods that express individual user activity well are more suitable for transportation datasets.
Patterns in public transportation histories can be found and modeled mathematically using clustering and space-time approaches. J. Kim et al. (2017) illustrated how people use public transportation by introducing the Stickness Index (SI). They found that SI increases as travelers use more buses and that users prefer accustomed OD pairs. Xiaolei et al. (2013) derived the average trip time and the number of trips per day of citizens in Beijing, China, by splitting repeated OD pairs by time in the clustering results. Minh Kieu et al. (2013) adopted DBSCAN to identify spatially and temporally repeating patterns of public transportation passengers. They selected records satisfying two conditions: at least one trip a day, and more than 75% of trips on weekdays. The selected data help with clustering large log data and with finding habitual OD pairs, from which peak time zones are revealed. Chang and Zhao (2016) also used DBSCAN to obtain previous trip events, and Agard et al. (2006) chose k-Means clustering to create passenger groups for passenger behavior analysis.
Deep learning is not well suited when the analysis focuses on individual patterns. Deep learning shows outstanding performance in classification from complex combinations of variables. Jung and Sohn (2017) tried to infer missing alighting locations by supervised learning on land use data joined with labeled stops. Their approach implies that the reason people alight at specific stops is strongly linked to the facilities and business sectors of the surrounding land. They showed remarkable performance and the potential of machine learning for transportation data analysis. However, they did not consider that not all passengers alight at a specific stop for the same purpose.
The methods above commonly worked with data that were well organized in both qualitative and quantitative terms. Unfortunately, reality does not always offer neat and well-arranged raw data, and the quality of the intermediate data, i.e. cleanly preprocessed data, determines the quality of the analysis results.

Background
In general, data preparation for transport analysis consists of four steps: reading the raw data, organizing them, transforming them into the desired shape, and verifying the result (Henrickson et al., 2019). The following subsections briefly define clustering and two key terms for understanding the properties of transport data.

Clustering
Clustering is a well-known unsupervised classification method for grouping similar data from a desired viewpoint. The ultimate goal of clustering can be defined as separating or categorizing input according to similar features. Let $I = \{x_1, x_2, \ldots, x_n\}$ be an input set and let $C = \{C_1, C_2, \ldots, C_k\}$ with $C_i \subseteq I$ for $i = 1, \ldots, k$ be a cluster set. The clustered states are represented as,

Trip chain
In this paper, a trip is defined as a sequence of connected records, possibly comprising multiple modes, in which the time gap between two consecutive records is less than 30 minutes. This definition reflects the allowed transfer time of the city where the experimental data were collected. The term trip chain means (1) a link of consecutive trips in a day and (2) a method for estimating the missing destination of a trip record. Predictions with trip chains are based on two major assumptions: within a day's consecutive trips, people are likely to start the next trip at the destination of the previous trip, and the last trip of a day ends at the first boarding point of the same day (Barry et al., 2002). To build a trip chain, records are first sorted by two keys, cardID and transaction date. Passengers' consecutive movement histories are traceable from the sorted records since each cardID is unique to a card. Trips are chained from trip records of the same day, and if records are incomplete, stops are inferred from basic information such as routes and stops. Several assumptions in Tables 1 and 3 of Section 3.3.2 are used for alighting stop inference; however, the method has limitations in that two sequential trip records at the same stop are required and only single-day trips are treated. In addition, the definition of the same day affects trip chain building because some cities have transportation modes that operate past midnight. Therefore, M. Munizaga et al. (2014) pointed out that dividing trips simply by calendar day is unreasonable, and they suggested 4 a.m. as a new separation time, the period of least passenger activity according to the transaction records.
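The trip definition above can be illustrated with a minimal Python sketch. It assumes records are simple `(card_id, timestamp, stop_id)` tuples and uses the 4 a.m. day boundary of M. Munizaga et al. (2014); the function names and the record layout are hypothetical and are not taken from the original implementation.

```python
from datetime import datetime, timedelta
from itertools import groupby

TRANSFER_GAP = timedelta(minutes=30)  # transfer window stated in the text
DAY_START_HOUR = 4                    # assumed day boundary (4 a.m.)

def service_day(ts):
    """Records before 4 a.m. are counted as part of the previous day."""
    return (ts - timedelta(hours=DAY_START_HOUR)).date()

def build_trip_chains(records):
    """Group (card_id, timestamp, stop_id) records into daily trip chains.

    Returns {(card_id, service_day): [trip, ...]}, where each trip is a
    list of records whose consecutive time gaps are below TRANSFER_GAP.
    """
    # Sort by card and time so each card's history is contiguous.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    chains = {}
    for key, group in groupby(ordered, key=lambda r: (r[0], service_day(r[1]))):
        trips = []
        for rec in group:
            # Extend the current trip if the gap to its last record is short.
            if trips and rec[1] - trips[-1][-1][1] < TRANSFER_GAP:
                trips[-1].append(rec)
            else:
                trips.append([rec])
        chains[key] = trips
    return chains
```

With this layout, two tags 20 minutes apart fall into one trip, while a tag at 1 a.m. is still attached to the previous service day.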

Smartcard data
The following fields are common to smart card records: Route, Card ID, Card type, Bus ID, Route ID, Boarding and Alighting stop ID, Transaction date and time, and Direction (Chang & Zhao, 2016; Chen & Fan, 2018; Minh Kieu et al., 2013), as shown in Table 1.
Card identification numbers are encoded or encrypted for privacy, as recommended by local law, and are normally changed daily (K. Kim & Lee, 2017). However, the change period can differ depending on the card company. Because smart card transaction records are gathered simultaneously by a central server from the Bus Management System (BMS) units installed on each bus, the accumulated data are unordered; sorting the mixed records is thus the first step of analysis. We sorted our dataset by two keys, Card ID and Transaction date and time, to focus on individual transaction locations.

Proposed method
The traffic volume between regions or over short links is a basic resource for transportation policy decisions. Complete paired trip data are needed to compute traffic volume, but surveys are generally preferred over processing digital records because the records are imperfect. Although there have been many attempts to compensate for these flaws, a satisfactory solution has been hard to find. Most difficulties come from unpredictable variables such as passengers' unusual one-time trips, route sequences that differ by date for the same route ID due to irregular route changes, and the scarcity of data from individual users. In addition, the thousands of stops belonging to the routes of a city greatly enlarge the prediction range. For these reasons, the public transportation field has many NP-hard problems. Fan et al. (2009) proposed an evolutionary algorithm for solving the Urban Transit Routing Problem (UTRP), one of these NP-hard problems. Jo et al. (2007) adopted a genetic algorithm to solve another NP-hard problem, the fixed charge Transportation Problem (fcTP). In transportation engineering, a network means a complex of routes, and the goal of fcTP is to build networks without changing passengers' total fare. Panchal and Panchal (2015) also treated transportation problems as NP-hard and presented a way of searching for an optimal solution with a genetic algorithm. Traffic engineers therefore prefer heuristic or probabilistic methods as analysis tools for practical cases rather than theories dedicated to experimental settings. Sadeghi-Moghaddam et al. (2017) tried to solve a real-world fixed cost problem with two population-based metaheuristic algorithms. This paper also depends largely on the common assumptions listed below because the target dataset lacks information even more severely.
(1) Passengers are highly likely to move to areas that they have frequently visited before.
(2) Home is either the rank 1 or the rank 2 location in the visit history, and the work site, such as an office or a school, is also one of them.
(3) The stops of the first boarding and the last alighting of a day are close to home.
(4) People walk to destinations that are within 1 km.
(5) People normally stick to a habitually used route.
(6) An untagged previous alighting location, if any, is the area nearest to the next boarding stop.
These commonly used assumptions are also adopted in this paper.

Dataset
This study used one month of smart card records from Jeonju city in the Republic of Korea. Jeonju is one of the well-known tourist cities in South Korea and lies 194 km south of Seoul. Its area is 206.22 km², 651,000 people live there, and around 140,000 public transportation trips occur daily. In Korea's metropolitan cities, a fare discount policy for transfer passengers or a distance-based fare system is generally adopted, and both encourage passengers to tag their smart cards voluntarily when alighting. The transfer discount policy induces tagging by offering free transfers within a specified time and number of rides, and the distance-based fare system does the same by using check-in and check-out histories as the basis for charging. However, alighting tagging brings no benefit in rural mid-sized cities like Jeonju. For that reason, a large gap arises between the collected smart card check-out records and the actual number of alightings.
Data collected under such conditions have two critical problems for analysis. Table 2 briefly summarizes the data used in this paper. Each row of the data contains a single transaction record, either a boarding or an alighting. A single trip therefore requires at least two rows that together cover both actions. The unique transactions in Table 2 are those of passengers whose trip histories contain at most two different stops, regardless of the number of trips. Such data lack information and cannot be used for alighting estimation. The table shows that alighting records make up only one quarter of all boarding and alighting transactions. Therefore, around 2.7 million records would have to be discarded without a data recovery process, and a massive gap would remain between the actual traffic volume and the valid part of the tagging histories. The second critical problem is that each passenger has only a small trip history. Figure 2 shows the distribution of the number of transactions per individual in one month after removing the unique data. Users with a daily routine are assumed to have at least two boarding records a day, one leaving home and one returning, but 75% of users have fewer than 20 transactions a month. This distribution lowers the probability of successful inference in practical use.

Building a stop map
We chose a quad-tree as the data structure of the stop map for fast target searching in 2-dimensional space. The original location data are encoded in WGS 84 (EPSG:4326), the world geodetic system that serves as the reference coordinate system of the Global Positioning System (GPS). Coordinates in this system are expressed as longitude and latitude, so measuring the distance between two points requires a formula such as the haversine formula below, because locations lie on the curved surface of the Earth rather than on a plane (Wikipedia 20 October 2021).

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where $r$ is the radius of the sphere, $\varphi_1, \varphi_2$ are the latitudes of points 1 and 2 in radians, and $\lambda_1, \lambda_2$ are their longitudes in radians.
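As a minimal sketch of the haversine distance, the Python function below takes coordinates in degrees, as stored in WGS 84 records. The mean Earth radius constant and the function name are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))
```

Under this radius, one degree of latitude corresponds to roughly 111.2 km, which gives a quick sanity check for stop-to-stop distances.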

Construct target sets
Shin (2020) adopted the following strategies to compensate for data insufficiency in sparse datasets:
(1) Allowing data duplication and mixing boarding and alighting transactions.
(2) Including stops adjacent to those where tagging transactions occurred as raw data for clustering.
These ideas assist with estimation and clustering. Because of the additional nearby stop positions, cluster centers are not fixed on any specific stop, and data duplication assigns more positional data to the set for frequently visited places. Since the raw records in this paper are also sparse, we compensate for the lack of data using the same strategies. The stop map pre-built as a quad-tree is used to obtain nearby stops while expanding the raw data.
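The two expansion strategies can be sketched as follows. This is an illustration under stated assumptions: stop coordinates are already projected to planar meters, a brute-force scan stands in for the quad-tree query, and the 150 m adjacency radius and all function names are hypothetical.

```python
import math

def nearby_stops(stop_id, stops, radius_m):
    """Brute-force stand-in for the quad-tree query: other stops within radius_m."""
    x0, y0 = stops[stop_id]
    return [s for s, (x, y) in stops.items()
            if s != stop_id and math.hypot(x - x0, y - y0) <= radius_m]

def build_target_set(transactions, stops, radius_m=150.0):
    """Expand a user's tagged stops into the point set used for clustering.

    transactions: stop ids of boardings and alightings mixed together,
    with duplicates kept so that frequent stops carry more weight.
    stops: {stop_id: (x, y)} in planar meters (an assumption).
    """
    points = []
    for stop_id in transactions:
        points.append(stops[stop_id])        # strategy (1): keep duplicates
        for near in nearby_stops(stop_id, stops, radius_m):
            points.append(stops[near])       # strategy (2): add adjacent stops
    return points
```

A stop visited twice contributes its own coordinates and those of its neighbors twice, which pulls the cluster center toward frequently used areas rather than toward a single stop.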

Origin determination
The clusters of a user's transaction locations are closely related to the user's primary environment. To reconstruct broken trip chains, we clustered each user's transaction locations with various clustering methods such as DBSCAN and Mean Shift, and compared the coordinates of the stops on each route with the clusters by distance. In some cases, however, it is difficult to determine the alighting stop because the user's route passes through two or more clusters. In that case, we assumed, as Barry et al. (2002) suggested, that the last trip destination of a day is highly likely to be the first starting point of the same day. In other words, the first boarding stop and the last alighting stop of a day are near the user's origin, and the origin cluster can be determined from this premise. Clusters of user transaction locations can be calculated from $S_i$ by general clustering methods such as DBSCAN and Mean Shift, and the frequently visited places of user $i$ are extracted from the clustering results. Subsequently, a nearby point is taken from the result of clustering the union of the first trips of all days, $A_i$, and the last trips, $B_i$, where $k$ is the last trip of day $j$. The origin point is the cluster of $C^i_1$ nearest to the top-ranked cluster of $C^i_2$:

$$\mathrm{Origin}_i = \arg\min_{\mathrm{rank}\ r} D_i, \qquad 1 \le r \le \text{number of clusters in } C^i_1$$
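The origin selection can be sketched in Python as a nearest-center search. The sketch assumes that both cluster sets are already available as lists of center coordinates ordered by rank, and that the distance is planar Euclidean distance; the function and parameter names are hypothetical.

```python
import math

def origin_cluster_rank(all_centers, first_last_centers):
    """Return the 1-based rank of the origin cluster.

    all_centers: ranked centers of the clusters over all of a user's
    transaction locations.
    first_last_centers: ranked centers of the clusters over first
    boardings and last alightings of each day.
    The origin is the cluster in all_centers nearest to the top-ranked
    center of first_last_centers.
    """
    top = first_last_centers[0]
    nearest = min(range(len(all_centers)),
                  key=lambda idx: math.dist(all_centers[idx], top))
    return nearest + 1
```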

Phase 2. Trip record reconstruction
The reconstruction consists of three steps: (1) clustering all stop coordinates of each user, (2) identifying the origin cluster among the rank 1 and rank 2 clusters and (3) estimating untagged stops and then recovering broken links.

Customized clustering for data property
Steps (1) and (2) involve clustering, so selecting a clustering method that fits the requirements is essential. For instance, k-Means clustering is widely used for its simplicity and speed, but the algorithm cannot determine the number of clusters (k) by itself. That method therefore cannot retrieve passengers' individual trip behaviors without fixing the number of Points of Interest (POI). Other popular clustering algorithms are the density-based Mean Shift (MS) clustering and DBSCAN. They do not require the value of k, but artifacts, described in the next paragraph, can be introduced during processing. Meanwhile, passengers prefer daily travel routes that take less time when using public transportation (J. Kim et al., 2017). Therefore, stops beyond a certain range should not be considered as candidates if the bus route passes closer stops. For that reason, the clustering algorithm must be sensitive to the distance from the starting point of travel, i.e., the origin. Yet density-based clustering algorithms such as MS and DBSCAN cannot include or exclude data points from a cluster purely on the basis of distance, because of the kernel of MS and the minimum points of DBSCAN, even though both have Euclidean distance parameters. Figure 3 shows the different clustering results obtained by each method on the same coordinate data.
The bandwidth of MS and the epsilon of DBSCAN were both set to 150 m, but in Figure 3 MS sometimes separated points closer than 150 m and DBSCAN grouped points more than 150 m apart. We therefore designed a purely distance-based agglomerative solution oriented to the characteristics of the target data; in the same figure it combines all points as needed. The fundamental concept of our design is similar to single-linkage clustering, but some parts were modified to speed up the operation. The solution merges members represented as Euclidean Distance Matrices (EDM), and 2-dimensional data of n points are converted to an EDM following Dokmanic et al. (2015).
The Euclidean distance between points $x_i$ and $x_j$ is denoted $d_{ij}$. Let $r$ and $k$ denote the maximum distance between intra-cluster members and the total number of clusters, respectively.
where $\{i, j\} \cap \{i', j'\} \neq \emptyset$ and $1 \le l \le k$; $i'$ and $j'$ are adjacent indices of $i$ and $j$. Since the $x_i$ are points in 2D space, every $x_i$ has two coordinate values $(x_{i,0}, x_{i,1})$. $C_l^m$ denotes the $m$th member of $C_l$, with coordinates $(x_{C_l^m,0}, x_{C_l^m,1})$. Then the center coordinates $P_l$ of each cluster are

$$P_l = \frac{1}{S_l} \sum_{m=1}^{S_l} \left( x_{C_l^m,0},\; x_{C_l^m,1} \right)$$

where $S_l$ is the size of cluster $C_l$. Collecting all $d_{ij}$ into a distance matrix makes the matrix symmetric. Conventional single-linkage clustering keeps these duplicate values during its process. We modified it to form clusters using only the upper triangle of the matrix and built a cluster ID table to prevent repeated updates of the distance table. For fast clustering, the entries of the cluster ID table are assigned as reference values, i.e., pointers, and the referenced IDs are modified instead of changing all cluster IDs.
Separately, EDM-based clustering could also be used for such grouping or denoising processes, since EDMs are suited to manipulating distance data (Dokmanic et al., 2015).
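A minimal Python sketch of a purely distance-based grouping in this spirit is given below. It is not the authors' implementation: a union-find parent table stands in for the cluster ID table of references, the threshold `r` is applied to pairs of points as in single-linkage clustering cut at `r`, and coordinates are assumed to be planar meters. All names are hypothetical.

```python
import math

def distance_clusters(points, r):
    """Group 2D points so that any pair within distance r shares a cluster.

    Returns (clusters, centers): clusters are lists of point indices
    ranked by size (largest first), centers are their mean coordinates.
    """
    n = len(points)
    parent = list(range(n))  # cluster ID table: entries refer to other entries

    def find(i):
        # Follow references to the representative ID, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):  # upper triangle of the distance matrix only
            if math.dist(points[i], points[j]) <= r:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # re-point one ID instead of relabeling all

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(groups.values(), key=len, reverse=True)
    centers = [tuple(sum(points[m][d] for m in c) / len(c) for d in (0, 1))
               for c in clusters]
    return clusters, centers
```

Because merging only re-points one representative, no full pass over cluster labels is needed after each merge, which mirrors the motivation given for the reference-valued ID table.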

Alighting estimation
After the clusters of each user are found, the alighting stop is determined on the basis of three conditions, shown in Table 3. These are the major premises for building trip chains in transportation engineering. Of the three, conditions 2 and 3 are not applicable to this study because consecutive-day records are rare in the test data. Condition 1, however, was adopted because it is the most likely to yield exact stops when the history is recorded. M. A. Munizaga and Palma (2012) assumed that a user transferred to another bus during a trip if there was a stop within a 1 km radius of the next boarding stop. Passenger behavior in Jeonju does not differ from their findings: 10% of the paired (check-out tagged) records have up to two consecutive stops that are between 500 m and 1 km apart, and this was set as the search range.
The following selection rules are applied to the remaining trips, whose alighting stop is not decided by Condition 1.
(1) The last trip of the weekend heads to the origin (home).
(2) Weekday trips are mostly regular in the time domain, but differ between passengers.
We then check whether the boarding date is a weekday or a weekend day and whether the user's boarding route includes candidate stops from both the rank 1 and rank 2 clusters. Of the two, the cluster close to the locations where early timestamps were recorded is treated as the origin. If the day is a weekend day, the origin cluster is selected as the alighting point; otherwise a further judgment is needed. When deciding the origin of each user, all timestamps were placed on a single timeline, which we then cut near the median position to distinguish trip destinations. This time threshold is used as the alighting decision criterion for such ambiguous trips. For instance, if the user's boarding time is earlier than the center of the timeline, the trip destination is taken to be a work site or school. The candidate close to the origin can then be discarded, and the broken trip link is rebuilt with the remaining cluster. However, it is risky to fix the distinction by a time bound if a trip links the same day and the next day. M. Munizaga et al. (2014) explained this with example trips whose boarding and alighting transactions were split at midnight, and they proposed 4:00 am, within the lowest traffic volume period of a day, as a new criterion. However, traffic flows differ between countries and even between small cities. For that reason, we adopted a passenger-specific trip separation time solely for judging whether a trip is heading for the origin.
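The weekend rule and the passenger-specific time threshold can be sketched as follows. The sketch assumes the threshold is the median minute-of-day over a passenger's tagging timestamps and ignores trips that cross midnight; the function names and the `"origin"`/`"other"` labels are hypothetical.

```python
from datetime import datetime
from statistics import median

def minute_of_day(ts):
    return ts.hour * 60 + ts.minute

def time_threshold(history):
    """Passenger-specific cut: median minute-of-day of all tagging timestamps."""
    return median(minute_of_day(t) for t in history)

def choose_destination(boarding, history, origin="origin", other="other"):
    """Pick between the origin cluster and the other top-ranked cluster.

    Weekend boardings head to the origin; weekday boardings earlier than
    the passenger's threshold head away from it (work site or school).
    """
    if boarding.weekday() >= 5:  # Saturday or Sunday
        return origin
    if minute_of_day(boarding) < time_threshold(history):
        return other
    return origin
```

For a passenger who tags around 8 a.m. and 6 p.m., the threshold falls in the early afternoon, so a weekday morning boarding is assigned to the non-origin cluster.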
Algorithm 1 describes the broken trip chain reconstruction concept, and the related functions are listed in Algorithm 2. Base information such as trip records, passenger lists, route sequences, and clustering results is loaded first. The route_list is either the route sequence from the base information or the actual records of each vehicle sorted in ascending time order; we take the latter to reflect practical trajectories. Next, two consecutive trips are loaded and checked to see whether they were recorded on the same day. For same-day trips, the alighting stop is determined with condition 1 (c1) of Table 3. Since passengers walk to board the next trip in most c1 cases, the target stops are limited by the allowable walking distance (Trepanier et al., 2007; Z. Zhao et al., 2018) and assumption (6). The function get_alighting returns all stops within search range r of the target stop, its first parameter, in ascending order of distance. When several target stops are at the same distance, there is no priority between them, so one of them is selected at random as the result.
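The behavior described for get_alighting can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: stop coordinates are assumed to be planar (for example UTM metres), the `stops` dictionary and the `rng` parameter are hypothetical, and ties are broken by a random secondary sort key.

```python
import math
import random

def get_alighting(target, stops, r, rng=random):
    """Return IDs of stops within r metres of target, nearest first.

    target: (x, y) in planar metres; stops: dict stop_id -> (x, y).
    Stops at exactly the same distance are ordered at random.
    """
    tx, ty = target
    in_range = []
    for stop_id, (x, y) in stops.items():
        d = math.hypot(x - tx, y - ty)
        if d <= r:
            # The random number is only consulted when distances tie.
            in_range.append((d, rng.random(), stop_id))
    in_range.sort()
    return [stop_id for _, _, stop_id in in_range]

# Hypothetical stops around a next-boarding stop at the origin.
stops = {"A": (0.0, 100.0), "B": (0.0, 300.0), "C": (0.0, 2000.0)}
print(get_alighting((0.0, 0.0), stops, r=1000.0))
```

Taking the first element of the returned list corresponds to the `[0]` indexing used in Algorithm 1.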
If the timestamps of the two trips indicate different days, the case corresponds to conditions 2 and 3 (c2 and c3) of the conventional algorithm. However, the target stops should be inferred by other methods since there is no confidence that those two stops are correlated with each other. We compare the centers of all clusters of passenger p with the stops of the bus route that passenger p boarded. The function search_cluster implements this process, which has three branches. If both of the top two clusters yield a candidate, cluster ranks 1 and 2 are the candidates and the origin comparison process follows; if only one of them yields a candidate, that stop is returned; otherwise, all remaining clusters are checked. In the last case, the final destination is a stop within the allowable walking distance, taken in cluster ranking order.
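The three branches of search_cluster can be sketched as below. The sketch assumes planar coordinates in metres and represents the origin comparison as a caller-supplied function `pick_between_top_two`; that callback, the helper `nearest_on_route`, and the data layout are hypothetical choices made for illustration.

```python
import math

def search_cluster(ranked_centers, route_stops, r, pick_between_top_two):
    """ranked_centers: cluster centers (x, y), rank 1 first.
    route_stops: dict stop_id -> (x, y) for the boarded route.
    Returns a stop ID, or None if no cluster is within r of the route."""
    def nearest_on_route(center):
        best = None
        for stop_id, (x, y) in route_stops.items():
            d = math.hypot(x - center[0], y - center[1])
            if d <= r and (best is None or d < best[0]):
                best = (d, stop_id)
        return None if best is None else best[1]

    top = [s for s in map(nearest_on_route, ranked_centers[:2]) if s is not None]
    if len(top) == 2:
        # Both top clusters lie on the route: decide by origin comparison.
        return pick_between_top_two(top[0], top[1])
    if len(top) == 1:
        return top[0]
    # Neither top cluster matches: walk the remaining clusters in rank order.
    for center in ranked_centers[2:]:
        stop = nearest_on_route(center)
        if stop is not None:
            return stop
    return None

centers = [(0.0, 0.0), (5000.0, 0.0), (9000.0, 0.0)]
route = {"S1": (100.0, 0.0), "S2": (5100.0, 0.0)}
print(search_cluster(centers, route, 500.0, lambda home, other: other))
```

Returning None when nothing matches is a design choice of this sketch; it makes unresolved trips explicit instead of forcing a guess.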

An example of clustering
Figure 4 shows an example clustering result for one user. There are 52 stops (with duplicates), drawn as white circles in Figure 4. We increased them to 192 stops by appending the stops adjacent to those where transactions occurred; the appended points are marked with a cross embedded in the circle. The added stops prevent cluster centers from being biased toward certain bus stops and bind adjacent stops into the same cluster. Dashed circles illustrate clusters but do not indicate cluster sizes, and the number inside each circle is the rank of the cluster. Cluster 1 is defined as the origin cluster of the cardholder according to the home selection strategy described in 2.3. Cluster 2 thus represents the most frequently visited sites, such as work sites or schools.
Table 4 summarizes the members gathered in each cluster. We interpreted the share of each cluster as the likelihood of the passenger alighting at that cluster. Therefore, the target stop can be judged in cluster ranking order.
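As a minimal sketch of this interpretation, the share of each cluster can be computed from the cluster label of every stop point and then used as a ranking. The function name `cluster_shares` and the label values are hypothetical.

```python
from collections import Counter

def cluster_shares(labels):
    """labels: the cluster label of every stop point of one passenger.
    Returns [(label, share)] ordered from the largest share to the smallest."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n / total) for label, n in counts.most_common()]

# Hypothetical passenger with ten stop points in three clusters.
labels = ["home"] * 6 + ["work"] * 3 + ["gym"]
for rank, (label, share) in enumerate(cluster_shares(labels), start=1):
    print(rank, label, round(share, 2))
```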

Accuracy check
The goal of this paper is not to evaluate how well user behavior can be learned but to pick reasonable stop positions from the limited information given. Therefore, the inference ability was tested under a severe condition in which the alighting information was artificially removed from the dataset. The raw records were filtered into 430,443 clean pairs with complete base information, and all final alighting histories, including those of transfer trips, were then removed for the blind tests. The deleted alighting histories were used for validation. However, alighting records for transfers are kept, since passengers must leave these logs to receive a fare discount; that is, transfer records cannot be missing in such cases. Thus, each user's records are sparser in the blind test set, and reconstruction is harder than with the original set. Two reference methods were selected for comparison: (1) trip chain (TC) analysis and (2) the spatial and temporal (ST) approach. The former was chosen because trip chain analysis is the most effective strategy in trip estimation research, and the latter because it uses the same variable domains as this proposal. More recent techniques such as deep learning-based estimation were excluded because the raw data do not contain enough data per passenger for model training. Hereinafter, our proposed method is referred to as Infer from Cluster (ICL).
TC follows the strategies of Table 3, and the comparison algorithm (ST) employs kernel density estimation. It uses visited location records under the hypothesis that passengers tend to move along accustomed routes (He & Trepanier, 2015). These user behaviors carry both spatial and temporal histories, and the missing alighting stop locations are selected by combining their probabilities. To produce the probabilities, unbroken trip chains are filtered first, and then the locations and times at which the transactions occurred are separated. Subsequently, probability densities are generated with the Gaussian kernel function, and average values are calculated in the two dimensions. Let $i$, $j$, and $m$ be the smart card ID, a route, and the sequence of transaction records, respectively. Then $H^{j}_{i,m}$ is the set of trips and $N_m$ denotes the number of $H^{j}_{i,m}$ for the broken trip; the distance density $E_d(x)$ and the time density $E_t(x)$ are defined in (a), (b) and (c) below. Tests were executed with a newly implemented program based on these definitions. Finally, alighting stops are inferred from the set of products of the spatial density and the time density at every stop. This algorithm is effective at linking broken trip chains: it resolves as much as half of the trip chains left unresolved by conventional algorithms, including TC.
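The idea of multiplying a spatial density by a temporal density can be sketched in a few lines. This is only an illustration of the general technique, not the definitions (a) to (c) of the ST method: the choice of features (distance along the route and arrival time in minutes), the bandwidths `h_d` and `h_t`, and all function names are assumptions.

```python
import math

def gaussian_kde(x, samples, bandwidth):
    """Average Gaussian kernel density at x over the historical samples."""
    if not samples:
        return 0.0
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / len(samples)

def infer_alighting(candidates, hist_dist, hist_time, h_d, h_t):
    """candidates: dict stop_id -> (distance_m, arrival_minute).
    Returns the stop whose spatial density times temporal density is largest."""
    scores = {
        stop_id: gaussian_kde(d, hist_dist, h_d) * gaussian_kde(t, hist_time, h_t)
        for stop_id, (d, t) in candidates.items()
    }
    return max(scores, key=scores.get)

# Hypothetical history of unbroken trips: the passenger usually rides about 4 km.
hist_dist = [3900.0, 4100.0, 4050.0]
hist_time = [490.0, 500.0, 493.0]
candidates = {"A": (1000.0, 480.0), "B": (4000.0, 495.0)}
print(infer_alighting(candidates, hist_dist, hist_time, h_d=300.0, h_t=10.0))
```

Because only unbroken trips feed `hist_dist` and `hist_time`, a passenger with few complete pairs gives the estimator very little to work with.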
Figure 5 displays the inference accuracy counts and shows that the proposed algorithm, ICL, gives promising results even with limited data. One stop, used as the unit, corresponds to 450 meters, the average distance between all stops in Jeonju. Since the purpose of reconstruction is to better understand traffic volume for policymaking, this error bound is reasonable. In addition, the fact that around 40.3% of passengers tag in advance on the way to their destination further supports the error bound (K. Kim et al., 2014). The results for condition 1 (c1 in Figure 5) come from inference with the next boarding coordinates according to common premise (6) in section 3, and they are the same for all three methods. The total matched count equals the sum of this common count from c1 and each method's count from c2 and c3. Conditions 2 and 3 cover estimation for data that have no next-stop information, and the gaps in reconstruction ability depend mainly on these cases. Under conditions 2 and 3, TC tries to find missing candidate stops from the first boarding stop within a maximum of 2 days from the current boarding date, but records on two consecutive days are rare in our time-discontinuous test dataset of Jeonju. ST, on the other hand, performed somewhat better under the same conditions. This increase is attributed to the contribution of partial records with regular time patterns. However, the algorithm considers only unbroken trips as reference histories, so it is limited by the sparseness of the dataset itself. In contrast, ICL is free of such restrictions because it encompasses all transactions whether they are linked or not. This difference gave ICL outstanding overall performance compared with the other two algorithms, with up to a 47-fold gap over TC in the perfect matching result.
The proportions of each error bound, 1 stop apart and 2 stops apart, for the three methods are as follows. Of the TC results, 17.6% were perfect matches, 40.9% were correct within the 1-stop bound, and 41.4% were correct within the 2-stop bound. This means that over 80% of its estimations were off by one stop or more. ST had a slightly higher exact matching ratio than TC, at 25.5%, and the other two bounds split the remaining data almost half and half. ICL reached 53.8% for perfect matching, whereas the two conventional methods stayed under 50%. Furthermore, the results in the 2-stop range occupied just 13.9% for ICL, so 86.1% of its estimations fall within the 1-stop bound, compared with 58.6% for TC and 63.1% for ST. In Figure 6, the two bars represent the gaps relative to ICL, which increase steadily; moreover, the slopes of the accumulated accuracy gradually increase as the error bound expands. As a result, the accuracies accumulated up to the 2-stop error bound differ by almost 20%.
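The cumulative figures quoted above follow from simple arithmetic on the per-bound shares, which the short sketch below reproduces. The helper names are hypothetical, and only percentages stated in the text are used as inputs.

```python
def within_one_stop(two_stop_share):
    """Share (%) of estimations that are exact or one stop off."""
    return round(100.0 - two_stop_share, 1)

def one_stop_only(within_one, exact):
    """Share (%) of estimations that are exactly one stop off."""
    return round(within_one - exact, 1)

print("TC  within 1 stop:", within_one_stop(41.4))      # from the 2-stop share
print("ICL within 1 stop:", within_one_stop(13.9))
print("ICL one stop off :", one_stop_only(86.1, 53.8))
print("ST  one stop off :", one_stop_only(63.1, 25.5))
print("ST  two stops off:", round(100.0 - 63.1, 1))
```

For TC, the three quoted shares (17.6, 40.9 and 41.4) sum to 99.9 because of rounding, which is why 100 minus the 2-stop share gives 58.6 while 17.6 plus 40.9 gives 58.5.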

Accuracy by clusters
Examining the results, 86.3% of the estimations in ICL come from cluster ranks 1 and 2, and Figure 7 shows how the remaining 13.7% is spread from cluster rank 3 to the last cluster. Across all passengers, the number of correct estimations does not increase further from the 23rd cluster onward. On the other hand, the number of missed estimations rises as the cluster rank number increases. Consequently, Figure 7 strongly supports our assumption in 4.1 that the number of cluster members can be interpreted as the likelihood of a passenger visiting that cluster. Figure 7 also suggests that the number of places each passenger visits by public transportation is generally limited to around 10 locations, and that an early stopping point that balances time and accuracy also lies near 10 clusters.
When we group the coordinates of all transactions, point-to-point distances are used to decide cluster membership. Figure 8 shows the relationship between clustering range and accuracy for two different clustering algorithms, MS and ICL. Both algorithms peak at a clustering radius of 150 m, but the accuracy of ICL drops drastically by 250 m, whereas that of MS holds up to 400 m. We interpret the peak distance in two ways: the majority of passengers prefer to access stops within 150 m, and the stops of Jeonju are well organized for passenger access. We mentioned the artifacts of density-based clustering in section 3.3. The lingering accuracy of MS stems from these artifacts, so distortion might exist when the experimental results are interpreted. In contrast, ICL is more suitable for interpreting distance data since it is a purely distance-based algorithm. Consequently, the accuracy of ICL drops quickly before the clustering radius reaches 450 m, the average distance between stops in Jeonju, but it is slightly higher than that of MS for radii under 250 m.
In the real world, public transportation policies can change at any time for reasons such as operating efficiency. Moreover, all components of a traffic system change with the growth or decline of the city. Unless all associated information is stored at all times, uncontrollable data mismatches are inevitable. Because the bus stop and route information we used has a 1-year gap from the smart card data, the bus stop IDs do not match perfectly. This mismatch causes unrecoverable cases: about 20% of the estimations from methods 1 and 2 failed because of the poor ID table, and 22% of those from our proposal failed for the same reason. Figure 9 shows the net accuracy after removing those stop ID problems. From the results, we conclude that ICL would reconstruct around 80% of broken trips even though the collected data are still less than ideal. The short lines in the bars of Figure 9 indicate the total time consumption of each method. ICL is inferior to the others in processing speed, but it offers more insight into the data. We describe some benefits of interpreting data with clusters in the following section.

Discussion
This paper describes a way of reconstructing omitted alighting information by inference from clustering of spatial coordinates, and it showed promising results even though the smart card fare records are sparse and discontinuous in time. Much previous research did not cover such deficient records from rural cities, and AI-based learning algorithms have only been tested under limited conditions. For these reasons, there have been no satisfactory methods for resolving the problems faced in practical applications. Achieving such high accuracy with a very sparse dataset is an important result, given that governments spend large budgets every year to secure traffic volume data.
We achieved noticeably higher accuracy in broken chain recovery than existing methods in the blind test. ICL successfully linked about 80% of the incomplete trip chains even though the raw data contain too few records for the conventional algorithms. Moreover, ICL clusters purely by distance, so the clustering process produces no artifacts, and it is widely applicable to multimodal transportation data analysis, i.e., transfers between different transportation systems. Our distance-based clustering could also be applied wherever distance is the most important property of the data. It does, however, require a longer time to reconstruct broken trip chains than the other methods. The longer processing time comes from the two-pass structure of the proposed algorithm, clustering and evaluating, of which the evaluating pass takes more time. In return for the loss in processing time, our method gives distinctive insights into passengers' trip behavior. For example, the ratio of commuting trips to social and leisure trips can be obtained by treating the rank 1 and 2 clusters as home and workplace. Dividing trips into commuting and other trips offers a chance to analyze the economic status of the area. In Jeonju, almost 70% of bus use is for routine activities, and the remaining 30% is for other activities such as personal gatherings, exercise, and hobbies. Furthermore, the details could be analyzed in more depth by connecting bus stop locations with land use data. Regarding the relationship between accuracy and cluster rank, the higher time consumption would be mitigated if the clusters included in the estimation process were limited at the balancing point. At the same time, the error rate would decrease because the ranks where missed estimations increase would be excluded. We discarded data with invalid IDs in this study, but that obstacle could be overcome in future work by processing and combining actual vehicle trajectories.

Figure 1. Well-formed trip records are linked to each other through boarding and alighting tags, but broken links have no connecting information between trips.

Figure 3. DBSCAN groups undesired points together, whereas MS separates the same points as intended.

Algorithm 1: record reconstruction process
Data: pass_list = list of passengers on records
Data: route_list = actual passing sequences of routes on records
Data: cluster_list = list of clusters that belong to each passenger
Data: t_info = records of each passenger sorted in ascending time order
Data: origin_list = estimated rank number of origin cluster
r = search range
foreach passenger p in pass_list do
    foreach trip data pair tp in t_info of passenger p do
        if tp[0].day == tp[1].day then
            alight_stop = get_alighting(tp[1].boarding_stop)[0]
        else
            alight_stop = search_cluster(p, tp[0])
        end
    end
end

Algorithm 2: functions for reconstruction
Function search_cluster(p, tp)
    target = cluster_list[p]
    route = route_list[tp.boarding_route]
    candidates = get_alighting(target[:2])
    candidates = intersection(candidates, route)
    if len(candidates) == 2 then
        return after comparing boarding time, days of the week and origin_list[p]
    else if len(candidates) == 1 then
        return candidates[0]
    else
        candidates = get_alighting(target[2:])
        return candidates[0]
    end
end
Function get_alighting(target)
    foreach center c of target do
        s = stop list near center c within r
        append distance(c, s) to candidates
    end
    return sort(candidates)
end
Function distance(c, s)
    foreach stop of s do
        append distance between c and stop to array euc_dist
    end
    return min(euc_dist)
end

Figure 4. Clustering example of card ID 900,526,917,992. A total of 10 clusters were produced. The top two ranks are positioned on the west side of the map. The rank 1 cluster is selected as the home position by the origin decision process. The cluster farthest from the rank 1 cluster received the last rank, number 10.

Figure 6. Accuracies over a total of 403,443 blind test records. ICL achieved more than 20% higher accuracy than the other two methods.

Figure 5. Estimation success counts by allowed error range and by the conditions of Table 3. c1 denotes stops inferred from consecutive stops, and c2&3 denotes stops predicted from the last trip of a day.

Figure 7. Accuracies and missed-estimation trend for each cluster rank (exact matches only). Estimation comes mainly from the top two clusters and the remaining clusters play a supporting role, so we intentionally omitted the top clusters from this figure.

Figure 8. Trend of accuracy with increasing clustering radius for the two clustering methods, MS and ICL.

Figure 9. Accuracy of successful reconstruction and time spent. The proposed method achieves the highest accuracy but is the slowest of the three methods.
Rui and Wunsch (2005) organized clustering methods from various points of view. Clustering algorithms are roughly classified into hierarchical and partitional clustering. Hierarchical clustering operates in a bottom-up fashion by repeatedly agglomerating neighboring groups, whereas partitional clustering breaks the target data into a predetermined number of clusters. Clustering approaches for large-scale data sets such as transportation data can also be subdivided by working mechanism.
Clustering Large Applications (CLARA) and CURE rely on random sampling, and Clustering Large Applications based on Randomized Search (CLARANS) uses randomized search. A well-known condensation-based method is Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and density-based methods include DBSCAN and Mean Shift clustering. There is no single right answer, since clustering results can differ in the number and shape of clusters depending on the algorithm, and it is necessary to choose an algorithm suited to the particular purpose.

Table 1. Data fields and part of the smart card records. Card IDs are generally encrypted, and the check-in stop ID and check-out stop ID are in the same row in the target records of this study.
*Alighting stop ID may not exist depending on the policy or transaction device type of the city.

Table 3. The alighting stop is decided according to three conditions in the trip chaining method. If the first condition is satisfied, we decide the alighting stop as the trip chaining method does. For the second and third conditions, we use our method to determine the alighting stop.
Condition   Candidate stops   Meanings
$\lambda_1, \lambda_2$ are the longitudes of points 1 and 2 in radians.
Therefore, those location values are transformed into the Universal Transverse Mercator (UTM) orthogonal Cartesian coordinate system while we construct the stop map, so that the simple Euclidean distance formula can be applied directly to obtain distances from the map data. The Python package Pyproj is used for the coordinate transformation in this paper.
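To show what the projection saves, the sketch below contrasts a great-circle distance on latitude and longitude with a plain Euclidean distance on planar coordinates. It assumes that the formula referred to above is the haversine formula and uses a mean Earth radius of 6,371 km; both are assumptions, and the function names are hypothetical. The projection step itself (done with Pyproj in the paper) is not reproduced here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def euclidean_m(x1, y1, x2, y2):
    """Distance in metres between two points already in planar coordinates."""
    return math.hypot(x2 - x1, y2 - y1)

# Two hypothetical points 0.001 degrees of latitude apart.
print(round(haversine_m(35.0, 127.0, 35.001, 127.0), 1))
print(euclidean_m(0.0, 0.0, 300.0, 400.0))
```

Once all stops are stored in projected metres, every pairwise distance in the clustering step reduces to the second function, which avoids repeated trigonometric evaluation.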

The stop coordinates of each transaction spot of the $k$-th trip of user $i$ on day $j$ are denoted $s^{i}_{j,k}$, and the trip history $h^{i}_{d_t,n_t}$ is a tuple $(t_d(t), t_n(t))$, where $t_d(t)$ is the trip number of the day and $t_n(t)$ is the count of records on that day for user $i$. The total records of user $i$ can then be represented as $H^{i} = \{h^{i}_{d_1,n_1}, h^{i}_{d_2,n_2}, \ldots, h^{i}_{d_t,n_t}\}$.

Figure 2. Censored raw data distribution of a dataset. The white dot in the middle represents the median point. Around 75% of users have fewer than 20 transactions during a month, and the median is near 10 tagging records.