Analyzing factors on tourist movement predictability: a study based on social media data

ABSTRACT The ability to predict tourist movements has various practical applications, including recommendation, target marketing, and destination planning. Predictability determines the limit of the prediction accuracy of data and models and helps us understand the factors affecting the prediction accuracy. We first constructed a conceptual framework of factors influencing the predictability from three perspectives: tourist, destination, and space-time. In this study, we focused on factors affecting the tourist movement predictability using data collected from social media at the city level. We used two prediction models to understand the impact of the factors on predictability. We further analyzed the relationship between the factors and movement predictability. The results of this study demonstrate that the length of the tourist itinerary and the spatial scale of the study are key factors that influence model selection. In addition, the results indicate significant differences in the predictability of tourists with different tourism motivations.


Introduction
Understanding the predictability of tourist movement plays a fundamental role in tourism studies. The ability to predict tourist movement can be helpful in a wide range of applications, such as tourism recommendation, targeted advertisement, and transportation optimization. As location-based services are increasingly used in tourism applications, they have accumulated a massive amount of location data. These novel movement observations, including GPS data (G. Lau and McKercher 2006; Solomon et al. 2021; Xiao-Ting and Bi-Hu 2012; W. Zheng, Huang, and Li 2017), mobile phone data (Raun, Ahas, and Tiru 2016; Xu et al. 2021; X. Zhao et al. 2018), and social media data (Majid et al. 2013; Su et al. 2020; Y. Zheng et al. 2021), have shown that many factors shape the predictability and regularity of tourist movement.
Although several studies have focused on the movement predictability of people in their usual living areas, the movement predictability of tourists has not yet been widely discussed. Clearly, the movement predictability of city residents differs from that of tourists. Some studies have shown that certain factors affect the ability to predict tourist movements in destination cities, including distance (Xue and Zhang 2020), number of visits (McKercher et al. 2012), travel party size (X. Zhao et al. 2018), weather (McKercher et al. 2015), and socio-demographic characteristics (Cantis et al. 2016).
One reason for the lack of research on tourist movement predictability is that the data used in previous studies are either small in scale or contain little information about individual tourists. Survey data (East et al. 2017; Xia, Zeephongsekul, and Arrowsmith 2009; Xia, Zeephongsekul, and Packer 2011) contain more complete information about individual tourists but are limited by the heavy workload of manual collection and generally small sample sizes. Mobile phone data (Crivellari and Beinat 2020; Xu et al. 2022) offer sufficient data volume, but their location accuracy is affected by the distribution of the base stations. In addition, because telecommunication companies protect user privacy, the information available about individual tourists is limited.
Because the predictability of tourist movement is context-dependent (Xu et al. 2022), different prediction models may yield different interpretations. This study applied a deep learning method and a traditional prediction method to better understand the nature of tourists' movements.
The remainder of this paper is organized as follows. Section 2 overviews related work, including some studies in the field of computer and information science. Section 3 describes the prediction task step by step. In Section 4, we analyze the results of the prediction and evaluate its accuracy. Section 5 discusses the real-life applications of our prediction model. Finally, in Section 6, we conclude the paper and suggest possible improvements to this study.

Literature review
2.1. Tourist movement prediction
Fennell (1996) noted that the mapping and modeling of tourist movements was a worthwhile topic to explore. However, in the beginning, few researchers attempted to model the actual movement of tourists (Lew and McKercher 2006). It has been argued that such movements are, to some extent, fundamentally obvious; thus, this aspect of research has often been neglected (Haldrup 2004; Urry and Larsen 2011).
According to how tourists move, tourist movement can be divided into two types: inter-destination movement, which refers to movement from an area that generates tourists to one or more destinations, and intra-destination movement, which refers to where tourists go and where they are within a destination. G. Lau and McKercher (2006) first used geographic information system (GIS) spatial analysis methods to study tourist movement within destinations. Subsequent studies have been carried out by researchers at different scales, including cities and sites.
Research on cities as destinations mostly concerns tourism recommendation systems. From a recommendation system perspective, these studies liken a tourist's selection of the next attraction to a consumer's selection of goods, commonly using machine learning methods such as the Bayesian learning model (Subramaniyaswamy et al. 2015), support vector machine (X. Sun et al. 2019), and latent Dirichlet allocation (Shafqat and Byun 2019). With the further development of machine learning techniques, more research has recently been conducted using deep learning prediction methods (Ameen et al. 2020; Crivellari and Beinat 2020). However, the general idea remains similar to that of machine learning and does not incorporate tourist-specific characteristics into movement prediction.
Studies on tourist sites as tourist destinations began in 2009 (Xia, Zeephongsekul, and Arrowsmith 2009). Two years later, Xia, Zeephongsekul, and Packer (2011) proposed an improved method that incorporated a semi-Markov process based on the time dimension. These types of Markov models fail to capture long-term dependencies in tourist movements. Thus, W. Zheng, Huang, and Li (2017) developed a heuristic prediction algorithm (HPA) that considers the effects of historical locations on the prediction. However, the HPA still ignores tourists' personal information and the factors affecting tourist movement while mining trajectories.

Factors that influence tourist movements
Tourist movement prediction relies on the regularity of tourist movement. To obtain a deeper understanding of tourist movement for more accurate predictions, the influence of various factors must be analyzed. G. Lau and McKercher (2006) assumed that tourist movements can be affected by factors from three major aspects: human 'push', physical 'pull', and time factors. In addition, because the movement of tourists involves changes that occur simultaneously in time and space, we extend the temporal influences to spatiotemporal influences by combining changes in the spatial location. Therefore, our study divides the factors affecting tourist movement predictability into three categories: tourist, destination, and space-time, as shown in Figure 1.
From the perspective of the tourist, the factors primarily refer to individual differences. Because novelty and unfamiliarity are crucial for tourists, and because these two attributes of the tourist experience vary between individuals, tourists' individual differences are a main reason for the differences in individual movement. The difference between first-time and repeat visitors is a major topic. Oppermann (1997) noted that first-time visitors tend to visit more locations than repeat visitors. Subsequent studies confirmed this (Lehto, O'leary, and Morrison 2004; Smallwood, Beckley, and Moore 2012; Xia et al. 2008). In addition, distance was determined to be a critical factor related to tourists' movement. Wang, Little, and Del-Homme-Little (2012) determined that tourists stay longer if they live far from the destination. Xue and Zhang (2020) determined that tourists who live farther from a destination prefer the city's cultural heritage and famous sites. Other factors, such as gender (Xia et al. 2010), age (Driver 1974), travel party (X. Zhao et al. 2018), and cultural distance (Dejbakhsh, Arrowsmith, and Jackson 2011; Flognfeldt Jr. 1999), also affect the movement of tourists.
The factors from the perspective of the destination involve the attractiveness of the destination to tourists (Shen et al. 2023; Zhong, Sun, and Law 2019). The characteristics of the destination have a 'pull' effect on tourists, which influences their choice of itineraries within the destination and leads to different movement patterns. The uniqueness, variety, number, and distribution of attractions affect tourist movement (G. Lau and McKercher 2006). Transportation between destinations has also been proven to be closely connected to the areas visited by tourists (Le-Klähn et al. 2015). The emergence of shared bikes, which are a novel type of public transportation, affects the movement of tourists in tourist destinations (Y. Yang, Jiang, and Zhang 2021). In addition, the configuration of services and facilities in a tourist destination, such as hotel locations (W. Zheng et al. 2020) and guide centers, can also influence movement.
The spatiotemporal factors are divided into space and time factors. The factors of space primarily refer to tourist movement. Tourist movement in space leads to changes in tourists' locations. By connecting the location points with timestamps in sequential order, we can obtain the spatiotemporal trajectory of the tourist. Thereafter, by analyzing the trajectories of tourists, we can extract spatiotemporal features from the trajectories. These spatiotemporal features show the characteristics of the tourist's movement in space, and each movement is inseparable from time. Xu et al. (2022) examined the connection between the trajectory length and movement predictability. Tourists act as if they maximized the expected utility of the remainder of the trip (Västberg et al. 2020). Time also has a decisive impact. G. Lau and McKercher (2006) noted that the time scheduling and length of stay are the two main factors that influence tourist movement. Typically, longer stays result in a greater travel area (Koo, Wu, and Dwyer 2012). Le-Klähn et al. (2015) drew similar conclusions in a study on Munich. Lew and McKercher (2006) suggested that when a tourist travels within an urban area, the origin point of the tourist's morning trip and the destination point of the late afternoon trip may both be the tourist's residential area. On a larger time scale, holiday and seasonal factors are also important. For seasonal factors, Jankowski et al. (2010) found no clear seasonal dependency in the frequency and spatial direction of tourists' movements. By contrast, Yun, Kang, and Lee (2018) found meaningful differences in the spatiotemporal distribution of urban walking tourists by season using GPS data.

Study area and dataset
We used social media data collected from Suzhou, China. Suzhou is a city with five million residents in eastern China, west of Shanghai (Figure 2). With plentiful tourism resources, Suzhou received over 100 million domestic visitors a year before the COVID-19 pandemic. Suzhou is famous for its cultural and historical heritage. The most representative sites in Suzhou are the classical gardens, which were included on the World Heritage List in the last century. Ancient city sites in Suzhou cover an area of 14 square km. In addition to these historical sites, Suzhou embraces natural landscapes with lush mountains and gleaming lakes.
Social media data were collected mainly from location-based social network mobile phone applications. Sina Weibo is the most popular social media platform in China, with 340 million active monthly users (X. Liu and Hu 2019). The microblogs with location information posted by Sina Weibo users are typically classified into two types: check-in microblogs, which contain check-in information and geographical information, and geo-tagged microblogs, which contain only the location from which users post the microblogs.
In our study, we first selected all points of interest (POIs) related to tourism sites. Then, we used the application program interface provided by Sina Weibo to collect the check-in microblogs for the selected POIs. Following this, we collected the historical microblogs that the users who posted these check-ins had published in recent years. Upon acquiring users' historical microblogs, we filtered out those that were not within the area of Suzhou. As shown in Figure 2(d), the resulting dataset contains 470,041 users and 5,399,161 microblogs.

Tourist identification and trajectory extraction
For social media data (such as Twitter, Flickr, and Sina Weibo), only a portion of the users were actually involved in tourism activities. The purpose of a tourism activity could be entertainment or relaxation. However, it could also be an official or business visit. Both qualify as tourism activity, although only visitors traveling of their own volition were considered tourists in our study. In addition, we stipulated that tourists should not be locals, and that tourists must return to their residential city after the trip to the destination city. Therefore, during data preprocessing, we defined only those users who posted microblogs within the sites as tourists.
To collect only those users that fit our definition of a tourist, we first determined the location of the user's residential city. We investigated only overnight visitors instead of one-day visitors. We can then separate the user's travel segments. A residential city is an important node in a user's entire travel trajectory. We use the residential city as a split sign to divide a user's annual travel trajectory into several relatively short trips in which both the start and end nodes are the residential city. However, not all trips include a visit to a tourist site. Thus, we must determine whether the user visited tourist sites during each trip. A clustering method was used to obtain the boundaries of each site. If a user's microblog location was within the boundary of any tourist site, we assumed that the user was a tourist on this trip.
(1) Determine the user's residential city
A residential city represents a person's usual environment. Only some Sina Weibo users chose the right city when registering; for other users, the city information was inaccurate. We applied a simple method for determining a user's residential city based on the information entropy introduced by Claude Shannon (Shannon 1948). The city with the maximum entropy value can be regarded as the city of residence for each user.
For a given user U, the visited cities can be represented as a collection $C = \{c_1, c_2, \ldots, c_n\}$. For any city $c_k$ in collection $C$, the number of microblogs posted in city $c_k$ is $N_k$. We obtain the number $n_{k,m}$ of microblogs that the user posts in city $c_k$ in each month $m$, and calculate the entropy according to the following formula:

$$E_k = -\sum_{m=1}^{12} \frac{n_{k,m}}{N_k} \log \frac{n_{k,m}}{N_k}$$

where months with $n_{k,m} = 0$ contribute zero. The entropy value $E_k$ of each city $c_k$ is calculated, and the city corresponding to the maximum entropy value is selected as the residential city of the user. According to the principle of information entropy, the more balanced the data distribution is, the higher the entropy value is; the longer a user stays in his/her usual city, the more evenly the microblogs are distributed throughout the year, which corresponds to the highest information entropy value. If the user posts too few microblogs, no valid result can be obtained.
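The entropy-based selection described above can be sketched as follows. This is a minimal illustration, assuming that the entropy of a city is the Shannon entropy of the monthly shares n_{k,m}/N_k over 12 calendar months; the function names, the natural logarithm, and the `min_posts` threshold are hypothetical choices that the text does not specify.

```python
import math
from collections import defaultdict


def monthly_entropy(month_counts):
    """Shannon entropy (natural log) of one city's posts across months."""
    total = sum(month_counts)
    if total == 0:
        return 0.0
    # Months without posts contribute zero to the sum.
    return -sum((n / total) * math.log(n / total) for n in month_counts if n > 0)


def residential_city(posts, min_posts=5):
    """Pick the city with the highest monthly entropy for one user.

    posts: iterable of (city, month) pairs, month in 1..12.
    min_posts: hypothetical cut-off below which no result is returned.
    """
    posts = list(posts)
    if len(posts) < min_posts:
        return None
    counts = defaultdict(lambda: [0] * 12)
    for city, month in posts:
        counts[city][month - 1] += 1
    return max(counts, key=lambda city: monthly_entropy(counts[city]))


# A user who posts from Nanjing every month and from Suzhou only in May.
example_posts = [("Nanjing", m) for m in range(1, 13)] + [("Suzhou", 5)] * 6
home = residential_city(example_posts)
print(home)
```

In this toy case the Suzhou posts are concentrated in a single month and therefore have zero entropy, whereas the evenly spread Nanjing posts reach the maximum value log 12.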
(2) Split trajectories by trips
The user's movement trajectory over the course of a year can be considerably long. According to the definition of tourism from the UNWTO, a tourist must leave their usual environment to perform tourism activities. Therefore, we used the residential city as a split sign to break a long trajectory into short-term trips. In particular, we delimited a user's trips according to the following rules.
(a) Residential Rule
The residential location is the origin of a user's travel history. In our study, we assume that all users begin and end their trip at the residential city. The breakpoint of the entire trajectory is a residential city.
(b) Time Gap Rule
This rule handles the case in which users' microblogs are sparse. When the temporal distribution of a user's microblogs is not uniform, we cannot use the residential city as the only breakpoint. The residential city is assumed to be the city to which the user returns when the trip ends. If a user's two successive microblogs are both located in the destination city and the time between the two microblogs is more than three days, we consider these two microblogs to belong to different trips. We set the time threshold to three days based on typical railway travel times, as most railway trips between two cities take less than three days.
To clearly show how to divide the trajectory, we used the sequence shown in Figure 3. According to this sequence, the user has microblogs in the user's residential city on days 1, 9, and 12. In addition, no microblogs were posted on days 3, 7, 15, 17, 18, or 19. On the remaining days, microblogs were posted from the destination cities. Applying the two rules above to this sequence, we can obtain three trips:
(1) Start on day 2 and end on day 8. The user is in the residential city on days 1 and 9. Although the user posted no microblogs on days 3 or 7, with the time gap rule, we assume that this is still the same trip that began on day 2.
(2) Start on day 10 and end on day 11. The user returns to the residential city on day 12 and posts microblogs every day during the entire trip.
(3) Start on day 13 and end on day 16. The user posts a microblog on day 20; however, more than three days have passed since the last day with microblogs (day 16). Thus, according to the time gap rule, we assume that this microblog does not belong to the trip that began on day 13.
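The two splitting rules can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes posts are given as (day, city) pairs sorted by day, treats every non-residential city as "away", and uses hypothetical names (`split_trips`, `max_gap_days`).

```python
def split_trips(posts, home_city, max_gap_days=3):
    """Split one user's posts into trips.

    posts: list of (day, city) pairs sorted by day.
    Returns a list of trips, each a list of the days with away-from-home posts.
    """
    trips, current = [], []
    last_day = None
    for day, city in posts:
        if city == home_city:
            # Residential rule: a post from home closes the current trip.
            if current:
                trips.append(current)
            current = []
        else:
            # Time gap rule: a silence longer than max_gap_days starts a new trip.
            if current and day - last_day > max_gap_days:
                trips.append(current)
                current = []
            current.append(day)
            last_day = day
    if current:
        trips.append(current)
    return trips


# The sequence from the worked example: home on days 1, 9, 12;
# no posts on days 3, 7, 15, 17, 18, 19; away posts on all other days up to 20.
home_days = {1, 9, 12}
silent_days = {3, 7, 15, 17, 18, 19}
posts = [
    (d, "home" if d in home_days else "destination")
    for d in range(1, 21)
    if d not in silent_days
]
trips = split_trips(posts, "home")
print(trips)
```

Run on the example sequence, the sketch yields the three trips listed above (days 2 to 8, 10 to 11, and 13 to 16); the post on day 20 opens a further segment whose end is not observed in the sequence.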
These steps finally led to a total of 282,532 tourists who visited one or more of Suzhou's tourist sites, and the total number of tourism-related activities (posts) was 1,611,269. These tourists contributed 568,465 trips during the data collection period.

Location prediction models
Previous studies predicted people's next move by recognizing the regularity of historical movement. The Markov model has been frequently used to model human movement and predict the next movement (Asahara et al. 2011; Ashbrook and Starner 2002; Gambs, Killijian, and del Prado Cortez 2012; L. Song et al. 2004; J. Yang et al. 2014). The Markov model uses probability to model the location transition process of users. The Markov process assumes that every subsequent location depends only on the current location. Thus, based on the user's past transition probability between locations, the next location can be inferred by a probability calculation. However, the Markov model can establish connections only between adjacent locations, while ignoring the long-term dependence of locations.
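A first-order Markov predictor of this kind can be sketched as follows. The sketch assumes transition probabilities are estimated by simple counting over observed trajectories; the function names and the toy site labels are illustrative and do not come from the dataset.

```python
from collections import Counter, defaultdict


def fit_markov(trajectories):
    """Count transitions between consecutive sites in each trajectory."""
    transitions = defaultdict(Counter)
    for trajectory in trajectories:
        for current, nxt in zip(trajectory, trajectory[1:]):
            transitions[current][nxt] += 1
    return transitions


def predict_next(transitions, current, k=1):
    """Return the k most likely next sites with their estimated probabilities."""
    counts = transitions.get(current)
    if not counts:
        return []  # the current site was never observed as a starting point
    total = sum(counts.values())
    return [(site, n / total) for site, n in counts.most_common(k)]


# Toy trajectories with placeholder site labels.
toy_trajectories = [
    ["garden", "museum", "old_street"],
    ["garden", "museum"],
    ["garden", "old_street"],
    ["museum", "old_street"],
]
model = fit_markov(toy_trajectories)
print(predict_next(model, "garden", k=2))
```

Because only the current site enters the prediction, two tourists standing at the same site receive the same forecast regardless of where they were earlier, which is the limitation on long-term dependence noted above.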
Recently, deep learning has become a popular topic in almost all study areas. A recurrent neural network (RNN), which is a popular neural network, was first designed to process long sentences in text and speech tasks. Tourist trajectories are similar to text sentences because both are input as sequences. Long short-term memory (LSTM), a variant network architecture derived from the basic RNN, has been used in location prediction studies (Bao et al. 2021; Kong and Wu 2018; X. Song, Kanasugi, and Shibasaki 2016; K. Sun et al. 2020). Basic RNNs often suffer from vanishing or exploding gradients owing to their simple structure, and the time-step information they can remember is considerably limited. A gate structure was introduced in LSTM units, and this structure is effective in mitigating the gradient problem and in selectively remembering or forgetting historical time-step information; thus, it is called a long short-term memory network.
As shown in Figure 4, the LSTM model we used consists of an embedding layer, an LSTM layer, a fully connected layer, and a Softmax layer. The model used in this study is based on LSTM, to which we could add the tourist's personalized features. When dealing with a trajectory in the dataset, the first step involves uniformly aligning its length. Then, the sites within the trajectory are sequentially fed into the model and transformed into low-dimensional embedding vectors. To accomplish this, we use Word2Vec to embed the sites. Word2Vec is a language model that learns semantic knowledge in an unsupervised manner from large text corpora and is widely used in natural language processing (Le and Mikolov 2014; Mikolov et al. 2013; Rong 2014). In our model, we treat location names as words and use Word2Vec to train a vector representation for them. Subsequently, the embedding vector is optionally concatenated with the personalized features before being input into the LSTM layer. The output of each time step is then processed sequentially through the fully connected layer and the Softmax layer. Softmax turns the output of the neural network into a probability distribution over locations. The model uses the location with the highest probability as its prediction output.
During the training process, the prediction results of each time step are involved in the calculation of the loss, whereas in the testing process, only the prediction results of the last time step are used. This is because, in the training phase, autoregression over every time step makes maximum use of the data. By contrast, in the testing phase, when a user's historical trip has been determined, we focus only on the prediction result for the current location. The losses of the training and testing stages are illustrated in Figure 5, which indicates that our model is not overfitted.
To consider the effect of different models on tourist predictability, this study designs comparative experiments for the Markov and LSTM models to analyze how the length of the tourist trajectory and the distance traveled by the tourist within the destination affect the predictability.

Influential factors of movement prediction
Using techniques in representation learning, deep learning can learn and extract rich features from a wide variety of unstructured data such as text, images, and sounds. In the study of location prediction problems, previous researchers typically considered features other than the trajectory features in the dataset as input for the model, such as geographical (Feng et al. 2017; S. Zhao, King, and Lyu 2013), temporal (Gao et al. 2013; Q. Liu et al. 2016; Yuan et al. 2013), and demographic (Solomon et al. 2021) features. By adding these features, the accuracy of the model prediction was further improved. In this study, we used the factors influencing tourist movement predictability mentioned in Section 2.2 to analyze the impact of these factors on the predictability of models commonly used for location prediction. Using the dataset composed of social media content, we calculated and extracted different features from three perspectives: tourist, destination, and space-time. From the tourist perspective, we extracted the tourist's travel distance between the origin and destination (OD), number of visits, and gender. From the destination perspective, we tagged all 108 tourist sites using three types of labels. We introduce these labels in Section 4.2. From the spatiotemporal perspective, we first calculated the length of tourists' trajectories in Suzhou and the travel distance between sites (S) as features of spatial movement, and then extracted the temporal features of tourists using three scales: season, holiday, and time of arrival.
For the LSTM model, trajectory-related information can be input for the model to self-learn and mine potential movement patterns of tourists. Therefore, this study inputs tourist, destination, and spatiotemporal features into the LSTM model and analyzes the impact of the features on predictability. To make our results convincing, we ran the experiment 10 times for each factor. We split the dataset into 10 sets. For each run, we picked a different set for testing and used the remaining nine sets to train the model.
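The ten-run protocol corresponds to a standard 10-fold cross-validation, which can be sketched as follows. Shuffling before splitting and the fixed random seed are assumptions added for reproducibility; the text does not state how the 10 sets were formed, and `k_fold_splits` is a hypothetical helper name.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random assignment to folds
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Example with 100 placeholder trip ids.
splits = list(k_fold_splits(range(100), k=10))
for run, (train, test) in enumerate(splits, start=1):
    print(run, len(train), len(test))
```

Each trip appears in exactly one test set, so the ten accuracy values per factor are computed on disjoint test data.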

Evaluation metrics
Two metrics were used to evaluate prediction accuracy: Top@k and the mean reciprocal rank (MRR).
(1) Top@k - Top@k is a common evaluation metric in prediction and recommendation studies. In general, the output of a prediction model is a list of values ranked in the order of the predicted score. Thus, Top@k indicates that a prediction is considered correct as long as the true label is associated with one of the k highest predicted scores. Top@k is calculated as follows:

$$\mathrm{Top@}k = \frac{1}{|Test|} \sum_{i=1}^{|Test|} \sum_{j=1}^{k} \mathbf{1}\left(\hat{f}_{i,j} = y_i\right)$$

where $|Test|$ represents the number of test data, $\hat{f}_{i,j}$ is the predicted value for the $i$th sample of test data corresponding to the $j$th largest predicted score, and $y_i$ is the corresponding true value. $\mathbf{1}(x)$ is the indicator function: when the input is True, the output is 1; otherwise, the output is 0. In our study, we used Top@1 and Top@3 as evaluation metrics.
(2) Mean Reciprocal Rank (MRR) - This indicator considers the ranking of the true values in the prediction results, using the reciprocal of the rank as the score of each prediction. MRR is calculated as follows:

$$\mathrm{MRR} = \frac{1}{|Test|} \sum_{i=1}^{|Test|} \frac{1}{rank_i}$$

where $|Test|$ represents the number of test data, and $rank_i$ represents the rank of the true value in the prediction result for the $i$th test data point. The larger the value of MRR, the higher the ranking of the true value, and the better the prediction model.
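Both metrics can be computed directly from ranked prediction lists, as in the sketch below. It assumes each prediction is a list of candidate sites sorted from highest to lowest score, and that a true site missing from the list contributes zero to the MRR; the function names and toy labels are illustrative.

```python
def top_k_accuracy(ranked_predictions, truths, k):
    """Share of test cases whose true site is among the k best-scored candidates."""
    hits = sum(1 for ranked, y in zip(ranked_predictions, truths) if y in ranked[:k])
    return hits / len(truths)


def mean_reciprocal_rank(ranked_predictions, truths):
    """Average of 1/rank of the true site (rank 1 is the top prediction)."""
    total = 0.0
    for ranked, y in zip(ranked_predictions, truths):
        if y in ranked:
            total += 1.0 / (ranked.index(y) + 1)
        # assumed: a true site absent from the ranking adds nothing
    return total / len(truths)


# Three test cases whose true site "a" is ranked 1st, 2nd, and 3rd.
ranked_predictions = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
top1 = top_k_accuracy(ranked_predictions, truths, k=1)
top3 = top_k_accuracy(ranked_predictions, truths, k=3)
mrr = mean_reciprocal_rank(ranked_predictions, truths)
print(top1, top3, mrr)
```

In this toy case Top@1 is 1/3, Top@3 is 1, and MRR is (1 + 1/2 + 1/3)/3 = 11/18, which shows how MRR rewards near misses that Top@1 ignores.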

Predictability based on tourist factors
In this subsection, we analyze the impact of the tourist factors (as mentioned in Section 3.4) on predictability using the LSTM model. Using the methods mentioned in Section 3.2, we identified a tourist's city of residence. The number of tourist visits to Suzhou can also be determined from the tourist trajectories. The tourists' gender information can be acquired from their Sina Weibo profiles.
We acquired three features of tourists: travel distance (OD), number of visits, and gender.
The results imply that the model has different predictabilities for tourists with different features. As shown in Figure 6, this study further analyzed the prediction accuracy of the LSTM model for the different features of the tourists. To check whether the differences in the results were significant, the experiment for each tourist feature was run 10 times, with different training and test datasets each time.
• Travel distance (OD) - The prediction accuracy decreases as the travel distance (OD) increases. The distance from 0 to 100 km includes several large cities near Suzhou with strong economic ties, such as Shanghai, Wuxi, and Huzhou. The distance from 100 to 500 km covers the entire Jiangsu province and some areas in neighboring provinces. Tourists traveling less than 500 km are highly predictable, as shown in Figure 6(a). The distances from 500 to 1200 km cover most of the eastern coast of China, including megacities such as Beijing and Guangzhou. Tourists from these areas are less predictable. In addition, tourists from remote areas with distances over 1200 km were difficult to predict.
• Number of visits - When a tourist arrives in Suzhou for the first time, their predictability is consistently lower. However, once a tourist becomes a familiar visitor to Suzhou, their predictability increases significantly, up to twice that of a first-time visitor (Figure 6(b)).
• Gender - Gender is an important demographic characteristic. A difference in preference was observed between genders in choosing tourist sites. Based on the results using our dataset, as shown in Figure 6(c), the predictability of females is slightly higher than that of males.

Predictability based on destination factors
Similar to the tourist features in the previous subsection, we embedded the destination features into the model in the same manner. In this subsection, we consider the effects of destination factors on the predictability. Tourist sites are a core component of destinations. Therefore, based on all the sites visited by tourists in our dataset, we categorized the sites from three perspectives: site location (located in urban or rural areas), site type (natural, cultural, commercial), and site title (5A, 4A, UNESCO). Note that there are five levels of sites in China: 1A, 2A, 3A, 4A, and 5A. A 5A rating implies that a tourist site has the most beautiful scenery, the best service, and perfect facilities. The same experiments as in the previous subsection were then conducted, and Figure 7 shows the results.
• Site title - Site title reflects the tourist site's reputation in the domestic and international tourism market. More well-known tourist sites have a broader tourist market and attract a wider range of tourists. From Figure 7(a), we note that different site titles have similar predictability. Sites with the 4A title tend to have a higher predictability.
• Site type - The site type determines the main scene of the sites. Natural sites often refer to those sites with beautiful mountain scenery. Cultural sites refer to temples, gardens, and museums that have humanistic and historical features. Commercial sites refer to commercial complexes that evolved from historical old streets, featuring shopping, leisure, and dining functions. From Figure 7(b), we note that the commercial sites of Suzhou have a high predictability, for which the Top@1 is over 0.5. Natural sites were the second-most predictable. However, cultural sites, which are the main site type of Suzhou, have the lowest predictability.
• Site location - The circular highway around the urban area of Suzhou is called the Zhonghuan (Central Ring) Road. We used this highway as a dividing line, dividing Suzhou into urban and rural areas. The urban area covers 5% of Suzhou and contains 40% of the tourist sites. Figure 7(c) shows that the predictability of urban areas is significantly higher than that of the rural areas.

Predictability based on spatiotemporal factors
From the experiments in the previous subsections, we know that predictability varies with the tourists and destinations. Thus, in this section, we focus on more dynamic factors, including spatial and temporal factors. From the microblogs posted by tourists, we extracted three temporal features: time of arrival, holiday, and season. We expected the model to have different predictability for tourists with respect to these time factors. Therefore, the same analysis as in the previous subsections was performed, as shown in Figure 8.
From Figure 8, we determined the following:
• Time of arrival - Arrival time can reflect the type of tourist site to an extent; for example, tourists are mostly active in the daytime at sites with natural scenery. In Figure 8(a), tourists' behavior during the day is less predictable, whereas behavior at night is relatively highly predictable. Because few sites in Suzhou are suitable for nighttime visits, the smaller set of candidate sites may reduce the difficulty of prediction.
• Holiday/Workday - Holidays typically attract more tourists. However, in the dataset we used, this trend has an almost negligible impact on predictability. In other words, for tourist sites in Suzhou, the predictability of tourist movements does not differ significantly between workdays and holidays.
• Season - Several previous studies lacked long-term datasets, making analysis from a seasonal perspective impossible. As shown by the results of the experiment on our dataset (Figure 8(c)), some variation is observed across seasons. The predictability is relatively low in the spring and autumn. The predictability in summer is slightly higher than in these two seasons, and the most predictable season is winter.
Apart from the three time factors, we extracted the spatial movement characteristics of tourists from their trajectory data and analyzed the influence of these characteristics on predictability. Unlike the previous characteristics of tourists and destinations, the characteristics of spatial movement vary continuously. Therefore, we used the Markov and LSTM models as controls, representing the traditional prediction and deep learning models, respectively. We used the number of nodes in a tourist trajectory and the travel distance of the trajectory to examine how the predictability of the two models changed with these two aspects. Here, the number of nodes in the trajectory is the trajectory length. Figure 9 shows the distribution of the trajectory nodes. From the figure, 90% of tourists visit no more than six sites on one Suzhou trip.
Similar to the trajectory length, the travel distance (S) in this study is computed over the input nodes only: it is the sum of the distances between the sites corresponding to the trajectory nodes, excluding the last node of the entire trajectory (the last node is the prediction target). Because the distribution of tourists' travel distance (S) is close to a long-tailed distribution, we take the logarithm of the distance to better show the distribution of each distance interval, as shown in Figure 10. From the figure, 90% of tourists travel no more than 80 km on one Suzhou trip.
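The definition of the travel distance (S) can be illustrated with a minimal sketch. It assumes that each trajectory node is a (latitude, longitude) pair and that the distance between sites is the great-circle (haversine) distance; the study does not state which distance measure was used, and the function names here are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def travel_distance_s(trajectory):
    """Sum of distances between consecutive input nodes.

    The last node is the prediction target, so it is excluded before summing.
    """
    inputs = trajectory[:-1]
    return sum(haversine_km(p, q) for p, q in zip(inputs, inputs[1:]))


def log10_distance(s_km):
    """Log-transformed distance for plotting; undefined (None) for zero distance."""
    return math.log10(s_km) if s_km > 0 else None
```

A trajectory with only two nodes has a single input node, so its travel distance (S) is 0 under this definition and has no logarithm.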
We then explored how the trajectory length and the tourist travel distance (S) relate to predictability. First, for the tourist trajectory length, as shown in Figure 11, the prediction accuracy of the LSTM is clearly higher than that of the Markov model. Both models have lower prediction accuracy at 11 nodes, probably because of the small sample size at that length, as shown in the distribution chart (Figure 9).
As the trajectory length increases, the prediction accuracy of the LSTM gradually increases (except for the outlier at 11 nodes). The improvement is mainly reflected in the Top@3 and MRR metrics, whereas the improvement in the Top@1 metric is insignificant. This may indicate that the input trajectory length is not a decisive factor that directly affects the prediction accuracy. The prediction accuracy of the Markov model falls sharply at three nodes and then oscillates as the trajectory length increases, without any significant increase. These results suggest that the LSTM can learn the movement patterns of tourists more clearly from longer trajectories and thereby improve its prediction accuracy. The Markov model remains a good prediction model when the tourist trajectory is short, which is why many inter-site-scale tourist prediction models use Markov models and their variants (Xia, Zeephongsekul, and Arrowsmith 2009; Xia, Zeephongsekul, and Packer 2011).
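To make the three metrics concrete, the following sketch computes Top@k and MRR under their standard definitions, which we assume match those used in this study. Each prediction is a list of candidate sites ordered from most to least likely; the function names are illustrative.

```python
def rank_of_target(ranked_sites, target):
    """1-based rank of the true next site in a ranked list, or None if absent."""
    try:
        return ranked_sites.index(target) + 1
    except ValueError:
        return None


def top_at_k(predictions, targets, k):
    """Share of cases in which the true next site is among the first k candidates."""
    hits = sum(1 for ranked, t in zip(predictions, targets) if t in ranked[:k])
    return hits / len(targets)


def mean_reciprocal_rank(predictions, targets):
    """Average of 1/rank of the true next site; a missing target contributes 0."""
    total = 0.0
    for ranked, t in zip(predictions, targets):
        rank = rank_of_target(ranked, t)
        if rank is not None:
            total += 1.0 / rank
    return total / len(targets)
```

Because MRR and Top@3 credit a correct site that is ranked second or third, they can improve while Top@1 stays flat, which is the pattern described above for longer trajectories.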
For the variation in the tourist travel distance (S), we conducted a similar experiment, and Figure 12 shows the results. As the tourist travel distance (S) appears to follow a long-tailed distribution, the x-axis in the figure uses a logarithmic scale for distance, and the prediction accuracy is calculated separately for equal intervals on the logarithmic scale.
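The interval-wise calculation can be sketched as follows. This is an illustrative example rather than the procedure used to produce Figure 12: the bin width in log10 units is an assumed parameter, and the per-trip score can be any metric value (for example, a reciprocal rank).

```python
import math
from collections import defaultdict


def mean_by_log_bin(distances_km, scores, bin_width=0.5):
    """Average a per-trip score within equal-width bins of log10(distance).

    Returns a dict mapping (lower, upper) bin edges in log10(km) to the mean score.
    Trips with zero distance have no logarithm and are skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d, s in zip(distances_km, scores):
        if d <= 0:
            continue
        idx = math.floor(math.log10(d) / bin_width)
        sums[idx] += s
        counts[idx] += 1
    return {
        (idx * bin_width, (idx + 1) * bin_width): sums[idx] / counts[idx]
        for idx in sorted(sums)
    }
```

With a bin width of 1.0, trips of 1 km and 2 km fall into the bin from 10^0 to 10^1 km, while trips of 10 km and 50 km fall into the bin from 10^1 to 10^2 km.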
As shown in the figure, the prediction accuracy of the two models is close when the travel distance (S) of tourists is less than 1 km. However, when the distance traveled by tourists exceeds 1 km, the prediction accuracy of the Markov model gradually decreases and even converges to 0. In contrast, the LSTM is not affected by the increase in distance: its MRR is always above 0.4, and its prediction accuracy is also relatively high when the distance is less than 1 km. Such results may suggest that the LSTM has the potential to better establish connections between sites with longer trajectories than the Markov model and that the LSTM may uncover implicit connections between sites with longer geographical distances. This also suggests that black-box methods such as LSTM may be more capable of uncovering potential interconnections of tourist movements than white-box methods based on human cognitive perspectives.
In addition to the overall movement of tourists within a destination, we focused on the local movement of tourists between sites. To this end, we calculated the distance and MRR between the predicted site and its previous sites. Figure 13 shows the distributions of the logarithmic distance and MRR. The vertical line at each point represents the 80% confidence interval of the MRR distribution at that point.
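One simple way to obtain such a vertical band is sketched below. It assumes that the 80% interval is the central range between the 10th and 90th percentiles of the reciprocal ranks within one distance bin; the study does not specify how its interval was computed, so this is only one plausible reading.

```python
import statistics


def central_80_interval(reciprocal_ranks):
    """Return the (10th percentile, 90th percentile) of the values in one bin."""
    if len(reciprocal_ranks) < 2:
        raise ValueError("need at least two values to estimate percentiles")
    # n=10 yields the nine decile cut points; the first and last bound the central 80%.
    deciles = statistics.quantiles(reciprocal_ranks, n=10, method="inclusive")
    return deciles[0], deciles[-1]


def mrr_with_interval(reciprocal_ranks):
    """Mean reciprocal rank of one bin together with its central 80% band."""
    low, high = central_80_interval(reciprocal_ranks)
    return statistics.fmean(reciprocal_ranks), low, high
```

Applying `mrr_with_interval` to the reciprocal ranks of each logarithmic distance bin gives one point and one vertical line per bin.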
The distribution of the confidence intervals shows that, for equal logarithmic intervals, the predictability first rises and reaches an extreme value (0.35) at a distance of approximately $10^{0}$ to $10^{0.1}$ km (≈1-1.25 km). Subsequently, the predictability gradually decreases and approaches 0. From these results, we infer that the predictability of tourists is not monotonic with distance. That is, the range of tourist activity is closely related to the predictability of the tourists. When the activity range of tourists is small, the predictability is not too low because the number of sites within a small spatial area is also small; however, this does not imply that closer distances between two tourist sites improve the predictability of the tourists. When the range of tourists' activities gradually increases to approximately 1-1.25 km, the predictability of tourists is at its highest, and obtaining accurate results is easier when recommending sites to tourists. When the activity range increases further, tourists' choices expand, and the predictability falls toward 0. A possible reason for this threshold of optimal predictability is that it is the point at which most tourists change their mode of transportation: once this spatial range is exceeded, tourists switch from walking to transit or driving.

Predictability of models
Through several experiments, we determined that the models differ in the prediction accuracy they can achieve. The advantages and disadvantages of the different models are evident. The Markov model is a white-box model with straightforward assumptions, and it achieves good prediction results on a relatively small scale. However, when the model is confronted with longer and more complex trajectories, its prediction ability decreases substantially, as shown by the results in the previous section.
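The white-box character of the Markov model can be seen in a minimal sketch. The example assumes a first-order model, in which the next site depends only on the current site; the order and any smoothing used in this study are not restated here, so the details below are illustrative and the site labels are toy data.

```python
from collections import Counter, defaultdict


def fit_first_order_markov(trajectories):
    """Count observed transitions between consecutive sites in each trajectory."""
    transitions = defaultdict(Counter)
    for trajectory in trajectories:
        for current_site, next_site in zip(trajectory, trajectory[1:]):
            transitions[current_site][next_site] += 1
    return transitions


def rank_next_sites(transitions, current_site):
    """Rank candidate next sites by estimated transition probability.

    Ties are broken alphabetically; an unseen current site yields an empty list.
    """
    counts = transitions.get(current_site)
    if not counts:
        return []
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(site, n / total) for site, n in ranked]


# Toy trajectories with hypothetical site labels.
model = fit_first_order_markov([["A", "B", "C"], ["A", "B", "D"], ["A", "C"], ["B", "C"]])
```

Every prediction can be traced back to a transition count, which makes the model easy to interpret. The same property limits it: information from sites visited earlier in a long trajectory is ignored.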
In contrast, the LSTM model has particularly good prediction accuracy, even for long trajectory inputs. In addition to learning the hidden regularities of tourist movement in tourist trajectories, the LSTM can take into account the influence of other factors on tourist movement predictability. By adding features corresponding to the influencing factors, we determined that such features can further improve the prediction accuracy.
The length of a tourist's trajectory is not equal to the travel distance (S) of the tourist at the destination. The results indicate that the accuracy of the two models is similar for tourists with short travel distances (S). This suggests that employing complex prediction models such as deep learning models may not be advantageous when the spatial scale of the prediction is limited. Furthermore, tourists constrained by limited movement often opt for short stays and are commonly referred to as short-trip tourists. As a probability-driven model, the Markov model tends to output tourist sites with higher probabilities, which signify greater visitation and popularity. Therefore, the effectiveness of the Markov model for short-trip tourists suggests that its predictions resemble the recommendations that experienced individuals give to short-trip tourists, which highlight the most popular sites.
The practical significance of these findings is to provide tourism management with an example-based basis for selecting prediction models. Various models are now available for tourist prediction. For decision-makers in tourism management, choosing an appropriate model and method for the situation of the destination they manage is crucial. Because the datasets, study areas, and study scales used by different researchers differ, combining the theoretical and practical models proposed by previous tourism researchers, as done in this study, makes it easier for destination management organizations to understand the options and choose more suitable and smarter solutions.

Predictability of tourists and destination factors
The results of this experiment show that not all features improve the prediction results for a real-life dataset. Moreover, the contribution of a single feature to a deep learning prediction model is limited, and researchers cannot rely on certain features alone to improve the prediction accuracy of a model. For the factors whose features improve the prediction results when input into the models, the improvement is mainly reflected in the Top@1.
The factors from the tourist perspective show the effects of individual differences on predictability. In this study, we used three factors: travel distance (OD), number of visits, and gender. The predictability decreases as the travel distance (OD) increases. According to Xue and Zhang (2020), long-haul tourists prefer cultural sites, followed by natural and commercial sites. Even when considering cultural sites, the predictability remains low. Our results support the finding that long-haul tourists in Suzhou are more likely to choose cultural sites. The number of visits reflects the tourist's prior experience of visiting Suzhou. As Oppermann (1997) noted, repeat visitors to New Zealand visit fewer locations and have more concentrated itineraries. We obtained similar results on the Suzhou dataset, as shown in Figure 6(b). For the gender factor, we determined that females had a higher predictability. Gender features showed the highest improvement in terms of prediction accuracy for the LSTM model. By comparing the differences in the spatiotemporal movements and site choices between the two genders, we determined that, among the three main attraction types (natural, cultural, and commercial), the proportion of females who visited commercial attractions (36%) was slightly higher than that of males (33%). Kotzé et al. (2012) noted that women obtained more gratification from shopping than men. As commercial sites are more predictable, this may contribute to female tourists being more predictable than male tourists.
To discuss why some tourists have a high predictability, we further analyzed tourist preferences for choosing sites, as shown in Figure 14. As shown in the figure, tourists' preference for the site type affects their predictability. When the travel distance is 0-100 km, tourists prefer to visit commercial sites. Tourists prefer cultural sites when the travel distance is greater than 1200 km. Note that these two ranges of travel distances correspond to the highest and lowest predictability, respectively, according to the results in Figure 6. For tourist factors such as the number of visits and gender, tourists with a higher predictability (repeat-visit and female tourists) similarly prefer commercial sites (Figure 14(b,c)). Commercial sites constitute a small fraction of Suzhou's overall sites, with well-known examples including Guanqian Street, Pingjiang Road, and Jinji Lake (Figure 15). These sites offer a diverse range of tourist services, including accommodation, dining, shopping, and recreational activities. Consequently, given Suzhou's site type distribution, accurately predicting tourist visits to these commercial sites is not a challenging task. According to Figure 14, visitors to commercial sites in Suzhou are mainly repeat-visit and short-haul tourists. The primary motivation of these tourists revolves around leisure and relaxation rather than sightseeing. These findings support those of previous studies (McKercher and Du Cros 2003; Xue and Zhang 2020).

Predictability of spatiotemporal factors
Time is important for tourists. Like most tourists' budgets, it is a limited resource and therefore substantially constrains tourists' movement.
The time of arrival reflects the main way in which tourists visit sites. For example, tourists are unlikely to choose to visit magnificent natural sites at night because the visibility is too low. Figure 8(a) shows that the highest predictability is in the evening, whereas tourists are difficult to predict during the day. The sparsity of the dataset at midnight caused the predictability to be unstable. These results imply that tourists in Suzhou are more clustered at night.
Holidays offer various activities in which tourists can participate. Some studies used holiday data to ensure that most users in the dataset were tourists (Y. Yang, Li, and Li 2019). However, this approach can lead to biased data that hardly reflect tourists' movements during weekdays. We demonstrated that the predictability of holidays and workdays does not differ significantly.
Seasons are another important factor in tourist movements. Previous studies have used data over a short time span, which does not provide a good representation of the seasonal changes in tourists' movements. Suzhou has a subtropical monsoon climate with hot summers, abundant rainfall, and cold winters with less rainfall. Popular types of tourist sites in Suzhou are primarily gardens and ancient towns. Thus, the best seasons for visiting Suzhou are spring and autumn. As shown in Figure 8(c), this increases the difficulty of predicting tourist movement in these seasons. Winter has a relatively higher predictability, probably because Suzhou has fewer winter attractions.

The spatiotemporal factors extracted from tourist trajectories provide a new interpretation of tourist prediction. While recent deep learning methods have continued to improve the prediction accuracy on datasets, few models have attempted to improve the interpretability of the models. We attempt to correlate the features with the model predictability by analyzing the spatiotemporal features of tourists, as described in the results in Section 4.3.

The findings hold practical implications for the customization of tourism products and the planning of tourist sites. Regarding overall movement, long-itinerary tourists pose significant challenges in terms of prediction accuracy. This indicates that current models face difficulties in meeting the specific requirements of long-itinerary tourists. Consequently, it is crucial to prioritize the personalization and customization of tourism products tailored to the needs of long-itinerary tourists. In terms of local movement, the extreme point of predictability shown in Figure 13 indicates an important threshold (1-1.25 km) for tourists deciding which site to visit next. Therefore, neighboring sites within 1 km are more suitable for bundle marketing. In addition, less popular sites should consider leveraging the popularity of well-known sites within this range for site promotion. Furthermore, managers can situate their tourism-related services and facilities within this spatial range to offer better tourism services.

Conclusion
This study classified and summarized the factors that past studies identified as influencing tourist movement predictability and proposed a conceptual framework of influencing factors based on the theory of G. Lau and McKercher (2006), as shown in Figure 1. This study used social media data to explore tourist movement predictability from a multi-subject and multi-factor perspective. By comparing the models, features, and tourist sites, we obtained comprehensive and detailed analysis results of the tourist predictability, as shown in Table 1. By applying prediction models and evaluating their predictability, we identified some consistencies with the findings of previous studies. We determined significant differences in movement predictability for almost every factor and discussed possible reasons. We then proposed constructive suggestions for Suzhou's tourism management and decision-makers, considering the actual tourism situation in Suzhou.

Table 1 (excerpt). The Markov model is unsuitable for larger spatial scales | When using model prediction, the spatial scale must be considered | Xia, Zeephongsekul, and Arrowsmith (2009); Xia, Zeephongsekul, and Packer (2011); W. Zheng, Huang, and Li (2017)

This study has a few limitations. The dataset used in this study was collected from Sina Weibo, in which data bias may be difficult to avoid because the user base of Sina Weibo is dominated by a younger user group. The results are therefore not representative of the movement behavior of children and the elderly. Future studies should consider combining social media data with questionnaire data to improve the reliability of the data source.
In conclusion, we used social media data to analyze the factors that affect tourist movement prediction. We offer evidence to support the knowledge claimed by previous studies and provide a new theoretical basis for constructing tourist movement prediction models. In addition, we determined that the predictability of the LSTM model varies with different tourists, times, and sites. Therefore, future models that aim to further improve the prediction accuracy must consider the contributions of different features in the case at hand. Prediction models have improved steadily in recent years; however, few of these models have been combined with existing knowledge obtained from previous studies. This study explored the impact of features generated by influential factors on tourist prediction models and attempted to explain this impact. Future prediction models can incorporate travel costs between locations, thereby achieving enhanced accuracy and interpretability. The order of the visited locations should also be considered.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 2. Study area and Sina Weibo dataset. (a) The location of Suzhou at the country scale. (b) The location of Suzhou at the province scale. (c) The location of sites in Suzhou. (d) The geo-tagged microblogs in Suzhou.

Figure 3. Diagram of trajectory division. Note: The number in each circle is the date index. The circles indexed as 1, 9, and 12 are the dates when microblogs were posted in the tourist's residence city. The circles indexed as 3, 7, 15, 17, 18, and 19 are the dates when microblogs were not posted. The other circles are the dates when microblogs were posted in the destination city.

Figure 5. The training loss and testing loss of our model.

Figure 6. Boxplots of Top@1, Top@3, and the MRR for the prediction results of the LSTM model with tourist features.

Figure 7. Boxplots of Top@1, Top@3, and the MRR for the prediction results of the LSTM model with destination features.

Figure 8. Boxplots of Top@1, Top@3, and the MRR for the prediction results of the LSTM model with time features.

Figure 11. Length of tourist trajectories and the Top@1, Top@3, and MRR of the prediction.

Figure 12. Travel distance (S) of tourists and the Top@1, Top@3, and MRR of the prediction.

Figure 13. Relationship of the distance between sites and MRR at the local level.

Figure 14. Preference of tourists with different features for site types.

Table 1. Summary of the impact of factors on tourist movement prediction.