Qualitative analysis of housing demand using Google Trends data

Abstract
Big data analytics often refers to the breakdown of huge amounts of data into a more readable and useful format. This study utilises Google Trends big data as a proxy for an analysis of housing demand. We employ a qualitative method (fuzzy set/Qualitative Comparative Analysis, fsQCA), instead of a quantitative method, for estimation and forecasting. The empirical results show that fsQCA successfully forecasts a seasonal time series, even though the dataset is small. Our findings fill a gap in the qualitative and time series forecasting literature, and the forecasting procedure herein also offers a good standard for industry.


Introduction
One potential cause of inefficiency in a housing market is a lack of liquidity (Tsai & Tsai, 2018). Building a housing demand index could help explain why such liquidity may not arise at certain times. With Internet usage now common worldwide, vast amounts of consumer data have become available. A question thus arises: Can we analyse big data to forecast housing demand?
Big data technologies are naturally very useful when it comes to storing and processing huge amounts of data (Diaconita, 2015). Wu and Brynjolfsson (2015) showed that queries submitted to Google's search engine correlated strongly with both the volume of housing sales and a house price index released by the Federal Housing Finance Agency in the U.S. Hence, this study uses Google Trends data as the forecasting target, because the data are collected from the Internet and are readily available to the public.
Housing demand exhibits seasonality, as shown, for example, in Bangladesh (Ahmad, 2015) and the U.S. (Wu & Brynjolfsson, 2015). Other studies have applied a seasonal adjustment to model housing demand, such as in New Zealand (Grimes & Aitken, 2010), in Central, Eastern, and Southeastern Europe (Vandenbussche, Vogel, & Detragiache, 2015), and in Ireland (Addison-Smyth & McQuinn, 2010), to name a few.
Fuzzy set/Qualitative Comparative Analysis (fsQCA) is a variant of QCA (Roig-Tierno, Gonzalez-Cruz, & Llopis-Martinez, 2017). It reveals the sufficient conditions that lead to a specific outcome (equivalent to a dependent variable in multiple regression analysis, MRA) (Woodside, 2013; Woodside, Nagy, & Megehee, 2018). The sufficient conditions can be combinations of antecedents (equivalent to independent variables in MRA). Hence, to tackle seasonal time series like those existing for housing demand (Badun & Franić, 2015), this study (1) uses fsQCA to model an autoregressive time series of order k, AR(k), with the k preceding observations as the antecedents and the next time period's observation as the outcome, and (2) employs the model to forecast the AR time series.
CONTACT Kun-Huang Huarng khhuarng@ntub.edu.tw
The rest of this study is organised as follows. Section 2 reviews the relevant literature. Section 3 introduces how to use Google Trends data as a big data proxy for housing demand, fsQCA, and the model set-up. Section 4 discusses the empirical analyses. Finally, Section 5 concludes this study.

Literature review
FsQCA has been applied to solve various types of problems. For example, Trueb (2013) utilised fsQCA to integrate qualitative and quantitative data to create a useful index for economic and policy development. Vis, Woldendorp, and Keman (2013) employed fsQCA to examine the variation in economic performance. Denk and Lehtinen (2014) used both Qualitative Comparative Analysis (QCA) and fsQCA to conduct a contextual analysis of the mobilisation of minorities. Huarng (2015) employed fsQCA to analyse relationships within the development of information and communication technologies in more than 100 countries. Huarng and Yu (2015) used fsQCA to explore the sufficient conditions for the outcome of healthcare expenditure from various antecedent combinations such as longevity, number of doctors, and aging population. FsQCA provided antecedent combinations for each year to show that causal complexities lead to highly consistent outcomes. The analysis in that study also showed strong predictive validities. Rey-Martí, Ribeiro-Soriano, and Palacios-Marqués (2016) also used both QCA and fsQCA to analyse culinary tourism success and entrepreneurial attributes under human capital and contingency factors.
The literature also shows how fsQCA can be employed to solve various time series problems. Huarng (2016), for example, used fsQCA to explore the relationships between energy-consumption-related antecedents and the outcome of gross domestic product. The regime switches identified by drastic changes in the values of the antecedent combinations matched real historical events, such as oil crises. Huarng and Yu (2017) employed fsQCA to analyse the autoregressive relationships of upward and downward regime switches in in-sample data from the Taiwan Capitalisation Weighted Stock Index. The relationships were then used to forecast regime switches in the out-of-sample data, and the empirical results again showed that fsQCA provides strong predictive validities.

Data
This study uses Google Trends to search for the keyword '591.com', one of the most popular housing websites in Taiwan, and obtains the target data for this study (Google Trends data hereafter). The monthly data span the period January 2010 to March 2018. The data for the period January 2010 to November 2016 are used as the in-sample, and the data for the period December 2016 to March 2018 are used as the out-of-sample.

Seasonality
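We define a seasonal time series of order k as follows:
$$obs(t-k),\ obs(t-k+1),\ \ldots,\ obs(t-1) \rightarrow obs(t)$$
where t is the time period, and obs(t-k), obs(t-k+1), … are the Google Trends data. Because this study is modelling housing demand with monthly data, we set k to 11, so that the 11 preceding months serve as the antecedents for the following month. For example, when we want to establish the relationships among the data, we present:
$$obs(t-11),\ obs(t-10),\ \ldots,\ obs(t-1) \rightarrow obs(t)$$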
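The AR(k) set-up described above (k preceding observations as antecedents, the next observation as the outcome) can be sketched as a sliding window. This is a minimal illustration only: the function name `make_lagged_pairs` is hypothetical, and the toy values are invented rather than taken from the Google Trends data.

```python
def make_lagged_pairs(series, k):
    """Return (antecedents, outcome) pairs: k consecutive observations -> the next one."""
    pairs = []
    for t in range(k, len(series)):
        antecedents = series[t - k:t]  # obs(t-k), ..., obs(t-1)
        outcome = series[t]            # obs(t)
        pairs.append((antecedents, outcome))
    return pairs


# Illustrative values only (not the study's data): 14 months of a toy index.
toy = [50, 55, 60, 70, 65, 60, 58, 62, 66, 72, 68, 64, 52, 57]
pairs = make_lagged_pairs(toy, k=11)
print(len(pairs))   # number of (antecedents, outcome) pairs
print(pairs[0][1])  # outcome paired with the first 11 observations
```

With k = 11 and monthly data, each outcome month is paired with the 11 months immediately before it.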

FsQCA
Based on fuzzy sets and set theory, fsQCA is a popular qualitative analysis method and tool. There are four major differences between fsQCA and MRA (Ragin, 2008). First, MRA estimates coefficients that are symmetric by design, whereas fsQCA focuses on set relations that are asymmetric. Secondly, MRA conducts analysis on the data directly, whereas the data for fuzzy sets must be calibrated before analysis. Thirdly, MRA treats each independent variable individually, whereas fsQCA yields antecedent combinations as the sufficient conditions for an outcome. Fourth and lastly, MRA focuses on net effects, whereas in fsQCA there may be multiple antecedent combinations for the same outcome.
For fsQCA analysis, we first need to calibrate the data into values between 0.0 and 1.0, where 0.0 means full non-membership, 1.0 means full membership, and 0.5 is the crossover point, meaning neither full membership nor full non-membership. The data come from the Google Trends search for 591.com. Because the values of the data from Google Trends fall into the range between 26 and 100, we set 80, 60, and 40 as the thresholds for 1.0, 0.5, and 0.0, respectively, for calibration.
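A minimal sketch of calibration with the three anchors 40, 60, and 80 follows. It assumes a piecewise-linear mapping between the anchors, which is a simplification chosen for illustration; fsQCA software commonly uses a log-odds-based transformation instead, and the text does not state which form is used here. The function name `calibrate` is hypothetical.

```python
def calibrate(x, full=80.0, crossover=60.0, non=40.0):
    """Map a raw value to a fuzzy membership score in [0.0, 1.0].

    Assumption: linear interpolation between the anchors
    (non -> 0.0, crossover -> 0.5, full -> 1.0), clipped outside them.
    """
    if x <= non:
        return 0.0
    if x >= full:
        return 1.0
    if x < crossover:
        return 0.5 * (x - non) / (crossover - non)
    return 0.5 + 0.5 * (x - crossover) / (full - crossover)


# Raw Google Trends values range from 26 to 100 in the study.
for raw in (26, 50, 60, 70, 100):
    print(raw, calibrate(raw))
```

Values at or below 40 are treated as full non-membership and values at or above 80 as full membership, so the observed extremes 26 and 100 map to 0.0 and 1.0.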
Conversely, this study proposes a new step called de-calibration. In other words, after forecasting, we need to convert the fuzzy values back into real data so that we can compare the forecasting performance. For example, the value of 1.0 is de-calibrated to 80, 0.5 to 60, and 0.0 to 40.
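De-calibration can be sketched as the inverse mapping from membership scores back to the raw scale. The three anchor pairs (1.0 to 80, 0.5 to 60, 0.0 to 40) come from the text; the linear interpolation between them is an assumption made for this illustration, and the function name `decalibrate` is hypothetical.

```python
def decalibrate(m, full=80.0, crossover=60.0, non=40.0):
    """Map a fuzzy membership score in [0.0, 1.0] back to the raw data scale.

    Assumption: linear interpolation between the anchors
    (0.0 -> non, 0.5 -> crossover, 1.0 -> full).
    """
    m = min(1.0, max(0.0, m))  # keep the score inside [0, 1]
    if m < 0.5:
        return non + (m / 0.5) * (crossover - non)
    return crossover + ((m - 0.5) / 0.5) * (full - crossover)


for score in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(score, decalibrate(score))
```

Because scores are bounded by 0.0 and 1.0, de-calibrated forecasts under this sketch always fall between 40 and 80.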
FsQCA analyses the relationships between the calibrated data and renders antecedent combinations, such as:
$${\sim}X * Y \rightarrow Z$$
where X and Y are the antecedents (equivalent to independent variables), and Z is the outcome (equivalent to the dependent variable). The symbol * represents AND; the symbol ~ represents NOT; and the symbol → means 'is sufficient for'. The antecedents connected by * and ~ are called antecedent combinations (AC), such as ~X * Y. The above equation means that the antecedent combination ~X * Y is a sufficient condition for Z. In other words, ~X * Y can lead to Z.
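As a brief illustration of how an antecedent combination is scored, the sketch below uses the standard fuzzy-set operators, in which NOT is one minus the membership and AND is the minimum of the memberships. The helper names and the membership values are hypothetical.

```python
def fuzzy_not(a):
    """Fuzzy negation: membership in NOT-A."""
    return 1.0 - a


def fuzzy_and(*memberships):
    """Fuzzy intersection: membership in A AND B AND ... is the minimum."""
    return min(memberships)


# Hypothetical calibrated memberships for antecedents X and Y.
x, y = 0.2, 0.7
comp = fuzzy_and(fuzzy_not(x), y)  # membership in the combination NOT-X AND Y
print(comp)
```

The resulting score plays the role of comp, the calibrated value of the antecedent combination.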
We calculate consistency as follows:
$$\text{consistency} = \frac{\sum_{i} \min(comp_i, y_i)}{\sum_{i} comp_i}$$
where comp represents the calibrated value for each AC, and y is the calibrated value of the outcome.
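The consistency measure can be sketched as the sum of the element-wise minima of comp and y divided by the sum of comp, which is the standard fsQCA sufficiency consistency. The function name and the sample memberships are hypothetical.

```python
def consistency(comp, y):
    """Sufficiency consistency: sum(min(comp_i, y_i)) / sum(comp_i)."""
    if len(comp) != len(y):
        raise ValueError("comp and y must have the same length")
    denominator = sum(comp)
    if denominator == 0:
        raise ValueError("sum of comp is zero; consistency is undefined")
    numerator = sum(min(c, o) for c, o in zip(comp, y))
    return numerator / denominator


# Hypothetical calibrated memberships for three cases.
comp = [0.5, 0.25, 1.0]
y = [0.75, 0.25, 0.5]
print(consistency(comp, y))
```

Consistency equals 1.0 when every comp value is no greater than the corresponding y value, and it falls below 1.0 as cases violate that subset relation.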

Empirical analyses
For both the in-sample and out-of-sample data, we first conduct calibration. Based on the calibrated data, we then proceed to the model and lastly to the forecast.

In-sample estimation
The in-sample data allow us to establish relationships. For example, the in-sample data are: Based on the above relationship for December (as the outcome), fsQCA generates two ACs leading to this outcome (December): All the ACs can be generated for the other months, as seen in Table 1. Note that there can be multiple ACs leading to one month. Table 2 lists the correspondences for all the ACs.

Out-of-sample forecasting
Because we execute the forecast one by one, there is no summation in the consistency function. Therefore, the equation becomes:
$$\text{consistency} = \frac{\min(comp, y)}{comp}$$
For out-of-sample forecasting, we have comp' and y', representing the calibrated values of the out-of-sample data:
$$\text{consistency} = \frac{\min(comp', y')}{comp'}$$
In other words, we have:
$$\min(comp', y') = comp' \times \text{consistency}$$
whereby:
If comp' ≥ y', then min(comp', y') = y';
If comp' < y', then there is no way to know y' from min(comp', y').
Hence, we suppose min(comp', y') = y'. In other words, y' = comp' × consistency. The value of consistency is known from the in-sample estimation. The values of comp' are calculated from the out-of-sample data, as seen in Table 3. Thus, we are able to calculate the value of y', as Table 4 shows.
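The forecasting step, in which the forecast membership is the product of the out-of-sample antecedent-combination score and the in-sample consistency, can be sketched as follows. The function name, the AC labels, and all numeric values are hypothetical and are not taken from Tables 3 or 4.

```python
def forecast_membership(comp_new, consistency):
    """Forecast membership as comp' * consistency, assuming min(comp', y') = y'."""
    if not 0.0 <= comp_new <= 1.0 or not 0.0 <= consistency <= 1.0:
        raise ValueError("comp' and consistency must lie in [0, 1]")
    return comp_new * consistency


# Hypothetical out-of-sample scores and in-sample consistencies for two ACs.
comp_by_ac = {"comp1": 0.5, "comp2": 0.25}
consistency_by_ac = {"comp1": 0.9, "comp2": 0.8}
forecasts = {
    ac: forecast_membership(score, consistency_by_ac[ac])
    for ac, score in comp_by_ac.items()
}
print(forecasts)
```

When several ACs lead to the same month, this yields one candidate forecast per AC.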
The values of y' may be greater than or lower than 0.5. If the future trend is clearly rising or dropping, then we can determine whether the values should be greater or lower than 0.5. Suppose the trend is upward (downward); the values therefore need to be greater (lower) than 0.5. If the values are opposite to this, then we need to convert them by taking 1 - y'.
All the values (except comp1' for 2017 April and comp1' for 2017 December) in Table 4 are lower than 0.5. However, we know that the trend is upward. Hence, we take 1 - y' for all these values, as presented in Table 5. The values greater than 0.5 remain the same. As stated previously, we de-calibrate the values back to the real data, listing the de-calibrated results in Table 6.
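The trend-based conversion described above can be sketched as a small rule: under an upward trend a forecast below 0.5 is replaced by one minus the forecast, and symmetrically for a downward trend. The function name and the `trend` labels are hypothetical.

```python
def adjust_for_trend(y_hat, trend):
    """Flip a forecast membership to 1 - y_hat when it contradicts the known trend.

    trend: "up" means forecasts should exceed 0.5; "down" means they should be below 0.5.
    """
    if trend not in ("up", "down"):
        raise ValueError("trend must be 'up' or 'down'")
    if trend == "up" and y_hat < 0.5:
        return 1.0 - y_hat
    if trend == "down" and y_hat > 0.5:
        return 1.0 - y_hat
    return y_hat


print(adjust_for_trend(0.25, "up"))    # below 0.5 under an upward trend: flipped
print(adjust_for_trend(0.75, "up"))    # already above 0.5: unchanged
print(adjust_for_trend(0.75, "down"))  # above 0.5 under a downward trend: flipped
```

The rule relies on the trend direction being known in advance, as the text assumes.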

Performance evaluation
There are different ways to validate the empirical results. First, as in many fsQCA studies, we use an XY plot to see whether the forecasts fall in the upper-left triangle of the chart, which would indicate good forecasts. In Figure 1, the forecasts that lie above the diagonal line are therefore considered good forecasts. Table 7 shows that all the forecasts are denoted as being good, except that for 2018 February.
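The upper-left-triangle check can be sketched as a comparison of each point against the diagonal. The sketch assumes the horizontal coordinate is the antecedent-combination score and the vertical coordinate is the outcome score, and it counts points on the diagonal as good; neither detail is spelled out in the text. The labels and values are hypothetical.

```python
def is_good_forecast(x, y):
    """True when the point (x, y) lies on or above the diagonal y = x."""
    return y >= x


# Hypothetical (x, y) points for two months.
points = {"month A": (0.25, 0.75), "month B": (0.75, 0.5)}
flags = {label: is_good_forecast(x, y) for label, (x, y) in points.items()}
print(flags)
```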
Secondly, one can always use conventional root mean squared errors (RMSEs) to evaluate forecasting performance. There is a de-calibrated value corresponding to each relationship. When there are multiple relationships for the same outcome, there are multiple de-calibrated values. Hence, choosing which de-calibrated value should represent the outcome becomes an issue. There are three possible ways to decide. Method 1 picks the largest product value among the different comps. Method 2 chooses the smallest product value among the different comps. Method 3 chooses the largest product value, but gives priority to values greater than 0.5. Table 8 compares the RMSEs.
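The RMSE comparison can be sketched as below, together with the selection rules of Methods 1 and 2 (largest and smallest candidate). Method 3 is left out because its tie-breaking rule is not fully specified in the text. For simplicity the sketch applies the selection directly to de-calibrated candidates; all names and numbers are hypothetical and are not the values behind Table 8.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def pick(candidates, method):
    """Choose one representative value per period: 'largest' or 'smallest'."""
    if method == "largest":
        return max(candidates)
    if method == "smallest":
        return min(candidates)
    raise ValueError("method must be 'largest' or 'smallest'")


# Hypothetical data: two periods, with two and one candidate forecasts respectively.
actual = [70, 64]
candidates = [[66, 72], [60]]
largest = [pick(c, "largest") for c in candidates]
smallest = [pick(c, "smallest") for c in candidates]
print(rmse(actual, largest), rmse(actual, smallest))
```

Comparing the two RMSEs shows how the choice of representative value alone can change the measured forecasting performance.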

Discussion and conclusion
To tackle a seasonal time series for housing demand, this study uses fsQCA (1) to model the time series as an AR(k) and (2) to forecast the AR time series with the resulting model. First, this study demonstrates how Google Trends data can represent a proxy source of big data, offering a proper standard for industry. Secondly, fsQCA generates multiple relationships that lead to the same outcome. In other words, multiple causes can lead to the same result. Thirdly, when facing a small dataset, fsQCA is still able to successfully model and forecast a higher-order time series and to outperform conventional statistical methods.
There are a few limitations of this study. As discussed in the text, the equation to calculate consistency involves a minimum function, which absorbs some amount of information. Hence, during out-of-sample forecasting, there is no way to discover the absorbed information. In addition, multiple relationships also cause bias or problems in out-of-sample forecasting. A consistent approach is needed to determine which relationship is more suitable in a given situation.
Following these limitations, we suggest that future studies focus on how to choose the proper relationship from multiple relationships. Such investigations can be done by applying domain-specific knowledge, such as through heuristics or with domain-independent knowledge. Another interesting topic to work on is how to retain the absorbed information in a minimum function.