Topic modeling for analyzing open-ended survey responses

ABSTRACT Open-ended responses are widely used in market research studies. Processing of such responses requires labour-intensive human coding. This paper focuses on unsupervised topic models and tests their ability to automate the analysis of open-ended responses. Since state-of-the-art topic models struggle with the shortness of open-ended responses, the paper considers three novel short text topic models: Latent Feature Latent Dirichlet Allocation, Biterm Topic Model and Word Network Topic Model. The models are fitted and evaluated on a set of real-world open-ended responses provided by a market research company. Multiple components such as topic coherence and document classification are quantitatively and qualitatively evaluated to appraise whether topic models can replace human coding. The results suggest that topic models are a viable alternative for open-ended response coding. However, their usefulness is limited when a correct one-to-one mapping of responses and topics or the exact topic distribution is needed.


Introduction
Surveys are a pivotal research instrument to gain insight into a study subject. In market research, for example, surveys facilitate eliciting the opinions, attitudes, and preferences of consumers and thus provide critical insights for product development and business process management. Open-ended (OE) questions are a crucial component of surveys. They are used to clarify ambiguities and identify opinions that researchers have not thought of before (Lazarsfeld, 1935; Roberts et al., 2014; Schuman, 1966). Likewise, OE questions provide an opportunity to elicit a subject even if a researcher lacks sufficient knowledge about the topic to define a closed question (Converse, Jean McDonnell, & Presser, 1986). Another advantage of OE questions compared to closed questions is the ability to detect spontaneous thoughts and explore attitudes. Accordingly, common use cases of OE questions in market research include measuring the awareness and recall of brands, attitudes towards a product or activity, as well as likes and dislikes among consumers (Brace, 2018).
However, OE questions also have a major disadvantage: their analysis is associated with a high workload. Aiming to identify the topics mentioned in the OE responses and their relative importance, the typical approach requires analysts to read and categorize all or a selection of responses manually (Roberts et al., 2014). Such a manual process is time-consuming and prone to errors, especially when multiple researchers analyse the responses separately (between-rater variance) (Tinsley & Weiss, 1975).
Topic models cluster documents based on the assumption that each document is a mixture of latent topics. A quasi-standard in this field is Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003). However, LDA is less suitable for processing short texts such as OE responses (Sridhar, 2015; Tang, Meng, Nguyen, Mei, & Zhang, 2014). Therefore, the paper consolidates previous work on short text topic modelling and tests the effectiveness of corresponding methods to analyse OE responses in market research.
Previous work on topic modelling for OE responses includes Roberts et al. (2014), who implement Structural Topic Models, and Leleu et al. (2011), who use Latent Semantic Analysis (LSA) (Landauer & Dumais, 1997) to analyse OE responses. Yet, Roberts et al. (2014) have a different focus than the current paper, namely the integration of covariates into topic models, and Leleu et al. (2011) forgo a quantitative and qualitative assessment of the topics although this is essential for the current paper's objectives. Hence, to the best of the authors' knowledge, the literature still lacks a systematic analysis of the potential of topic modelling for OE responses.
Several studies focus on topic extraction from text data that share some characteristics with OE responses, including tweets (Bicalho et al., 2017; Hong & Davison, 2010; Jin et al., 2011; Mehrotra et al., 2013; Nguyen et al., 2015; Weng et al., 2010; Yan et al., 2013; Zhao et al., 2011; Zuo et al., 2016), weblogs (Singh, Waila, Piryani, & Uddin, 2013; Tsai, 2011) and online reviews (Brody & Elhadad, 2010; Titov & McDonald, 2008). Due to the lack of established approaches for OE responses, we examine whether approaches for those three types of corpora can be adapted to OE responses. To shed some light on this matter, Table 1 outlines the most important similarities and differences of OE responses on the one side and tweets, weblogs and online reviews on the other side.
As seen in Table 1, microblog entries resemble OE responses in terms of the use of informal language. An important difference concerns the number of covered topics. While tweets usually address a single topic, OE responses often cover multiple ones. The text length is another characteristic where tweets and OE responses display similarities but also differences. Twitter enforces a maximum length of 140 characters per tweet. Market research surveys do not enforce a maximum length for OE responses so that these can be substantially longer. In practice, however, survey respondents often provide only short answers to OE questions. For example, Gendall, Menelaou, and Brennan (1996) report an average response length between 4.5 and 7 words per response. These figures are consistent with the experience of the market research agency that supports the focal study by providing real-world data. As detailed in Section 3.1, the data we employ exhibits an average length of 5.5 words per OE response. In this regard, we suggest that the length of tweets and OE responses is, in practice, often similar on average, whereas the length of OE responses exhibits much larger variance than that of tweets. This also suggests that microblog entries are more similar to OE responses than weblog entries and online reviews, which share the language style but differ in document length.
The shortness of OE responses, which is often observed in practice, represents the main challenge for topic modelling in the market research context considered in this study. As microblog entries and OE responses resemble each other in terms of length (Naveed, Gottron, Kunegis, & Alhadi, 2011), a brief overview of related work with a focus on topic modelling for short text, mostly applied to tweets, is provided in the following.
Several techniques for extracting topics from short texts have been proposed in the literature. A recent study by Bicalho et al. (2017) systematizes the field and introduces a general framework for overcoming the specific challenges of short text topic modelling. In general, short text topic models split into two categories. The first one uses auxiliary information to enrich the input (knowledge-based approaches). Examples include corpus-related metadata (Hong & Davison, 2010; Mehrotra et al., 2013; Weng et al., 2010), external knowledge sources like auxiliary long text (Jin et al., 2011; Phan et al., 2008) or word embeddings (Bicalho et al., 2017; Nguyen et al., 2015). The second category includes corpus-based approaches that rely exclusively on the target corpus, meaning the text corpus from which topics shall be extracted, such as the collection of OE responses in this paper. Corpus-based approaches modify the topic modelling process itself (Mihalcea, Courtney, & Strapparava, 2006). Examples include the introduction of stronger assumptions about the data (Bicalho et al., 2017; Nguyen et al., 2015; Zhao et al., 2011) or the manipulation of the document generation process (Yan et al., 2013; Zuo et al., 2016). Table 2 outlines relevant prior studies, divided into knowledge-based and corpus-based approaches, including the respective target corpora and methodology. It further shows where the current study is positioned: it fills the gap of short text topic models applied to OE responses in both categories.
Table 1. Similarities and differences of OE responses compared to tweets, weblog entries and online reviews.
Tweets
Similarities to OE responses:
• Document shortness, informal language (Naveed et al., 2011)
• While OE responses can be much longer than tweets, survey respondents often provide only relatively short answers of 4.5 to 7 words on average (Gendall et al., 1996)
Differences from OE responses:
• Coverage of a single topic
• Coverage of broad topics like politics or sports (Hong & Davison, 2010; G. Lockot, personal communication, September, 2017)
Weblog entries
Similarities to OE responses:
• Informal language
Differences from OE responses:
• Document length (Singh et al., 2013)
Online reviews
Similarities to OE responses:
• Informal language
• Topic granularity (focus on specific details) (Liu, 2012)
Differences from OE responses:
• Document length

Using a set of real-world OE responses from a market research company, this study explores the potential of three short text topic models for OE responses and compares them to LDA as a benchmark: Latent Feature LDA (LFLDA) (Nguyen et al., 2015), Biterm Topic Model (BTM) (Yan et al., 2013) and Word Network Topic Model (WNTM) (Zuo et al., 2016). In each of the three studies, the proposed short text topic modelling approach has been compared to LDA as baseline using data related to microblog entries. The studies consistently observe an improvement over this baseline suggesting that all three methods outperform LDA on microblog entries. WNTM additionally shows good performance when dealing with topic imbalance (Zuo et al., 2016). This is relevant for OE responses as usually some topics are mentioned much more frequently than others. Further, the methods are not associated with any assumptions or requirements that are not transferable to OE responses, like the restriction of having only one topic per document or the need for metadata. Hence, we consider their potential for analysing OE responses as high.
Table 2 suggests that the extraction of topics from short texts has received considerable attention in previous work. However, we also observe from Table 2 that corresponding studies have not looked into the specific application context of OE responses, which is the goal of this paper. Using real-world data from user surveys, we add to the literature by providing original empirical evidence concerning the potential of selected short text topic models in OE response processing. More specifically, the paper makes two contributions: First, it investigates the extent to which topic modelling can replace manual analysis of OE responses. To that end, we evaluate topic model results along two dimensions: the comprehensibility of extracted topics (topic quality), and the amount of information to represent OE responses and derive the topic distribution (topical document representation). Both dimensions are relevant for the suitability of topic modelling in market research.
Second, the paper elaborates on the relative merits and demerits of alternative short text topic models to provide guidance for researchers and practitioners on how to choose the right method for a given market research task.

Latent Dirichlet allocation
Topic modelling is an approach to cluster text documents, assuming that each document is a function of latent variables called topics (Aggarwal & Zhai, 2012). LDA, introduced by Blei et al. (2003), represents a state-of-the-art method in this field (Hong & Davison, 2010). Yet, despite its wide popularity, LDA does not work well for every kind of text data. While it successfully models topics for corpora like news articles (Blei et al., 2003) and scientific papers (Griffiths & Steyvers, 2001), it shows disappointing results for short documents and small corpora (Sridhar, 2015; Tang et al., 2014). In the latter cases, data sparsity and limited context prevent a reliable extraction of document-based word co-occurrences, which is the basis for LDA (Sridhar, 2015). Also, LDA tends to detect frequent topics better than rare ones (Zuo et al., 2016) and broad topics better than specific ones (Titov & McDonald, 2008). Thus, corpora with imbalanced topic distributions and those that require a detailed analysis are also challenging. These critical characteristics apply to OE responses, which leads to the assumption that LDA is not ideal for this kind of data. LDA serves as a benchmark in the empirical part of the paper and as the foundation for introducing short text topic models.
LDA is a three-level hierarchical Bayesian model where each document $d_m$ is modelled as a finite mixture over a set of $K$ corpus-wide topics $z_k$ (Blei et al., 2003). Each topic, in turn, is a distribution over a fixed set of $V$ words $w_v$. As a generative model, LDA assumes that the words that a document contains are generated by the latent topics. Therefore, LDA tries to infer the latent topics that could have generated the documents. For finding these topics, LDA uses the word co-occurrence pattern in the corpus, which is derived from the document-term matrix (DTM). In doing so, a key component of LDA is the "bag-of-words" assumption, meaning that the order of words is ignored (Blei et al., 2003).
The more often two words co-occur in a document, the more likely they belong to the same topic (Aggarwal & Zhai, 2012). The generation process can be formally described as follows (Blei et al., 2003):
(1) For each topic $z$, choose the probabilities over words $\phi_z \sim \mathrm{Dir}(\beta)$, where $\phi_z$ is drawn from a symmetric Dirichlet prior distribution with parameter β.
(2) For each document $d$, choose the probabilities over topics $\theta_d \sim \mathrm{Dir}(\alpha)$, where $\theta_d$ is drawn from a symmetric Dirichlet prior distribution with parameter α.
(3) For each word $w_{dn}$ in document $d$, choose a topic $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$ and then choose a word $w_{dn}$ from the multinomial distribution $w_{dn} \sim \mathrm{Multinomial}(\phi_{z_{dn}})$.
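The three steps above can be illustrated with a minimal simulation sketch. It is not the implementation used in this paper (LDA is fitted with the R package topicmodels); the function name `generate_corpus`, the fixed document length and the toy values for M, K, V, α and β are assumptions made purely for illustration.

```python
import numpy as np

def generate_corpus(M, K, V, doc_length, alpha, beta, seed=0):
    """Sample a toy corpus of word ids from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # (1) one word distribution phi_z per topic, drawn from a symmetric Dir(beta)
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    # (2) one topic distribution theta_d per document, drawn from a symmetric Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)
    docs = []
    for d in range(M):
        # (3) per word position: draw a topic z_dn, then a word w_dn from phi[z_dn]
        topics = rng.choice(K, size=doc_length, p=theta[d])
        docs.append([int(rng.choice(V, p=phi[z])) for z in topics])
    return phi, theta, docs

# Toy values (assumed, not the settings of the paper)
phi, theta, docs = generate_corpus(M=4, K=3, V=20, doc_length=6, alpha=0.1, beta=0.1)
```

Inference runs in the opposite direction: given only the observed documents, the posterior of `theta`, `phi` and the topic assignments is estimated.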
The functioning of LDA is often illustrated using the plate notation of Figure 1 where a circle represents a random variable and an arrow a unilateral dependency between variables. The processes within a box are repeated multiple times with capital letters giving the number of repetitions.
The number of topics K as well as the Dirichlet hyperparameters α and β are determined prior to modelling. The parameter α denotes the prior document-topic distribution and the parameter β the prior topic-word distribution (Griffiths & Steyvers, 2001). The posterior distributions of $\theta_d$, $\phi_z$ and $z$ are inferred by using collapsed Gibbs sampling (Griffiths & Steyvers, 2002), following previous works (Griffiths & Steyvers, 2001; Nguyen et al., 2015; Yan et al., 2013; Zuo et al., 2016).

Application of topic models to open-ended responses
Market researchers are mainly interested in two things: Identifying the topics that are mentioned in OE responses and the topics' relative distribution. The former is provided by the posterior topic-word distribution ϕ, which is one output of a topic model. ϕ provides the likelihood for each word belonging to each topic. By considering only the top words, i.e. those that are most likely to appear in a topic, one can derive the content of the topics (Blei et al., 2003). The top words are most interesting because the lower the topic-word probability, the weaker the topic-word relation. Topic models do not provide labels for the topics so that the interpretation and labelling of extracted topics is left to the researcher (Schouten & Frasincar, 2016).
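Reading off the top words from the topic-word distribution can be sketched as follows. The helper `top_words`, the five-word vocabulary and the probabilities in `phi` are hypothetical; in practice ϕ would come from a fitted topic model.

```python
import numpy as np

def top_words(phi, vocab, n=10):
    """Return the n most probable words for every topic (one row of phi per topic)."""
    # argsort is ascending, so reverse each row and keep the first n columns
    order = np.argsort(phi, axis=1)[:, ::-1][:, :n]
    return [[vocab[v] for v in row] for row in order]

# Hypothetical vocabulary and topic-word probabilities (K = 2, V = 5)
vocab = ["easy", "tools", "community", "free", "support"]
phi = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],
                [0.05, 0.05, 0.40, 0.20, 0.30]])
tops = top_words(phi, vocab, n=3)
```

The resulting word lists still have to be interpreted and labelled by the researcher.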
The posterior document-topic distribution $\theta_d$ can provide insights into the topics in addition to the top words. $\theta_d$ is represented as an $M \times K$ matrix where for each document $d$ and each topic $z$, the probability $P(z \mid d)$ shows how likely it is that $z$ is present in $d$. $\theta_d$ can be used to find the most representative documents (top documents) for $z$, i.e. the documents with the highest document-topic probability for $z$. The top documents can help to further describe a topic (Aggarwal & Zhai, 2012).
The share of documents that contain a topic compared to the corpus size can also be derived from $\theta_d$. By choosing a threshold $t$, one can assign only those topics to each document for which $P(z \mid d) > t$. This can be used to compute the share of the topics over the whole corpus. In market research, the share of documents corresponds to the share of respondents mentioning a certain topic.
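Both uses of the document-topic matrix can be sketched in a few lines. The function names, the toy matrix and the threshold t = 0.4 are assumptions for illustration only.

```python
import numpy as np

def top_documents(theta, z, n=3):
    """Indices of the n documents with the highest probability for topic z."""
    return np.argsort(theta[:, z])[::-1][:n]

def topic_shares(theta, t):
    """Share of documents to which each topic is assigned, i.e. P(z | d) > t."""
    assigned = theta > t              # boolean (M, K) matrix; several topics per document allowed
    return assigned.mean(axis=0)      # column means = share of documents per topic

# Hypothetical document-topic matrix (M = 4 documents, K = 2 topics)
theta = np.array([[0.90, 0.10],
                  [0.50, 0.50],
                  [0.20, 0.80],
                  [0.05, 0.95]])
shares = topic_shares(theta, t=0.4)
best_for_topic_1 = top_documents(theta, z=1, n=2)
```

Because a document can exceed the threshold for several topics, the shares do not need to sum to one, which matches the multi-label nature of OE responses.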

Short text topic models
This section introduces the three short text topic models LFLDA, BTM and WNTM. It briefly presents the differences to LDA and explains why they are more suitable for OE responses.

LFLDA
Nguyen et al. (2015) complement the sparse co-occurrence pattern in short documents through integrating vector representations of words (hereinafter: word vectors). They use two sets of pre-trained word vectors: The first one is trained on a subset of the Google News corpus via Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) and the second one on Common Crawl web data via Global Vectors for Word Representation (GloVe) (Pennington, Socher, & Manning, 2014).
As with LDA, LFLDA tries to find the latent topic structure that could have generated the observed documents. The generative process is similar to LDA but differs in how words are generated from topics. In LDA, a word can only be drawn from the Dirichlet multinomial distribution ϕ that is trained on the target corpus. In contrast, LFLDA allows each word to be drawn from either that distribution or from a multinomial distribution based on the vector representation of every word and topic in the corpus. By incorporating the vector representations, LFLDA uses information about word-topic relations from larger external corpora. Hence, LFLDA circumvents the issue of LDA with the sparse information in short text about the word co-occurrence structure.
To determine from which of the two distributions a word $w_{dn}$ is drawn, a binary indicator variable $s_{dn}$ is sampled from a Bernoulli distribution $\mathrm{Ber}(\lambda)$. The hyperparameter λ determines the probability with which a word is sampled from the latent feature component.
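The switch between the two components can be sketched as below. The form of the latent feature component, a softmax over dot products of a topic vector with the word vectors, is an assumption in the spirit of Nguyen et al. (2015) and is not spelled out in this section; all names and the random toy vectors are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-d array."""
    shifted = np.exp(scores - np.max(scores))
    return shifted / shifted.sum()

def draw_word(phi_z, topic_vec, word_vecs, lam, rng):
    """Draw one word id for a topic; also report which component was used."""
    s_dn = rng.random() < lam                      # s_dn ~ Ber(lam)
    if s_dn:
        # latent feature component (assumed form): softmax of topic-word vector products
        probs = softmax(word_vecs @ topic_vec)
    else:
        # Dirichlet multinomial component, as in LDA
        probs = phi_z
    return int(rng.choice(len(probs), p=probs)), bool(s_dn)

rng = np.random.default_rng(0)
V, dim = 5, 4                                      # toy vocabulary size and vector dimension
word_vecs = rng.normal(size=(V, dim))              # stand-in for pre-trained word vectors
topic_vec = rng.normal(size=dim)                   # stand-in for a learned topic vector
phi_z = rng.dirichlet(np.full(V, 0.5))
word, from_latent = draw_word(phi_z, topic_vec, word_vecs, lam=0.6, rng=rng)
```

With λ = 0 the sketch reduces to the LDA word draw, and with λ = 1 every word comes from the latent feature component.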

BTM
In contrast to LFLDA, BTM (Yan et al., 2013) does not use an external knowledge source to deal with the short documents' missing context. However, it differs from LDA in two other regards that concern the topic modelling input and the generative process.
First, the input to topic modelling is not the set of documents D as in LDA but the corpus-wide set of biterms B. A biterm b is defined as "an unordered word-pair co-occurred in a short context" (Yan et al., 2013, p. 1446) where a short context denotes a document. For example, the document "great customer service" consists of three biterms: "great customer", "customer service" and "great service". The biterm approach of BTM is based on the assumption that there is a topic distribution $\theta$ for the entire corpus instead of a topic distribution $\theta_d$ for each document. Consequently, the hyperparameter α denotes the prior corpus-topic distribution and not the document-topic distribution.
Second, LDA uses the word co-occurrence pattern per document to generate words. In contrast, BTM generates biterms instead of single words. The aim of the generative process in BTM is finding the latent topics that could have generated the biterms, which make up the corpus.
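The construction of the biterm set can be sketched as follows, treating each document as one short context. The function names are hypothetical and the snippet is not taken from the published BTM source code.

```python
from itertools import combinations

def biterms(tokens):
    """All unordered word pairs of one document (the document is the short context)."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

def corpus_biterms(docs):
    """Corpus-wide list of biterms B, the input to BTM instead of the documents D."""
    return [b for tokens in docs for b in biterms(tokens)]

example = biterms("great customer service".split())
B = corpus_biterms([["great", "customer", "service"], ["easy"], ["free", "tools"]])
```

A document with a single word yields no biterm, which is why such documents cannot contribute to BTM training.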
As the topic inference in LDA is based on the word co-occurrences per document, the issue with short text like OE responses is that their shortness leads to a relatively sparse word co-occurrence structure per document. The major advantage of BTM is that it uses the entire corpus as input, which makes the topic model insensitive to document shortness and hence improves the detection of topic-word relations.

WNTM
WNTM (Zuo et al., 2016) infers topic distributions for words instead of documents to circumvent the sensitivity of LDA towards document length. This requires a transformation of the input documents. By moving a sliding window of length S through each document, a word co-occurrence network is created where the network nodes represent the vocabulary of the corpus and the edges the co-occurrences of each word pair weighted by the number of co-occurrences in the corpus. Subsequently, for each word $w_v$ a pseudo-document $d_p$ is created that consists of all words that co-occur with $w_v$, i.e. all words that are connected to $w_v$ in the word network. Instead of using the original text documents as input to topic modelling, as done in LDA and LFLDA, the newly generated pseudo-documents are used as input in WNTM. Hence, the key difference between the generative processes of LDA and WNTM is that WNTM does not generate the original but the pseudo-documents.
The key difference between the output of LDA and WNTM is the interpretation of $\theta_{d_p}$, which denotes the probability of each topic being present in a pseudo-document $d_p$. A pseudo-document entails a word's context information across the entire corpus. Hence, $\theta_{d_p}$ is regarded as the distribution over topics for each word, where each word in turn is represented by its pseudo-document.
The advantage of using WNTM for short text like OE responses is twofold. First, modelling topics for words by considering a word's co-occurrences across the entire corpus decreases the model's problem with document shortness. Similar to BTM, this improves topic extraction as the words' contextual information is not limited to the co-occurrences within a document. Second, there are more words than documents that are related to rare topics. Thus, the authors claim that WNTM is better capable of detecting rare topics than other topic modelling approaches (Zuo et al., 2016). This is relevant for OE responses as usually some topics are mentioned by many more respondents than others.
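The transformation from documents to pseudo-documents can be sketched as below. Counting a word pair once for every window that contains both words, and repeating a neighbour in the pseudo-document according to the edge weight, are assumptions of this sketch; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def word_network(docs, window):
    """Weighted co-occurrence edges from a sliding window of fixed length."""
    edges = Counter()
    for tokens in docs:
        # documents shorter than the window produce no windows at all
        for start in range(len(tokens) - window + 1):
            for a, b in combinations(tokens[start:start + window], 2):
                if a != b:
                    edges[tuple(sorted((a, b)))] += 1
    return edges

def pseudo_documents(edges):
    """One pseudo-document per word: its neighbours, repeated by edge weight."""
    pseudo = defaultdict(list)
    for (a, b), weight in edges.items():
        pseudo[a].extend([b] * weight)
        pseudo[b].extend([a] * weight)
    return dict(pseudo)

docs = [["great", "customer", "service"],
        ["customer", "service", "team", "helpful"]]
edges = word_network(docs, window=3)
pseudo = pseudo_documents(edges)
```

The pseudo-documents, rather than the original responses, would then be passed to a standard LDA-style sampler.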

Data
To examine whether topic modelling can serve as an alternative for analysing OE responses and which of the selected topic models works best for this kind of data, several experiments are conducted on real-world OE responses. The data source and pre-processing tasks as well as a summary of the corpus' main characteristics are presented in the following.

Data source
The dataset is provided by a Berlin-based market research company (hereinafter: partner company). The data belongs to an online survey of software developers, which is repeated quarterly. The current paper focuses on an OE question of this survey where developers are asked why they recommend developing on a certain platform. The data was gathered between December 2014 and July 2017 and 7,743 responses are available for this question. This set of responses makes up the target corpus for this paper.
Each quarterly repetition of the study is analysed separately by the partner company. Because of the high workload that is associated with the evaluation of OE responses, only a random sample of approximately 450 responses per wave is manually coded. This leads to 5,001 labelled responses in total. There are nine different labels that can be assigned to the responses. Responses that cannot be assigned to any of those labels are classified as "other". This "other" category is a collection of side issues deemed too small to get their own label. A team of researchers is responsible for coding, some of whom have been involved in the project from the start while others were only involved in some waves. In total, seven researchers have been involved in the coding (G. Lockot, personal communication, September, 2017).

Pre-processing
Several pre-processing steps are conducted to increase the quality of the dataset and to transform data in such a way that it complies with the requirements of (short text) topic models. First, standard pre-processing tasks are performed, including the translation of non-English responses, lemmatization, conversion to lowercase and the removal of numbers, punctuation, stop words and infrequent words (Manning, Raghavan, & Schütze, 2009). This leads to a vocabulary of V = 766 unique words and a corpus of M = 7,622 documents.
For LFLDA, BTM and WNTM some method-specific data preparation is performed. For LFLDA, a set of pre-trained GloVe word vectors (Pennington et al., 2014) is chosen following Nguyen et al. (2015). The set is trained on 42 billion tokens of Common Crawl web data and contains 300-dimensional vectors for 1.9 million words. For BTM, all documents shorter than two words are excluded from model training, which leaves 6,993 documents. Similarly for WNTM, all documents shorter than the window size S are excluded from topic modelling. By setting S = 3 in this work, the ratio between average document length and window size is similar to the one used in the original work by Zuo et al. (2016). This leads to 5,776 documents for model training. Later, topics can also be inferred for the documents that are excluded from model training in BTM and WNTM.
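The method-specific length filter amounts to a simple selection step, sketched here on three hypothetical tokenized responses; the helper name `training_subset` is an assumption.

```python
def training_subset(docs, min_len):
    """Keep only documents with at least min_len tokens."""
    return [tokens for tokens in docs if len(tokens) >= min_len]

# Hypothetical pre-processed responses
docs = [["easy"], ["great", "tools"], ["free", "open", "source"]]

btm_docs = training_subset(docs, min_len=2)    # BTM needs at least one biterm per document
S = 3                                          # window size used for WNTM in this work
wntm_docs = training_subset(docs, min_len=S)   # WNTM needs at least one full window
```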

Descriptive analysis
To get a clearer picture of the data, several descriptive analyses are conducted. After pre-processing, the documents contain between one and 160 words with an average length of 5.5 words, while 75% of the responses contain seven words or less. Recall that these values can differ for certain short text topic models due to model-specific pre-processing. For example, the minimum number of words per response will be two and three for BTM and WNTM, respectively. In general, one may question the minimum and maximum number of words per response. For example, a text of 160 words may not be regarded as short anymore; after all it is much longer than a tweet. In this study, we do not enforce pre-defined thresholds, unless required by a specific topic modelling method. Rather, we employ common text pre-processing techniques and proceed with the resulting document lengths. Given the scarcity of prior work dedicated to topic modelling from OE responses, we suggest that the application of a standard text pre-processing pipeline is suitable for this paper. Enforcing overall limits of the minimum and maximum number of words per response would require a systematic approach to set these limits. Developing corresponding methodology is a valuable goal for future research but beyond the scope of this paper, which seeks insight into the relative suitability of available short text topic models for OE response processing.
Aside from the document length, the distribution of the manual labels is of interest as they serve as a gold standard for the evaluation in this study. The pre-processed corpus includes 4,958 labelled documents for all methods. Most documents are assigned to only one label but there is also a significant share of documents with multiple labels (Figure 2). This supports the assumption that topic models that allow only one topic per document, as for instance used in Zhao et al. (2011) for tweets, are not suitable for OE responses. Aside from the number of labels per response, the overall importance of each label is relevant. The set of labelled responses shows an imbalanced label distribution, i.e. the share of responses assigned to each label differs significantly as depicted in Figure 3. It means that there are substantially more documents that provide information about some labels than others. Appendix A provides short descriptions of the labels.

Model implementation
The three short text topic models and LDA as benchmark are implemented using R, Python, Java, C++ and Bash. The infrastructure employed for data pre-processing, model fitting and evaluation consists of a personal computer with an Intel i7-6500U CPU, running Windows 10 with R version 3.4.2, Java Development Kit version 1.7 and Python version 3.5. LDA is trained using the R package topicmodels (Hornik & Grün, 2011). For the other three methods, published source code is used and adapted to the present application (e.g., hyperparameter settings and evaluation).
For each method, different hyperparameter settings are evaluated. Some authors (Lu, Mei, & Zhai, 2011; Yin & Wang, 2014) suggest smaller values for α within conventional LDA when applied to short text to improve performance compared to the common setting of α = 50/K. For instance, Yan et al. (2013) use α = 0.05 and Nguyen et al. (2015) use α = 0.1 when using LDA for short text. Moreover, Tang et al. (2014) propose smaller values for β when dealing with short text, for example β = 0.01 as set in Nguyen et al. (2015) and Yan et al. (2013). Therefore, it is assumed that rather small values for α and β are appropriate in this work. This implies that documents are associated with rather few topics (small α) and that topics are rather word-sparse and thus easier to distinguish from each other (small β). Guided by the parameter settings with the best performance in the original papers (Nguyen et al., 2015; Yan et al., 2013; Zuo et al., 2016), two values for each of the hyperparameters are implemented. For reasons of comparability, the values for α (for BTM: $\alpha_B$) and β are identical for all methods. Moreover, for each method, the number of topics K is varied from five to 50 with a step size of five. As the number of topics mentioned by respondents can change for different studies, this variation is important to understand how the models behave when K is small or large. The range for K is chosen based on the manual labels given. The lower boundary is very close to the original number of labels. Meanwhile, the upper boundary is a trade-off between a sufficiently large value to observe a trend based on varying K while sustaining the feasibility of a manual inspection of topics. Table 3 summarizes the hyperparameter settings and the resulting number of models trained per method. This amounts to 200 models in total. Parameter inference is done via Gibbs sampling with 1,000 iterations for all models. Finally, Figure 4 summarizes the overall architecture of the experiments.
Note on responses with zero labels: These are not unlabelled responses. Here, the researchers decided that they could not assign the responses to any of the nine labels. So, they assigned them to the previously mentioned "other" category.
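The total of 200 models follows from the size of the hyperparameter grid, which can be checked with a short sketch. The concrete values in `ALPHAS`, `BETAS` and `LAMBDAS` are placeholders, since the actual settings are listed in Table 3; only the number of values per hyperparameter (two each) and the range of K are taken from the text.

```python
from itertools import product

K_VALUES = range(5, 51, 5)     # K = 5, 10, ..., 50 (ten values)
ALPHAS = (0.05, 0.1)           # placeholder values, two settings as in the text
BETAS = (0.01, 0.1)            # placeholder values, two settings as in the text
LAMBDAS = (0.6, 1.0)           # placeholder values, LFLDA only

def grid(method):
    """All hyperparameter combinations trained for one method."""
    base = list(product(K_VALUES, ALPHAS, BETAS))
    if method == "LFLDA":
        return [dict(K=k, alpha=a, beta=b, lam=l) for (k, a, b) in base for l in LAMBDAS]
    return [dict(K=k, alpha=a, beta=b) for (k, a, b) in base]

counts = {m: len(grid(m)) for m in ("LDA", "LFLDA", "BTM", "WNTM")}
total = sum(counts.values())
```

Three methods contribute 10 × 2 × 2 = 40 models each and LFLDA contributes 10 × 2 × 2 × 2 = 80, which gives 200 in total.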

Performance measurement
Lau, Newman, and Baldwin (2014) and Chang, Boyd-Graber, Gerrish, Wang, and Blei (2009) suggest that topic models have two main use cases, direct human consumption and text preparation. The former case entails a manual analysis of extracted topics to interpret their meaning while in the latter case another text processing algorithm, for example a text classifier, operates on the basis of the extracted topics. In this paper, both perspectives are relevant.
First, the topics must be sufficiently clear for exploratory purposes (in the following referred to as quality of topics). A statistically reasonable topic is not necessarily regarded as meaningful by a human (Newman, Karimi, & Cavedon, 2009). Some topics (e.g., "advertisement, targeting, audience, viral, brand") may be perceived as more interpretable than others (e.g., "company, time, easy, app, tools"). A common approach is to evaluate the quality of topics by considering their top ten words, i.e., the ten words that are most likely to be drawn from each topic (Newman, Lau, Grieser, & Baldwin, 2010). This procedure is also used here.
Second, the topics need to contain enough information to represent the documents appropriately (in the following referred to as topical document representation). This is required to deduce the topic distribution, i.e., the share of responses mentioning each topic. It is common practice to evaluate the topical document representation based on the performance of topic models on extrinsic tasks like document clustering or classification (Blei et al., 2003; Nguyen et al., 2015; Yan et al., 2013; Zuo et al., 2016).
Both dimensions, quality of topics and topical document representation, are evaluated in this paper using a quantitative as well as a qualitative approach for each. The quantitative approaches make it possible to objectively compare the topic modelling methods. Meanwhile, the qualitative approaches complement the quantitative evaluation by gaining a deeper insight into some selected examples of topics or topic models. The latter also allows expert knowledge to be integrated. Table 4 summarizes how the model evaluation will be conducted for the four resulting combinations of dimension and approach.
The dual evaluation approach of assessing extracted topics from a quantitative and qualitative angle is beneficial to obtain a comprehensive picture of the potential of short text topic models. However, the evaluation approach also has implications that need to be acknowledged. On the one hand, the quantitative assessment requires OE responses to have undergone manual labelling. The assessment then translates into comparing manual to algorithmically generated labels. The qualitative evaluation, on the other hand, requires the involvement of market research experts to judge extracted topics and compare the outputs of different short text topic models to one another. Therefore, the quantitative and qualitative evaluation both enforce sharp constraints on the type and amount of data that can possibly be considered in the study. As explained above, we have access to roughly 5,000 OE responses gathered from a recurring survey between December 2014 and July 2017. Expanding the amount of data were desirable but is prohibited by the strict requirements of the evaluation approach. This also implies that research findings and conclusions are limited to the specific type of OE responses employed in the study while a replication of the empirical analysis to test external validity is left to future research.

Table 4. Model evaluation approach.
Quality of topics, quantitative
[…] (Denny, 2017)); the closer the score to zero, the higher the indicated coherence
Benefits: No external information needed, high correlation with human judgement (Lau et al., 2014; Mimno et al., 2011)
References: (Yan et al., 2013; Zuo et al., 2016)
Topical document representation, quantitative
Goal: Compare all topic models with regards to topical document representation
Metric: F1 score for document classification with Support Vector Machines (SVM) (Manning et al., 2009; Van Rijsbergen, 1979)
Calculation: Fit a binary classification task for each of the nine labels (dependent variable) where the document-topic probabilities $\theta_d$ are the independent variables (Manevitz & Yousef, 2001), using SVM as a classifier (implementation with the R package caret (Kuhn, 2008)); calculate performance metric F1 score per classification task and average over all tasks to get a single metric per topic model (Manning et al., 2009)
Benefits: Metric is common in information retrieval (Van Rijsbergen, 1979), SVM have shown to be effective in text classification (Manning et al., 2009)
References: (Blei et al., 2003; Nguyen et al., 2015; Yan et al., 2013; Zuo et al., 2016)
Quality of topics, qualitative
Goal: Understand the usefulness of exemplary topics by leveraging expert knowledge
Procedure: Two experts from the partner company independently interpret eight topics (two topics per method), label them and compare them to each other without knowing which topic is produced by which method
Topical document representation, qualitative
Goal: Investigate if the topic distribution of exemplary topic models on a corpus-level is a good approximation to the distribution of the manual labels (Figure 3)
Procedure: First, for K = 10 and K = 20, the topic models with the best quantitative performance are chosen for further investigation (these values of K are chosen together with the experts as K = 10 is close to the original number of labels and K = 20 approximately represents the number of sub labels the experts see in the data; this is to see how K affects the performance on topical document representation). Then, for both topic models, the topics are matched with the manual labels and a topic $z$ is assigned to a document $d$ if the document-topic probability is larger than a threshold $t$ (using different values for $t$); based on this allocation, the topic distribution is calculated and compared to the label distribution

Quality of topics: quantitative evaluation
For each value of K, four models are trained for each of LDA, BTM and WNTM, using different hyperparameter combinations of α and β. For LFLDA, eight models are trained, as this method additionally includes the hyperparameter λ, for which two values are also used. Figure 5 gives an overview of the coherence scores produced for the different methods. The closer the coherence score is to zero, the higher the topic coherence averaged over all topics produced by a topic model. The scores for all trained models are reported in Appendix B. The lines in Figure 5 show the best scores reached by each method across all hyperparameter settings. These show that no method significantly outperforms the others for K ≤ 10. In contrast, for K ≥ 15, BTM achieves the highest scores and its advantage increases with K.
Yet, the lines only show the best coherence scores produced by each method. To examine if the superiority of BTM depends on a certain hyperparameter setting, the shaded areas in Figure 5 depict the ranges of scores per method that are produced by the different parameter settings. The boundaries of the shaded areas equal the scores for the best (upper boundary) and the worst parameter combination (lower boundary) for each K. The figure shows that the performance of BTM is less sensitive to different parameter settings compared to the other methods, meaning that the coherence scores achieved by the best and the worst models differ less. Yet, it must be noted that twice as many hyperparameter settings are implemented for LFLDA, which limits the comparability to the other methods' ranges. However, there is no hyperparameter combination that consistently produces the best results for any method (Appendix B).
Another interesting observation is the downward trend of all methods' scores with an increasing number of topics. One possible reason is that all topics are generally worse when K is high. Another explanation could be that there are still good topics but as there is only a limited number of topics in the corpus, increasing the value of K leads to more nonsense topics with very low coherence scores. Eventually, this decreases average coherence scores. To investigate this, Figure 6 depicts for every method and every K the scores of the most and least coherent topics over all models. Notably, the best scores produced by all methods show no dependence on the number of topics. This means that regardless of the value of K, there is still at least one relatively good topic. In contrast, the scores of the least coherent topics decrease remarkably with K. Both observations indicate that topic models with a high number of topics still produce good topics but the larger K, the more incoherent topics are produced which decreases the average scores. In summary, the quantitative evaluation of topic coherence indicates that BTM produces on average more coherent topics regardless of the hyperparameter setting. Apart from that, it is hard to recognize a difference between LDA, LFLDA and WNTM. For some values of K, LDA even outperforms LFLDA and WNTM although the differences are comparatively small. Moreover, the results show that the different numbers of topics reveal valuable insights since K influences the model ranking as well as the absolute coherence scores.
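To illustrate how a corpus-internal coherence score and its average, best and worst values per model can be obtained, the following is a minimal Python sketch. It assumes a coherence of the form proposed by Mimno et al. (2011), which is based on document co-occurrence counts from the target corpus; it is not the implementation used in this study, and the function names and the toy corpus are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """Coherence of one topic from document co-occurrence counts (assumed form)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # +1 smoothing avoids log(0) for word pairs that never co-occur.
            # Every top word is assumed to occur in at least one document.
            joint = doc_freq(top_words[m], top_words[l]) + 1
            score += math.log(joint / doc_freq(top_words[l]))
    return score


def summarize_model(topics, documents):
    """Average, best and worst topic coherence of one fitted model."""
    scores = [umass_coherence(words, documents) for words in topics]
    return {
        "mean": sum(scores) / len(scores),
        "best": max(scores),
        "worst": min(scores),
    }


# Toy corpus of tokenized responses (made up for illustration).
docs = [
    ["easy", "use"],
    ["easy", "use", "fast"],
    ["easy", "fast"],
    ["many", "users"],
    ["users", "reach"],
]
# One topic whose words co-occur and one whose words never co-occur.
topics = [["easy", "use"], ["easy", "users"]]
summary = summarize_model(topics, docs)
```

In this toy example, the topic whose words never co-occur pulls the mean away from zero while the best topic is unaffected, which mirrors the pattern described above for large K.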

Quality of topics: qualitative evaluation
This section explores topic interpretability from a qualitative perspective. To this end, the opinions of two domain experts are collected and compared to the quantitative coherence scores. Only the models for K = 20 are considered for the qualitative evaluation. This value is chosen based on two criteria: First, it is relatively close to the number of original labels, which is nine. This increases the likelihood that the topic granularity is similar to the one the experts are used to. Second, as seen in Figure 5, BTM increasingly deviates from the other methods when K increases. For K = 20, there is already a notable distance between the score of BTM and the remaining methods. This helps to examine whether the experts' perception of differences in topic coherence is consistent with the quantitative scores. For each method and K = 20, the model with the highest average coherence score is considered. These are also the ones depicted by the lines plotted in Figure 5. Table 5 shows the eight topics that are evaluated by the two experts, together with their coherence scores. The word lists are ordered by topic-word probability, i.e., the first word in each list is most likely and the last word least likely to be generated by the respective topic. Many words appear in every method (e.g., "easy" for topic A) but only a few words are unique to one method. Further, the unique words tend to be positioned at the end of the lists, meaning that the topics are even more similar when focusing only on the top words. Regarding the coherence scores, there is another interesting finding: The least coherent topic in the table is topic B of LFLDA and the most coherent one is topic B of BTM. However, both topics contain seven identical words at the beginning and differ only in the ordering and the last three words.
The experts evaluate the topics separately, but their opinions hardly differ. First, both state that all topics are generally understandable. Regardless of the methods, they interpret the topics as follows: Topic A is about good documentation and user-friendliness and topic B about the large user base of the platform. Both regard topic B as more coherent and useful than topic A because they see two separate themes (documentation and user-friendliness) in topic A, which from their perspectives should belong to two separate topics. Meanwhile, topic B covers only a single theme and is therefore regarded as more coherent. This is not in line with the coherence scores, which indicate a higher coherence for topic A for LDA, LFLDA and WNTM and very similar scores for BTM. Moreover, one expert highlights the last two words of topic B of LDA ("can", "way"), which he regards as confusing in this context. In contrast, he likes the words "potential" and "wide" within LFLDA and WNTM and thinks they make the topic even clearer. This is again inconsistent with the coherence scores, which indicate a higher coherence for LDA than for LFLDA. For topic A, one expert expresses a slight preference for LDA and the other one for LDA and LFLDA. However, they call it a gut feeling rather than a reasoned decision. For topic B, they state that the topics except for LDA are so similar that they cannot name a preference between BTM, LFLDA and WNTM (see note 7). To compare the topics, the experts mainly focus on the last words in the lists although these are less representative of the topics than the first words. However, the experts' approach is understandable because the last words are those that differentiate the methods from each other. It can be questioned whether the order in which the words appear in the topics really matters or whether the words are more or less equally likely to be drawn from the topics.
To investigate this based on an example, the topic-word distributions for topic B for BTM and LDA are explored. These two topics are of special interest regarding their last words, as mentioned above: First, topic B of BTM achieves a notably higher coherence score than topic B of LFLDA although it differs only in the last three words. Second, one expert highlights the inappropriateness of the last two words of topic B of LDA ("can" and "way"). Figures 7 and 8 show for both topics that the words at the beginning of the lists are significantly more likely to be drawn from a topic than those at the end of the lists. Admittedly, a comparison of the topic-word distributions for all topics would allow a more complete and generalizable interpretation. But the two examples already show that one should be careful not to put too much weight on the last terms in the top word lists.
In summary, the qualitative evaluation shows that experts who are familiar with OE response coding regard the exemplary topics as interpretable. Further, the results imply that the qualitative evaluation is not always in line with the quantitative coherence score. For instance, the clear superiority of BTM reflected in the quantitative scores is not reproduced by the expert judgements. Although it is not the purpose of this section to prove or disprove the reliability of the coherence score, previous results suggest that one should not have blind faith in it. Moreover, the investigation of the topic-word probabilities implies another interesting finding. Although it is common practice to look at lists of the top ten words when interpreting topics (Newman et al., 2010), approaches for topic visualization may need to be rethought. As seen in Figures 7 and 8, the first terms in the top word lists should be weighted more strongly than the last terms, but humans might be unable to weight terms accordingly when interpreting a topic.

Quality of topical document representation: quantitative evaluation
For the evaluation of topical document representation, binary classification tasks are trained for each of the nine labels. To this end, the document-topic probabilities θ_d of each model are used as independent variables to predict the manually assigned labels (dependent variable) for each response. This approach facilitates examining whether the topic models contain enough information to assign each response to the correct manual labels. Many algorithms such as logistic regression are available for training a binary classifier. For this study, Support Vector Machines (SVM) are chosen as they have been shown to be very effective for text classification tasks (Manning et al., 2009). To compare the topic models, the F1 score is used, which is a common metric for evaluating information retrieval (Van Rijsbergen, 1979). It measures how accurately the classifier predicts the positive cases, i.e., the cases where the manual label was assigned to a response (Manning et al., 2009). First, the F1 score is computed per classification task, i.e., per label, and then averaged over all labels to get one overall score for each topic model. Figure 9 gives an overview of the average F1 scores produced by the four methods. The scores for all models are found in Appendix B.
The lines in the figure depict the best F1 score reached by each method. They show that LDA achieves the lowest scores for 80% of the data points. Moreover, at each data point there is at least one model that performs better than LDA. For K ≥ 15, WNTM achieves the highest scores and its advantage over the other methods mainly increases with K. Aside from the method comparison, the graph shows that, with few exceptions, a higher number of topics leads to a higher F1 score for all methods.
As the lines in Figure 9 only present the highest F1 scores achieved by each method, it can be questioned whether the superiority of WNTM depends on a certain hyperparameter setting. Hence, the shaded areas in the figure show the ranges of F1 scores for each method where the lower boundary indicates the lowest score achieved by a method and the upper boundary the highest one. The ranges achieved by BTM and WNTM are comparatively stable across all values of K, while LDA and LFLDA depend more strongly on the parameter setting. Hence, it cannot be deduced that the superiority of WNTM depends on a certain parameter setting. Moreover, there is no parameter setting for any method that always achieves the best performance.
Figure 9. Best F1 score per method averaged over all labels (lines) and range of average F1 scores per method produced by different hyperparameter combinations (shaded area).

So far, the F1 scores are averaged over all labels. However, as mentioned in chapter 2.1, some topic models like LDA struggle with topic imbalance, which often leads to the incapability to identify rare topics. As the label distribution in this study shows a notable imbalance (see Figure 3), we also investigated the methods' classification performance per label (see Appendix C). It can be observed that the scores differ remarkably between the labels and there seems to be a positive relation between the popularity of a label and the classification performance when predicting the same. For instance, a WNTM model achieves an F1 score of 0.7946 (best score across all values of K) for the label "Usability" which occurs in 30.19% of the responses. Meanwhile, the best F1 score achieved by WNTM for the label "Login", which is mentioned by 5.08% of the respondents, is only 0.5409. The same trend can be observed across all methods.
Altogether, WNTM achieves the best classification performance in terms of F1 score in most cases. When comparing the methods based on metrics that are averaged over all labels, one has to take into consideration that the classification performance differs notably between the nine labels. Overall, the labels that are frequently mentioned are predicted more accurately than the ones that are rarely mentioned.

Quality of topical document representation: qualitative evaluation
This section reports to what extent the topic distribution is consistent with the label distribution, regardless of whether each document is assigned to the right topic or not.
The two models considered for that are BTM with α_B = 0.05 and β = 0.1 for K = 10 (F1 score: 0.4871) and WNTM with α = 0.1 and β = 0.1 for K = 20 (F1 score: 0.4975). These are the models with the highest F1 scores for the respective values of K (see Appendix B). Starting with BTM and K = 10, Table 6 presents the model's ten topics including top words. Each topic is assigned to one of the nine labels in coordination with two experts from the partner company. For most topics, the allocation is made based only on the top words, while for a few topics that were less clear some top documents are considered to gain more insight into the topics. Topic 10 cannot clearly be assigned to any label, even after reading some top documents. Further, no topics are available for the labels "Features", "Business" and "Data". In addition, there are some topics that seem to include two labels. For example, topic 1 entails words that indicate both labels "Usability" and "Documentation". However, based on the finding from the qualitative evaluation of topic quality that the topic-word probability drops significantly the later a word appears in the topic, more weight is put on the first words here. Therefore, topic 1 is assigned to "Usability".
Based on that allocation, Figure 10 shows the shares of documents that are assigned to each label via the document-topic distribution of the model mentioned above. The exact values are also reported in Appendix D. Four different thresholds t ∈ {0.18, 0.21, 0.24, 0.27} to calculate the label distributions are reported here. Aside from the three labels mentioned above that are not present at all, the distributions derived via the thresholds differ in several points from the original label distribution. None of the thresholds leads to the same label distribution as the original one. Even when looking at single topics, there are only a few relatively close matches. Regardless of the exact values, none of the thresholds leads to the correct ranking of labels that could reveal the relative importance of the topics compared to each other.
In the following, the same results are presented for the second exemplary model, namely WNTM and K = 20. Table 7 shows that this time each label is assigned to at least one topic.
The label distribution based on these topic allocations is depicted in Figure 11 (see Appendix D for the exact values). None of the thresholds t ∈ {0.12, 0.15, 0.18, 0.21} produces the same label distribution as the original labelling. Yet, for many labels the approximation achieved by t = 0.15 is close to the original shares. Compared to Figure 10, the ranking of the labels is much better represented.
Altogether, this section provides insights on how well the topic distribution represents the original label distribution. First, the mapping of topics and labels demonstrates that in most cases the top words are sufficient to assign a topic to a label. In the remaining cases, the top documents provide further insights that facilitate the allocation. After that, the topic distributions are calculated for two models via different thresholds. The BTM solution with K = 10 does not cover all labels. In addition, the distribution of the covered labels differs significantly from the original one. However, the WNTM solution with K = 20 covers all labels and the label ranking is very similar to the original one, even if the exact distribution cannot be reproduced by any threshold.
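The threshold-based allocation used in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function name, the document-topic probabilities and the topic-to-label mapping are hypothetical, and a topic without a matching label (such as topic 10 of the BTM model) is mapped to None.

```python
def label_shares(doc_topic, topic_to_label, threshold):
    """Share of documents assigned to each label for a given threshold t.

    A topic z is assigned to a document d if its document-topic probability
    is larger than the threshold; the topic is then translated into the
    manual label it was matched with. Shares need not sum to 1 because a
    document can receive several labels or none at all.
    """
    counts = {}
    for theta_d in doc_topic:
        labels = {
            topic_to_label[z]
            for z, prob in enumerate(theta_d)
            if prob > threshold and topic_to_label[z] is not None
        }
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {label: count / len(doc_topic) for label, count in counts.items()}


# Hypothetical document-topic probabilities for four responses, three topics.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.30, 0.30, 0.40],
    [0.10, 0.10, 0.80],
    [0.25, 0.70, 0.05],
]
# Topic 2 could not be matched with any manual label.
topic_to_label = ["Usability", "Reach", None]

shares_low = label_shares(doc_topic, topic_to_label, threshold=0.21)
shares_high = label_shares(doc_topic, topic_to_label, threshold=0.27)
```

In this toy example the share of "Usability" drops from 0.75 to 0.5 when the threshold is raised from 0.21 to 0.27, while "Reach" stays at 0.5, which shows how the choice of t can change the resulting ranking of labels.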

Discussion
The use of text analytics is not yet considered an alternative to human coding, for which there are several good reasons. First, an initial investment is required for tasks like finding the right algorithm and preparing the data before getting any insights. Further, topic modelling becomes significantly more accurate with an increasing number of responses (Tang et al., 2014), but many market research studies suffer from a small number of respondents (G. Lockot, personal communication, September, 2017). Content-wise, topic modelling is inferior to human analysis in several ways. Usually, topic models are not capable of discovering topics that are very detailed (Aggarwal & Zhai, 2012) or that show up rarely in the responses (Roberts et al., 2014). Moreover, it is easier to discover explicitly mentioned opinions with key words than implicitly described ones. While humans are mostly able to classify implicit mentions correctly based on common knowledge, algorithms usually are not (Liu, 2012). Additionally, methods that are based on word co-occurrences and the "bag-of-words" assumption have the limitation that semantics are ignored (Le & Mikolov, 2014). This further complicates the identification of implicit topics and, for example, of negation. Further, complex language such as metaphors and humour is hard to analyse automatically without human intervention (Graesser & McNamara, 2012).

Still, topic modelling provides many opportunities for the analysis of OE responses. The most obvious one is saving time and money (Roberts et al., 2014). Especially on a large scale, the up-front costs can pay off quickly. Moreover, topic modelling can also add value with regard to content. For instance, it facilitates the analysis of corpora where researchers cannot build upon any prior knowledge. Further, it can help to reduce several human biases. First, algorithms, in contrast to humans, identify topics objectively and do not assume them (Roberts et al., 2014). Second, algorithms provide consistency, the lack of which is a major drawback of human coding. It is well known that different human raters do not provide consistent results (between-rater variance) (Tinsley & Weiss, 1975). And even if all responses were analysed by the same researcher, there would still be inconsistencies as humans, for instance, get tired or bored (Graesser & McNamara, 2012).
The applicability of topic modelling was investigated from different perspectives in this work to gain an overall impression. The first part of the results focused on whether the topics were clear enough to be used for exploratory purposes. A quantitative coherence score was used to compare the methods, where BTM mostly achieved the best performance. To the best of the authors' knowledge, however, there is no absolute threshold that differentiates a coherent from a non-coherent topic, and therefore the metric is used for relative comparisons only. Moreover, it was shown that the ranking based on this metric was not always consistent with expert judgements. The results indicate that what makes up a high coherence score and what is perceived as clear and useful by researchers can be different. It can also be questioned whether the chosen coherence score is suitable for the present dataset. The fact that it only uses the target corpus is certainly advantageous in some respects. But the downside is that the calculation suffers from the lack of co-occurrence patterns in a corpus of short texts. Further, it has been shown that the interpretation of topics should focus on the first words in the top word lists. Overall, the experts from the partner company assessed the exemplary topics as clear and helpful.
The second part of the results focused on whether the responses were accurately represented by the topics, which was again investigated from two perspectives: First, document classification and respective metrics were used to explore whether the topics provided enough information to predict the right labels for each response. Second, it was examined to what extent the distribution of the original labels could be reproduced by using the topic distributions. WNTM achieved the highest classification performance in 80% of the cases. Yet, the performance achieved by all methods was only moderate (highest F1 score over all models: 0.5474, see Appendix B). It must be noted that the results indicate a relation between classification performance and the frequency with which a label is mentioned. The prediction is substantially more accurate for frequent than for rare topics. Even WNTM, for which the authors claim that it is capable of handling topic imbalance (Zuo et al., 2016), showed this relation.
The moderate performance on classification tasks does not imply that topic modelling is useless for the analysis of OE responses. Discussions with experts have confirmed that it was much more important to get a suitable topic distribution over the entire response set than a correct one-to-one mapping of responses and topics. Two exemplary models that showed comparatively good results on classification were explored in that regard. For example, a WNTM model with 20 topics produced very promising results: All original labels and even important sublabels could be identified. Although the original ranking of the labels could not be entirely reproduced, the big picture was correct aside from some exceptions.
Overall, it has been observed that the number of topics negatively affects the average topic quality and positively affects the average classification performance. However, it is assumed that a larger number of topics does not generally lead to less coherent topics. Rather it is plausible that only a limited number of topics is available to be identified and therefore the higher the number of topics, the higher the number of nonsense topics. Further, Yan et al. (2013) mention that a small number of topics usually leads to very general topics that are hard to distinguish while a larger number of topics produces more specific ones. To make sure that all relevant topics are identified, it is thus recommended for OE responses to choose a rather high number of topics and sort out the meaningless ones. In doing so, the researcher has the chance to recognize the small and specific topics and can still decide to combine them to a larger one.
In summary, the current work has shown that topic modelling bears high potential for the analysis of OE responses but does not provide a stand-alone solution. The experts from the partner company state that an automatic approach for exploration and a good approximation for the label ranking would already be a major gain for many studies. Certainly, this work only focused on one dataset and the opinion of experts from one company. An investigation on a larger scale would be interesting for future research.
Aside from the general usefulness of topic modelling for OE responses, this study's second focus is on the comparison of the four implemented methods. Through the implementation of short text topic models, it was possible to achieve better results than produced by the benchmark method LDA. BTM mainly achieved the best performance for topic coherence and WNTM for document classification. LFLDA produced very similar results to LDA and has the disadvantage that it depends on the availability and quality of external data. Finding a suitable external corpus is an additional effort required by LFLDA. While in this study the vocabulary is almost entirely represented by the chosen vector set, this could be an additional challenge for studies with a very domain-specific vocabulary. The studies that contributed the short text topic models considered here compare their respective innovation to LDA (Nguyen et al., 2015; Yan et al., 2013; Zuo et al., 2016). Systematic comparisons of several short text topic models to one another, on the other hand, are scarce. Therefore, our analysis of alternative short text topic models in the specific context of OE response processing expands the body of knowledge with original empirical evidence, which may be regarded as a more general contribution to the academic literature.

Conclusions
OE questions enjoy great popularity in market research studies but are associated with a very laborious and error-prone analysis called human coding. In this paper, we investigated the potential of four different topic models to be used as an alternative for human coding. Although it depends on the practical requirements whether topic modelling can replace the traditional approach, the study shows that topic models are very helpful for data exploration as well as topic ranking. Especially the dedicated short text topic models BTM (Yan et al., 2013) and WNTM (Zuo et al., 2016) achieve promising results. This provides a starting point for further research.

Notes
1. In the following, the corpus size refers to the number of documents.
2. To avoid misunderstanding, the corpus-topic distribution in BTM is labeled as α_B in the following.
3. In the following, method refers to the four topic modelling approaches implemented in this work (LDA, LFLDA, BTM, WNTM) while model refers to each fitted model instance of the methods with e.g., different hyperparameter settings.
4. The vector set is downloaded from https://nlp.stanford.edu/projects/glove/.
5. Note: The numbers in this figure do not add up to 100% as a document can be assigned to multiple labels.
6. Source code of LFLDA: https://github.com/datquocnguyen/LFTM. Source code of BTM: https://github.com/xiaohuiyan/BTM. Source code of WNTM: http://ipv6.nlsde.buaa.edu.cn/zuoyuan.
7. As a reminder: The experts do not know which topic belongs to which method.

Disclosure statement
No potential conflict of interest was reported by the authors.

Appendices

Appendix A. Label descriptions
The table provides a short description for each label.

Label           Description
[...]           The platform is easy to use
Reach           The platform reaches many users
Documentation   The documentation is easy to understand and the customer support is good
Features        The platform provides good features
Satisfaction    The platform is generally satisfactory without further specifying it
Must-have       The platform is widely accepted and thus inevitable
Business        Using the platform provides business opportunities
Data            The platform provides interesting insights
Login           The login process works well

Appendix B. Model performance metrics
The table provides the coherence score (averaged over all topics) and the F1 score (averaged over all labels) for all models fitted in this study.