S_I_LSTM: stock price prediction based on multiple data sources and sentiment analysis

Stock price prediction is a research hotspot with great promise and great challenges. Many stock price prediction methods have been proposed recently, but their prediction accuracy is still far from satisfactory. In this paper, we propose a stock price prediction method, called S_I_LSTM, that incorporates multiple data sources and investor sentiment. First, we crawl multiple data sources from the Internet and preprocess each of them. These data comprise stock historical data, technical indicators, and non-traditional data sources such as stock posts and financial news. Then, we apply a sentiment analysis method based on a convolutional neural network to the non-traditional data to calculate the investors' sentiment index. Finally, we combine the sentiment index, technical indicators and stock historical transaction data as the feature set for stock price prediction and adopt a long short-term memory network to predict prices in the China Shanghai A-share market. The experiments show that the predicted closing price is closer to the true closing price than when a single data source is used, and the mean absolute error reaches 2.386835, which is better than traditional methods. We verified the effectiveness of the method on real data sets of five listed companies.


Introduction
With the continuous development of deep learning (Liang et al., 2020) and blockchain (Liang et al., 2020;Liang et al., 2021), stock prediction has also become a research hotspot in the financial field. Predicting stock prices or trends has always been a long and arduous task because of the characteristics of the market. Early theories on stock market prediction were divided into the efficient market hypothesis (EMH) and random walk theory. The EMH stated that historical information cannot help in predicting the next day's stock prices (Fama, 1964). According to the random walk theory, future stock prices are independent of past prices and stock price changes are random (Malkiel, 1973). Additionally, the EMH held that current stock prices are not related to previous prices and are influenced by financial news. Thus, when an emergency occurs, the prediction results are still far from satisfactory (Li et al., 2017). Later, stock investors and academic researchers found that they could forecast the movement of stocks based on historical data of the listed company. For example, Box and Jenkins introduced the autoregressive integrated moving average (ARIMA) model in 1970, and many scholars have used it as a benchmark method for time series forecasting (Brockwell & Davis, 2015). Ariyo et al. (2014, march 26-28) proposed a stock price prediction method based on the ARIMA model, and their experiments demonstrated better results in short-term prediction. Idrees et al. (2019) analysed the time series data of the Indian stock market and built a statistical model, which achieved efficient stock prediction. Conventional prediction methods are suitable for capturing regular and structured data. When the stock price is strongly affected by financial text on the Internet, their prediction accuracy is significantly reduced.
With the rapid development of big data and artificial intelligence, text mining and deep learning have also become research hotspots and have been applied to stock prediction. Li et al. (2018, january 29-february 2), Shah et al. (2018, december 10-13), and Yun et al. (2019, september 11-13) explored the impact of emergencies reported in financial news on stock trends and showed that non-quantitative data, like quantitative data, has a significant impact on the financial market. Moreover, Li et al. (2018, january 29-february 2), Oliveira et al. (2016), Sun et al. (2017, august 10-11), and Wang (2017) explored the impact of opinions from social media and how investor sentiment affects stock movements. Lin and Hsu (2014) proposed an emerging multi-agent architecture, grounded on cooperative learning, to solve the class-imbalanced classification problem, and their results indicate that the model performs satisfactorily in risk management. Wang et al. (2020) proposed a mixed utility model to empirically test the relationship among accounting conservatism, corporate governance, comprehensive index, and stock price collapse risk.
Among the existing methods, some approaches predict stock prices based only on stock historical data and seldom introduce unstructured text data from the financial field. Although some methods consider the role of non-traditional data, they investigate only financial news or only social media information (Checkley et al., 2017;Li et al., 2018, january 29-february 2;Oliveira et al., 2016;Sun et al., 2017, august 10-11;Wang, 2017;Zhou et al., 2020). To overcome these limitations, our goal is to predict the prices of five stocks in China's Shanghai A-share market with multiple data sources and calculate the error of the predicted prices. We first combine historical stock data, technical indicators, stock forum posts and financial news. Then, we apply text sentiment analysis based on a convolutional neural network (CNN) to calculate the investors' sentiment tendency. Finally, we exploit the advantages of long short-term memory (LSTM) in processing time series data to predict the stock price. The experimental results show that the proposed method can fit multi-source data well and achieve low error. Our contributions include three aspects:
(1) An S_I_LSTM framework is designed by incorporating multiple data sources and investors' sentiment.
(2) A sentiment analysis method based on CNN is proposed to calculate the investors' sentiment index.
(3) An LSTM network with an attention mechanism is proposed to predict the stock price.
The rest of this paper is organised as follows. Section 2 reviews the related work. Section 3 presents the proposed scheme. Section 4 discusses the experiments and results, followed by the conclusion in Section 5.

Related work
In this section, we will introduce sentiment analysis and give a brief description of CNN and LSTM neural network, which are used in the proposed method.

Sentiment analysis for stock prediction
The conventional view in time series analysis is that stock fluctuations are unpredictable. However, behavioural finance theories show that investors' irrational investment behaviour affects the rise and fall of stock prices (Ritter, 2003). In particular, with the growing popularity of social networks, investors often express their views on social platforms, which affects the mood of other investors and guides their decision-making. Existing research has shown a correlation between stock price movements and investor sentiment (Statman, 2011). In addition, some scholars have found that forum information on social networks (Wang, 2017) and financial news related to the company (Vargas et al., 2017, june 26-28) can also influence stock prices. Thus, analysing the sentiment tendency of financial text data is valuable for stock prediction. For example, Porshnev et al. (2013) analysed stock market indicators by integrating a lexicon-based Twitter sentiment analysis method with historical data. Their experiments showed that the machine learning method can predict the DJIA and S&P500 indicators well. Later, Li et al. (2014) created a new sentiment dictionary to analyse the sentiment expressed in financial news and then explored the impact of financial news on stock price returns. Sohangir et al. (2018) combined sentiment analysis of financial Stocktwits posts with different deep learning-based stock prediction methods, and their experiments showed that CNN is the best model for analysing the stock price. Jiawei and Murata (2019, march 13-15) analysed the sentiment of financial news and verified that market sentiment is a very important factor in forecasting stock trends. Then, Mohan et al. (2019) explored how to obtain high-quality training data sets from financial news to improve prediction accuracy. At the same time, Xu and Keselj (2019, december 9-12) constructed a tweets dataset for stock market prediction. Their experiment verified the time sensitivity of financial tweet sentiment.

Convolutional neural network
The CNN model was proposed by Krizhevsky et al. (2017) and has been widely used in image recognition (Srivastava & Biswas, 2020;Ying et al., 2021). The structure of a CNN mainly includes an input layer, convolution layer, pooling layer, fully connected layer and output layer. The CNN selects the feature set used for classification mainly through the convolution kernels in the convolution layer. Kim (2014) first applied CNN to text classification and achieved good results. Selvin et al. (2017, september 13-16) used three different deep learning architectures, namely LSTM, RNN and CNN, for price prediction and compared their performance; their experiments showed that the CNN sliding-window model performed better and had a lower percentage error. After that, Lee and Soo (2017, december 1-3) proposed a recurrent convolutional neural network (RCN) that combined the advantages of convolutions, sequence modelling and word embedding for stock price analysis and information extraction from financial news. Chen and He (2018) analysed the Chinese stock market based on CNN. Sayavong et al. (2019, august 10-11) investigated different layers in CNN and applied it to stock prediction in the financial field.

LSTM neural network
Historical stock transaction data has a strong time correlation, and general deep learning methods do not consider contextual information when processing time series data. The LSTM neural network was proposed by Hochreiter and Schmidhuber (1997). It is a variant of the recurrent neural network (RNN) and handles both long-term and short-term dependencies well. The LSTM model includes three gates, the "input gate", "forget gate" and "output gate", and it performs better than a standard RNN (Sundermeyer et al., 2012, september 1). The input gate decides which new information is put into the cell state. The forget gate determines what to discard from the cell state and what to retain, which alleviates the vanishing gradient problem. Finally, the output gate controls how much information is output. Figure 1 shows the LSTM network architecture, which models sequence data well and combines the information retained from the previous state to process the current task. Because LSTM is effective at processing time series data, many works have adopted the LSTM network for stock market prediction. For example, Li et al. (2017) adopted the LSTM neural network and incorporated investors' sentiment tendency for CSI300 index prediction. Vargas et al. (2017, june 26-28) effectiveness of the CH-RNN. Jiawei and Murata (2019, march 13-15) proposed a stock prediction method based on the LSTM network, and their experiment showed that a recurrent neural network with LSTM can handle financial time series data better than traditional time series prediction methods. Eapen et al. (2019) proposed a novel deep learning model that combined multiple pipelines of CNN and bi-directional LSTM units, and their experiments showed that it could improve prediction performance by 9% over a single-pipeline deep learning model. Wei (2019, october 17-19) adopted LSTM for predicting the stock price and optimised the network with the MBGD algorithm. Xu and Keselj (2019, december 9-12) predicted stock market movements using attention-based LSTM and compared its performance with conventional LSTM.

Methodology
As shown in Figure 2, we propose a framework using Att-LSTM model for stock price prediction based on sentiment analysis and multiple data sources (S_I_LSTM). Following is the detailed description of the three key models: (1) technical indicator calculation model, (2) sentiment index calculation model, (3) stock prediction model.

Technical indicators calculation model
In this section, we introduce how to calculate the technical indicators, which mainly involves historical transaction data preprocessing and technical indicator calculation. Both are briefly introduced below.

Historical transaction data preprocessing
Before calculating the technical indicators, we need to preprocess the traditional data sources, which includes cleaning and filtering out meaningless data to improve data quality. We choose the stocks of five listed companies from EastMoney.com and crawl their historical trading data from the same site. These transaction data include trading day, stock code, stock name, opening price, closing price, highest price, lowest price, adjusted closing price and trading volume. We remove the unnecessary information and keep the trading date, opening price, highest price, lowest price, closing price and trading volume.

Technical indicators calculation
The technical analysis method analyses the stock price fluctuation of a company according to its historical trading data and charts, and technical indicators are often used in this method. In this section, three technical indicators are selected: the stochastic oscillator index (%K), the William index (%R) and the relative strength index (RSI). The stochastic oscillator index (%K) reflects the relationship between the price range and the closing price in a given period of time. The William indicator (%R) is mainly used to measure whether the market is oversold or overbought. The RSI is well suited to capturing short-term volatility of stock prices. These three technical indicators measure the timing of overbought and oversold conditions to a certain extent and can also affect the decision-making of investors. Usually, investors buy stocks when the market is oversold and sell stocks when the market is overbought, so these technical indicators can also affect the fluctuation of stock prices. The three indicators are calculated from the historical trading data of stocks using TA-Lib, a common library for quantitative trading in Python. Their calculation is shown in formulas (1), (2), (3) and (4), written here in their standard forms (the sign of %R follows the TA-Lib convention, so that %R lies between -100 and 0):

$$\%K = \frac{C - L_t}{H_t - L_t} \times 100 \tag{1}$$

$$\%R = \frac{H_t - C}{H_t - L_t} \times (-100) \tag{2}$$

$$RSI = 100 - \frac{100}{1 + RS} \tag{3}$$

$$RS = \frac{\text{average of } x \text{ days' up closing prices}}{\text{average of } x \text{ days' down closing prices}} \tag{4}$$

where $C$ is the closing price, $H_t$ and $L_t$ respectively denote the highest price and the lowest price over the last $t$ days, $RS$ is the ratio of the average closing price increase to the average closing price decrease over $x$ days, and the value of $x$ is set to 7 in our experiments.
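The three indicators can be illustrated with a short plain-Python sketch. It is not the TA-Lib implementation used in the experiments: the function names are hypothetical, %R uses the negative sign convention (an assumption), and RS is computed from simple averages of gains and losses over the last x daily changes, whereas TA-Lib applies a smoothed average, so the values will generally differ slightly.

```python
def _window(highs, lows, closes, t):
    """Return (highest high, lowest low, latest close) over the last t days."""
    if t < 1 or min(len(highs), len(lows), len(closes)) < t:
        raise ValueError("need at least t >= 1 days of data")
    return max(highs[-t:]), min(lows[-t:]), closes[-1]


def stochastic_k(highs, lows, closes, t):
    """Stochastic oscillator %K: position of the close inside the t-day range."""
    h, l, c = _window(highs, lows, closes, t)
    return 100.0 * (c - l) / (h - l)  # undefined (division by zero) if h == l


def williams_r(highs, lows, closes, t):
    """Williams %R in [-100, 0] (sign convention is an assumption)."""
    h, l, c = _window(highs, lows, closes, t)
    return -100.0 * (h - c) / (h - l)


def rsi(closes, x=7):
    """RSI from simple averages of gains and losses over the last x changes."""
    if len(closes) < x + 1:
        raise ValueError("need at least x + 1 closing prices")
    recent = closes[-(x + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    avg_gain = sum(d for d in changes if d > 0) / x
    avg_loss = sum(-d for d in changes if d < 0) / x
    if avg_loss == 0:
        return 100.0  # no down days in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

With x = 7 as in the experiments, each RSI value needs eight consecutive closing prices, so the first seven trading days of a series have no RSI under this sketch.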

Sentiment index calculation model
In Section 3.1, we introduced the technical analysis method and three indicators based on it. In this paper, the fundamental analysis method is also used. This method judges a stock based on internal and external factors of the company, such as politics, news, the economy, interest rates, exchange rates, and the listed company's operating status. With the increasing popularity of social media platforms, and in line with the results of previous studies, financial news (Mohan et al., 2019;Yadav et al., 2019) and social network information (Porshnev et al., 2013;Sun et al., 2017, august 10-11;Zhao et al., 2016) can also influence investors' decisions and stock prices. Thus, we incorporate the fundamental analysis method to investigate the potential impact of financial news and social platform data on stock prices. The analysis involves text data preprocessing, document labelling, text sentiment analysis based on CNN and sentiment index calculation. Each step is introduced in detail below.

Text data preprocessing
Stock price prediction is an extremely complex task because many factors affect it. Therefore, we need to capture as many of these factors as possible so that we can understand the stock characteristics from many aspects and achieve better prediction results. From the unstructured financial text data, this paper selects financial news and stock forum posts related to five listed companies and uses web crawler technology to collect sufficient data. According to Vargas et al. (2017, june 26-28), news headlines are more informative than the news body. Therefore, we take news headlines as the research object. Text data preprocessing is a key step in sentiment analysis, and it yields high-quality data sources. Most objects in natural language processing are sentences, so sentences need to be divided into single words. Punctuation marks are also removed because they carry no practical meaning in sentences. In English text, spaces between words serve as separators, but in Chinese, only sentences and paragraphs have obvious separators, and there is no space between words. We use the Jieba word segmentation algorithm to segment the unstructured data in the financial field. Jieba is a powerful word segmentation library implemented in Python. We use its precise mode to segment every news title and stock forum post in the financial text data set. Then the stopword dictionary of Harbin Institute of Technology is used to filter out words that have no actual effect in sentences, which gives the word set after segmentation.
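The preprocessing steps (punctuation removal, segmentation, stopword filtering) can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments: the function name `tokenize` and its `segment` parameter are hypothetical, and the stopword set is assumed to have been loaded beforehand from a stopword file. When no segmenter is passed, the sketch falls back to `jieba.lcut`, whose default behaviour is Jieba's precise mode.

```python
import re


def tokenize(text, stopwords, segment=None):
    """Split one headline or post into words and drop stopwords.

    segment: callable mapping a string to a list of words. If omitted,
    jieba.lcut (precise mode by default) is imported lazily and used.
    """
    if segment is None:
        import jieba  # third-party; only needed for Chinese text
        segment = jieba.lcut
    # Replace every character that is neither a word character nor
    # whitespace (ASCII and full-width punctuation) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return [w for w in segment(cleaned) if w.strip() and w not in stopwords]
```

Passing the segmenter as a parameter keeps the filtering logic testable without Jieba installed; for example, `tokenize("sales rise, profits fall!", {"profits"}, segment=str.split)` exercises the same path on English text.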

Document labelling
The purpose of sentiment analysis is to mine the opinions expressed by investors in speech or text, and it can be divided into three research levels: word level, sentence level, and document level. Sentiment classification at the document level analyses the overall sentiment tendency or polarity. Before using deep learning for sentiment analysis, we need to label each news item and forum post to build the training data sets. Document labelling classifies financial text as positive or negative based on historical transaction data. Minh et al. (2018) mentioned two ways to label news articles based on historical data: open-to-close (daytime return) and close-to-open (overnight return). According to Wang et al. (2009), the daytime return contributes more to the total return than the overnight return. Thus, we also adopt the labelling technique based on the open-to-close price. The open-to-close price return is calculated as follows:

News and forum posts label
where $O_{a+t}$ is the opening price $t$ days after day $a$, and $t$ is the time period. For example, if day $a$ is 1 October 2019 and the time period $t$ is 2, then $O_{a+t}$ is the opening price on 3 October 2019. $C_a$ is the closing price of the stock on day $a$. If $R_{at}$ is greater than or equal to 0, the news and forum posts on that day are labelled as positive, whereas if $R_{at}$ is less than 0, they are labelled as negative.
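The labelling rule can be sketched in a few lines of Python. The sketch takes the return for each day as given and applies only the sign rule stated above; the function name and the dictionary-based data layout are hypothetical, and days without a computed return are simply skipped, which is an assumption.

```python
def label_texts(returns_by_day, texts_by_day):
    """Attach a sentiment label to every text published on a trading day.

    returns_by_day: dict mapping a day key to that day's return R.
    texts_by_day:   dict mapping a day key to a list of headlines/posts.
    A return >= 0 labels all texts of that day positive, otherwise negative.
    """
    labelled = []
    for day, texts in texts_by_day.items():
        if day not in returns_by_day:
            continue  # no return available for this day (assumption: skip)
        label = "positive" if returns_by_day[day] >= 0 else "negative"
        labelled.extend((text, label) for text in texts)
    return labelled
```

Because every text published on the same day receives the same label, the labels are noisy at the level of individual texts, which is why the classification is checked again with the CNN in the next step.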

Sentiment analysis based on CNN
The sentiment index measures the overall sentiment tendency of the public on each day and is calculated from the classification results of the non-traditional data. However, before calculating it, we still need to check whether the label assigned to each news item or forum post in the document labelling step is correct. Since the structure of a CNN model for text classification differs from that used for image classification, we feed word vectors initialised with Word2vec into the input layer of the CNN model. Word2vec is a continuous word embedding learning model with two training models: CBOW (continuous bag-of-words) and skip-gram (Mikolov et al., 2013). The obvious difference between the two is whether the model predicts the current word or its context. In this paper, we use skip-gram to pre-train the high-dimensional representation of each word in a sentence on the initial stock corpus. The word vectors are then used as the input to the CNN. The CNN model used for sentiment analysis of financial texts consists mainly of an embedding layer, three convolutional layers, a max pooling layer, a fully connected layer, a "softmax" layer and an output layer. Figure 3 shows the structure of the CNN model, using the sentence "Sales increased due to growing market rates" as an example. Feature vectors trained with Word2vec are used to initialise the embedding layer. At the embedding layer, the word vectors obtained by Word2vec are assembled into an n×k sentence matrix. The fixed sentence length is n, and sentences shorter than n are padded with 0. The word vector dimension is k, so each word has a word vector of fixed dimension. The convolution operator then extracts text features: we use a convolutional layer with a convolution kernel to extract features from the word vectors.
In the training process, we also use the dropout method proposed by Hinton et al. (2012) to prevent the model from over-fitting. The pooling layer, like the convolutional layer, extracts features, except that it selects the maximum value of each region as the feature. The classification layer consists of the fully connected layer and the "softmax" classifier. In this way, every day's news and forum posts are classified by the CNN, which verifies the correctness of each individual label. Table 1 shows the inner structure of the text CNN for sentiment analysis.
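To make the data flow concrete, the sketch below pads a sentence to an n×k matrix, slides one convolution kernel over it, and applies max pooling. It covers a single kernel only and omits the embedding lookup, the three stacked convolutional layers, dropout and the softmax classifier. The function names, the kernel height and the use of ReLU as the activation are assumptions made for illustration, not details taken from Table 1.

```python
def pad_sentence(vectors, n, k):
    """Truncate or zero-pad a list of k-dimensional word vectors to n rows."""
    rows = [list(v) for v in vectors[:n]]
    rows += [[0.0] * k for _ in range(n - len(rows))]
    return rows


def conv_feature_map(matrix, kernel, bias=0.0):
    """Slide an h x k kernel down an n x k sentence matrix.

    Each window of h consecutive word vectors gives one feature:
    ReLU(sum of elementwise products + bias). ReLU is an assumption.
    """
    h, k = len(kernel), len(kernel[0])
    features = []
    for i in range(len(matrix) - h + 1):
        s = bias + sum(matrix[i + r][c] * kernel[r][c]
                       for r in range(h) for c in range(k))
        features.append(max(0.0, s))
    return features


def max_pool(feature_map):
    """Max pooling over the whole feature map: keep the strongest response."""
    return max(feature_map)
```

In a full model, many kernels of this kind run in parallel, and their pooled outputs form the vector passed to the fully connected layer and the softmax classifier.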

Sentiment index calculation
After an event occurs, the public comments on the company on social media, and news platforms report on it. These texts affect stock movements to a certain extent. After the sentiment analysis of financial texts, each news item or post is classified as positive or negative. However, this is only the result of text classification. We need to calculate the overall sentiment tendency of the public on a given day based on the numbers of positive and negative texts. According to Li et al. (2017), the sentiment index is calculated based on the ratio of the sum and the difference of the numbers of positive and negative texts, which yields a highly accurate feature set. We follow their method and construct sentiment feature sets based on non-traditional data sources. The sentiment measure is defined by formula (7), where $M_t^{pos}$ is the total number of positive news items and forum posts on day $t$, and $M_t^{neg}$ is the total number of negative news items and forum posts on day $t$. The sentiment index ranges between -0.5 and +0.5, and a sentiment index below 0 means that the sentiment on day $t$ is negative.
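As a hedged illustration, the sketch below computes one daily index that is consistent with both the verbal description (a ratio built from the difference and the sum of the positive and negative counts) and the stated range of -0.5 to +0.5. The factor of 2 in the denominator and the value returned for a day without texts are assumptions; the sketch is not a transcription of formula (7).

```python
def sentiment_index(n_pos, n_neg):
    """Daily sentiment index in [-0.5, 0.5] from positive/negative counts.

    Assumed form: (n_pos - n_neg) / (2 * (n_pos + n_neg)), which equals the
    share of positive texts minus 0.5. Days without any text return 0.0
    (neutral), which is also an assumption.
    """
    total = n_pos + n_neg
    if total == 0:
        return 0.0
    return (n_pos - n_neg) / (2 * total)
```

Under this form, a day with 3 positive texts and 1 negative text gets an index of 0.25, and a day with only negative texts gets -0.5.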

Constructing the matrix of prediction feature sets
We want to explore whether investor sentiment and technical indicators affect stock price movements. The selected feature sets contain nine-dimensional vectors, which are the open price, highest price, lowest price, close price, volume, sentiment index, stochastic oscillator (%K), William (%R), RSI. They are calculated according to the above formula (1), (2), (3), (4) and (7). We show the data of the last five trading days in the test set in Table 2 and the output of the prediction model is the closing price.

Stock price prediction based on Att-LSTM
We regard stock price prediction as a regression problem rather than a classification problem. When we model the data sets with a deep neural network, the label is the closing price, and the predicted result is also the closing price. In detail, for a given time $t$ and a given stock $s$, we predict the closing price $C^s_t$ of stock $s$ on day $t$. Our prediction feature set is obtained by fusing the two modules above and involves three dimensions of technical indicators, one dimension of sentiment index, and five dimensions of historical transaction data. As shown in Table 2, the prediction data set matrix that we constructed has an obvious chronological order, so we adopt the LSTM model to predict the closing price. The LSTM network can also capture relationships in sequence data with long time intervals and delays. According to Kraus and Feuerriegel (2017), LSTM networks are widely used in the financial field. As shown in Figure 4, the LSTM model for stock price prediction consists of four parts: input layer, LSTM layer, attention layer and output layer. In the training stage, the inputs to the model are the open price, highest price, lowest price, volume, sentiment index, stochastic oscillator (%K), William (%R) and RSI, and the closing price is the label. In the testing stage, there are only these eight dimensions of data, which are obtained by fusing the previous two modules. The computation at the input layer is shown in formula (8):

$$i_t = \sigma\left(W_i \cdot [h^s_{t-1}, X^s_t] + b_i\right) \tag{8}$$

where $X^s_t$ is the input matrix at time $t$ of stock $s$, $h^s_{t-1}$ is the hidden output of the LSTM layer of stock $s$ at moment $t-1$, which is 0 at the beginning, and $W_i$ and $b_i$ denote the weight matrix and the bias of the input gate, respectively. The symbol $\sigma$ represents the "sigmoid" function.

Figure 4. The structure of S_I_LSTM model for stock price prediction.

Table fragment (layer, output shape, parameters): (None,20,32), 0; Flatten_1, (None,640), 0; Dense_2, (None,2), 641.
Next is the LSTM layer, which is composed of many LSTM neurons. Each neuron processes the new input together with the information passed on by the previous neuron to determine which information is output and which is passed to the next neuron. Formulas (9) and (10) calculate the hidden output $H^s_t$ of the LSTM layer, that is, $h_1, h_2, \ldots, h_n$ in Figure 4, where $C_t$ is the cell state computed by each LSTM neuron and $W_c$ and $b_c$ denote the weight matrix and the bias, respectively. The output of the LSTM layer is multi-dimensional, and the fully connected layer reduces its dimensionality to produce a suitable output. The activation function used in this layer is "sigmoid". However, the inputs at different moments contribute differently to the predicted closing price. We therefore add an attention layer after the LSTM layer, which assigns higher or lower weights to the LSTM-layer features. The last part is the output layer, whose output is the predicted closing price. Table 3 shows the internal structure of the LSTM network with the attention layer.
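The role of the attention layer can be sketched as a weighted pooling over the hidden outputs h_1, ..., h_n. The sketch uses a simple dot-product score followed by a softmax, which is one common form of attention and is an assumption here; the function name and the `score_weights` vector (which would be learned during training) are hypothetical, and the exact attention layer of the model is the one summarised in Table 3.

```python
import math


def attention_pool(hidden_states, score_weights):
    """Weight the LSTM hidden states and sum them into one context vector.

    hidden_states: list of n hidden vectors h_1..h_n (equal length).
    score_weights: vector of the same length, used to score each h_i.
    Returns (alphas, context): softmax weights and the weighted sum.
    """
    scores = [sum(h_j * w_j for h_j, w_j in zip(h, score_weights))
              for h in hidden_states]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
               for d in range(dim)]
    return alphas, context
```

Time steps whose hidden state scores higher receive a larger weight in the context vector, which is then passed to the output layer to produce the predicted closing price.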

Data description
In our experiments, we selected the stocks of five listed companies from EastMoney.com as the research object. Table 4 shows the detailed information. The dataset used in this work consists of 2351 news articles and 33,500 forum posts over 3377 transaction days from EastMoney.com, a professional internet financial media site in China, covering the period from 01 July 2017 to 30 April 2020. We also crawl the historical stock trading data of the same period. The traditional stock historical data source contains 3377 rows, and each row contains six columns of data including the trading day. Vargas et al. (2017, june 26-28) verified that features extracted from news headlines are more useful for prediction than those extracted from news content. Following this conclusion, we keep only the headlines of financial news in the collected data set. In addition, we select the data from 01 July 2017 to 31 December 2019 as the training set and the data from 01 January 2020 to 30 April 2020 as the test set.
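The chronological split described above can be sketched as follows. The row layout (a tuple whose first element is the trading date) and the function name are assumptions made for illustration; the cut-off date is the one stated in the text.

```python
from datetime import date


def chronological_split(rows, test_start=date(2020, 1, 1)):
    """Split rows into train (before test_start) and test (on or after it).

    rows: iterable of tuples whose first element is a datetime.date.
    The order of the rows is preserved, so no future data leaks into training.
    """
    train = [r for r in rows if r[0] < test_start]
    test = [r for r in rows if r[0] >= test_start]
    return train, test
```

Splitting by date rather than at random matters for time series, because a random split would let the model train on days that come after some of the test days.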

Performance evaluation metric
Prediction of the future can be divided into two categories: movement prediction and future value prediction. In stock market prediction, these correspond to classification problems and regression problems, respectively, and each has its own evaluation indicators. In this paper, we view stock prediction as a regression problem and predict the closing price. The most commonly used performance measures are the mean absolute error (MAE), mean square error (MSE) and root mean square error (RMSE):

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|f_i - y_i\right| \tag{11}$$

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(f_i - y_i\right)^2 \tag{12}$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(f_i - y_i\right)^2} \tag{13}$$

where $f_i$ is the predicted value, $y_i$ is the true value, and $n$ is the number of samples.
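The three error measures can be computed directly from paired lists of predicted and true closing prices. The following is a minimal plain-Python sketch using the standard definitions of MAE, MSE and RMSE; the function names are chosen for illustration.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true values."""
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(actual)


def mse(predicted, actual):
    """Mean square error between predicted and true values."""
    return sum((f - y) ** 2 for f, y in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(predicted, actual))
```

MAE and RMSE are in the same unit as the closing price, while MSE is in squared price units and penalises large individual errors more heavily, which is one reason the three measures can rank feature sets differently.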

Experimental results and performance comparison
In this section, the effect of the four data sources on five stocks is verified with the S_I_LSTM prediction model. We use the three evaluation indicators in formulas (11), (12) and (13) to measure the performance on the test set under different conditions. We conducted the following experiments.

Comparison of the effect on multi-source datasets and single source dataset
First, since people are often irrational when investing, it is difficult to rely on a single traditional data source to simulate stock trends or prices. Therefore, it is necessary to compare the potential impact of historical transaction data, technical indicators, financial news, and stock forum posts on the stock closing price. The results are shown in Table 5. When only a single data source is used as the feature set to predict the closing price of a stock (for example, only historical transaction data, only technical indicators or only the sentiment index), the MAE is higher than with the multi-source data set. Moreover, when only the sentiment tendency of investors is considered, the MAE is higher than when only historical transaction data is used. This confirms the original assumption that investor sentiment can only assist decision-making and cannot serve as the sole basis for investment. The sentiment index needs to be combined with historical transaction data to achieve better prediction results, because not all investors invest rationally. When multiple data sources are used, the mean absolute error reaches 2.386835, which is lower than that of the three single data sources by 0.082532, 0.072659 and 0.120507, respectively. However, in terms of MSE, the prediction based on technical indicators has a lower error than the other feature sets, which indicates that technical indicators are also an important influencing factor and that the prediction results need to be judged by considering all three error measures together. In short, this experimental comparison supports using multi-source data to assist investors' decision-making.

Comparison with other similar methods
Secondly, it is necessary to verify the proposed S_I_LSTM model, which uses stock historical data, technical indicators and sentiment analysis results. The results are shown in Table 6. When multiple data sources and sentiment analysis are combined to predict the closing price of stocks, the MAE of S_I_LSTM reaches 2.386835. Compared with Zhang et al. (2019), our method reduces the MAE by 0.654165. They drew on the successful experience of generative adversarial networks (GAN) and proposed a structure for predicting stock prices that adopted a multilayer perceptron (MLP) as the discriminator and LSTM as the generator. Jin et al. (2019) proposed a stock market prediction model based on LSTM that took into account both the investors' sentiment tendency and the EMD method for decomposing stock history data. Their method also considered the investors' sentiment index, but it used only a single non-traditional data set, namely investors' comments on the Stocktwits website. The LSTM neural network combined with the sentiment analysis results was then used to predict the stock closing price. The MAE of their method is 0.009286 higher than that of the method proposed in this paper. Experiments with the proposed method on real data sets of five listed companies in China's A-share market show that combining historical stock data, technical indicators, financial news, social media data and other multi-source data sets can help investors make decisions. At the same time, the results of the S_I_LSTM stock prediction framework indicate that investor sentiment and technical indicators are factors that affect stock price trends.

Comparison of the stocks of five listed companies
Finally, we compared the closing price predictions for the stocks of five listed companies; the results are shown in Figure 5. In the figure, the x-axis represents the 80 days of multi-source data in the test set, and the y-axis represents the stock closing price of the listed company. The red solid line represents the actual closing prices of the five listed companies, and the blue dotted line labelled "S_I_LSTM" shows the predictions of the LSTM model with sentiment index and technical indicators. At the beginning of the test period, the model fits the data very poorly. Subgraph (a) clearly shows that from the 1st to the 5th day the predicted closing price starts from a small value and rises, which indicates that the model is gradually capturing the pattern of daily stock changes. In subgraph (b), the predicted closing price starts too high and falls over time. This also shows that at the beginning of the period the amount of data is small and the model's ability to capture features is insufficient; after the fifth day, as the amount of data gradually increases, the predictive ability of the model continues to improve. Moreover, all five listed companies show a simultaneous downward trend between days 50 and 60 of the test period. This is because the novel coronavirus disease (COVID-19) outbreak in early 2020 caused a severe decline in the stock market. During this period there was much negative news about listed companies, and the predicted stock prices also showed a downward trend. Comparing the effectiveness of the S_I_LSTM prediction framework on the individual stock data sets, subgraphs (a)-(e) show that the model fits the multi-source data sets of the five listed companies well, which also confirms the impact of company-related financial news and forum posts on the prediction results. What is more, our research could be built into a stock price prediction system. Such a system could predict the rise and fall of stocks in advance and help investors decide when to trade.

Figure 5. The five companies' closing price prediction results.

Conclusions
In this paper, a novel S_I_LSTM framework was proposed for stock price prediction. We discussed the impact of traditional data sources (stock historical transaction data and technical indicators) and non-traditional data sources (stock posts and financial news) on stock price prediction. We also investigated whether technical indicators have an impact on stock prediction. We proposed a deep learning method to analyse China's Shanghai A-share market based on multiple data sources, incorporating investor sentiment and technical indicators into the stock price prediction. The proposed method can also provide investors with investment advice to guide actual investment, and therefore has practical significance. However, owing to limited time and research capacity, this paper still has limitations that need to be addressed in future research. For example, limited consideration was given to the labelled training data. In addition, the cycle and granularity of stock prediction are factors that will be considered in future work, which would increase the practical value of the research.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by Natural Science Foundation

Data availability statement
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.