Selective transfer learning with adversarial training for stock movement prediction

Stock movement prediction is a critical issue in the field of financial investment. It is very challenging since a stock usually shows a highly stochastic property in price and has complex relationships with other stocks. Most existing approaches cannot jointly take these two issues into account and thus cannot yield satisfactory prediction results. This paper contributes a new stock movement prediction model, Selective Transfer Learning with Adversarial Training (STLAT). Our STLAT method advances existing solutions in two major aspects: (i) tailoring the pre-training and fine-tuning method for stock movement prediction and (ii) introducing a data selector module to select more relevant training samples. More specifically, we pre-train the shared base model using three different tasks. The predictor task is constructed to measure the performance of the shared base model with source domain data and target domain data. The adversarial training task is constructed to improve the generalisation of the shared base model. The data selector task is introduced to select the most relevant and high-quality training samples from stocks in the source domain. All three tasks are jointly trained with a combined loss function. As a result, the pre-trained shared base model can be fine-tuned with the stock data in the target domain. To validate our method, we perform back-testing on the historical data of two public datasets and a newly constructed dataset. Extensive experiments demonstrate the superiority of our STLAT method. It outperforms state-of-the-art stock prediction solutions by 3.76%, 4.12% and 4.89% in accuracy (ACC) on ACL18, KDD17 and CN50, respectively.


Introduction
Stock movement prediction can be regarded as a classification task, e.g. whether the price will go up or down at the next time step based on past stock market data and events. The stock market is a complex and dynamic system containing substantial stochastic noise, which makes an accurate prediction of stock movement difficult. Although the stock price does not follow a random walk process (Lo & MacKinlay, 1988), predicting the future tendency of a stock has been proven to be a highly challenging problem. There are many factors (e.g. company news, industry performance, investor sentiment and economic factors) that may affect the movement of a stock price. Nonetheless, stock movement prediction has always caught the attention of many investors and researchers since a successful prediction can yield huge profits.
In the literature, there have been many efforts to learn and predict stock movement (Qin et al., 2017; Xu & Cohen, 2018). However, these methods may suffer from weak generalisation due to the following two factors: (i) the highly stochastic property (Feng, Chen et al., 2019) of stock prices makes it difficult for a model to learn proper signal representations. This stochasticity may be caused by company news, industry performance, investor sentiment, economic factors, etc., and all these unexpected indicators (i.e. noises) weaken the generalisation of the learned model. (ii) These methods usually apply the model to a single stock or a group of stocks. However, information from other related stocks is lost if we only use data from a single stock, while using a group of stocks without proper selection introduces additional noise into the model. For instance, stocks MSFT (Microsoft Inc.) and GOOGL (Alphabet Inc.) are in the same industry (technology) and sub-industry (computer software), and their price trends are similar in some periods of time but differ in others due to their own characteristics. Therefore, applying the model to one or more stocks without making a selection will also weaken the generalisation ability of the learned model.
To address the aforementioned problem of highly stochastic stock prices, various methods have been proposed in the past few years. For instance, Xu and Cohen (2018) and Ding et al. (2015) presented deep neural network models that jointly exploit event data (news, tweets) for the prediction task. The introduction of extra information can help stock movement prediction achieve good results. The work of Li et al. (2019) used stacked denoising autoencoders to alleviate the disturbance of market noise and exploited an LSTM layer to extract features. The work of Deng et al. (2017) introduced a fuzzy learning system on the input data to reduce noise. However, few studies were dedicated to introducing more training samples for the target stock. In particular, He et al. (2019) presented a novel approach to stock price movement prediction using transfer learning, in which the model was trained by incorporating related stocks that have a high similarity to the target stock. Nevertheless, the process of selecting related stocks does not fit well with the transfer model, because the similarity-based selection module is not jointly trained with the transfer model. To the best of our knowledge, no previous literature has presented an effective model that solves both problems with a high-quality data selector under a transfer learning framework for stock movement prediction.
In this paper, we propose a novel Selective Transfer Learning framework with Adversarial Training (STLAT) for stock movement prediction. An overview of the proposed STLAT is provided in Figure 1. The main working process is as follows: (i) the source batch from the source domain is fed into the Shared Base Model (SBM) to produce the hidden representation e^d. SBM is an attention-based LSTM network, illustrated in Figure 1(a). (ii) Three different tasks, shown on the left of Figure 1(b), act on e^d and generate the prediction results. More specifically, the data selector selects data from the source batch, aiming to introduce more beneficial samples to improve the prediction results for the target stock. The source predictor is introduced to measure the performance on the source batch. Adversarial training, which adds a slight perturbation to e^d, is applied to improve the generalisation of the SBM. (iii) The SBM is then updated with the selected training samples from the source batch. The target batch from the target domain is fed into the previously trained SBM to obtain the target predictor loss shown on the right of Figure 1(b), which is used to measure the effectiveness of the current data selector. (iv) The SBM is next jointly trained on the same source batch and the target batch with the total loss, which is the combination of the source predictor loss, adversarial training loss and target predictor loss. (v) Finally, the pre-trained SBM is transferred to the target model in the fine-tuning process, shown in Figure 1(c). All the end-to-end parameters of the target model are trained with the target batch from the target stock. Moreover, the test data of the target stock is evaluated on the target model.
The main contributions of this paper are threefold.
• We propose a novel selective transfer learning framework with adversarial training to address the problem of weak generalisation of the model in the field of stock movement prediction.
• We introduce two steps within the framework: pre-training and fine-tuning. A data selector is utilised to effectively choose source domain data for transfer learning. Furthermore, adversarial training is employed to improve the generalisation of the SBM.
• We conduct extensive experiments on two public benchmarks and a newly constructed dataset. Experiments demonstrate that our proposed framework outperforms a number of competitive baselines and achieves state-of-the-art performance.

Stock movement prediction
As a subset of time series prediction, stock movement prediction has attracted both investors and researchers. Methods used for time series prediction can be roughly divided into two categories: traditional linear predicting methods and nonlinear predicting methods. Linear predicting methods include auto regression (AR), moving average (MA), auto regressive integrated moving average (ARIMA) (Saboia, 1977) and so on. However, financial time series data is non-linear, which makes traditional linear predicting methods struggle to predict accurately. Recently, machine learning and deep learning techniques (D'Angelo et al., 2021; Kamilaris & Prenafeta-Boldú, 2018; Ying, Qian Nan et al., 2021) have been widely applied in many fields. In particular, Jiang (2021), Nti et al. (2020) and Wu, X. et al. (2021) have drawn growing attention in financial time series prediction tasks due to their capability in nonlinear mapping and generalisation (Chen & Tan, 2021; Feng, Chen et al., 2019; Khuwaja et al., 2021; Tran et al., 2019). For instance, Nti et al. (2021) present a novel multi-source information-fusion predictive framework for enhancing the accuracy of stock market prediction. Lin et al. (2021) present a model to enhance existing stock prediction with the ability to model multiple stock trading patterns. Meanwhile, several studies exploit extra market information, such as event information and social media information, for the prediction (Camacho et al., 2021; Ding et al., 2015; Jin et al., 2020; Wang et al., 2021; Xu & Cohen, 2018). For instance, Sawhney et al. (2021) used the valuable rich signals between related stocks' movements to construct a hyperbolic graph for algorithmic trading, and another line of work presents frameworks incorporating multiple data sources and investors' sentiment to predict stock prices. Our work falls into the nonlinear predicting methods without extra information.
In the field of stock movement prediction, fundamental analysis and technical analysis are the two basic methods. Fundamental analysis attempts to measure the intrinsic value of a stock by examining related economic, financial and other qualitative and quantitative factors (Bollen et al., 2011;Ding et al., 2015;Xu & Cohen, 2018). Technical analysis tends to take the historical market data to predict its future movement with advanced models in recent years (Feng, Chen et al., 2019;Li et al., 2019;Qin et al., 2017). Our work falls into the technical analysis.

Adversarial training
Adversarial training (Goodfellow et al., 2015) can reveal the defects of models and improve their robustness. The main idea of adversarial training is to inject adversarial examples designed by an adversary in order to increase the robustness of the model. Most adversarial examples are generated by adding a small perturbation to clean samples. Previous studies primarily applied adversarial training to image classification tasks (Goodfellow et al., 2015; Miyato et al., 2016; Xie et al., 2017). Recently, adversarial training has also been used in text classification (i.e. applying perturbations to the word embedding) (Miyato et al., 2017), recommendation (i.e. adding adversarial perturbations on model parameters to maximise the Bayesian personalised ranking objective function) (He et al., 2018), graph node classification (i.e. modifying the combinatorial structure of data) and multi-agent reinforcement learning. Meanwhile, Feng, Chen et al. (2019) explore the potential of adversarial training in stock price prediction, where small perturbations are added to the hidden representation to improve the generalisation of the model. Their experiments confirm the effectiveness of adversarial training for the stock movement prediction task. Therefore, to address the highly stochastic property of stock prices, we apply adversarial training to the prediction model. In contrast to Feng, Chen et al. (2019), we apply the Kullback-Leibler divergence cost function in adversarial training, which improves the prediction results.

Transfer learning
Transfer learning has been widely investigated in the past years (Cao et al., 2021; Weiss et al., 2016; Ying, Qiqi et al., 2021). Different from traditional machine learning methods, in which training data and test data must follow the same distribution, transfer learning can utilise knowledge from other domains for new, relevant domains (Ye & Dai, 2018). Existing research in domain adaptation has been successfully applied to fields including text sentiment classification (Wang & Mahadevan, 2011), image classification (Oquab et al., 2014; You et al., 2015), human activity classification (Harel & Mannor, 2011) and so on. Recently, BERT (Devlin et al., 2019) has also demonstrated the importance of transfer learning from large pre-trained models, where an effective method is to fine-tune the model. In the field of time series analysis, since each stock has limited daily data, transfer learning is key to applying deep learning techniques across a large number of different stocks. He et al. (2019) propose a similarity-based approach for selecting source datasets to train deep learning models with transfer learning, showing that transfer learning can be effectively used for financial time series forecasting. The study of Nguyen and Yoon (2019) proposes a deep transfer-based framework to predict stock price movement, demonstrating the effectiveness of transfer learning and of using stock relationship information to improve model performance. However, both of these approaches lack an effective data selection method. In particular, not all related stock data is suited for the movement prediction of the target stock.
In this paper, we propose an effective transfer learning framework incorporating an automatic data selection. The process of data selection is not pre-defined like previous studies. To the best of our knowledge, this paper is the first work to explore the potential of data selection which is jointly trained with transfer learning in the field of stock movement prediction.

Problem statement
The stock movement prediction task is to learn a prediction function Y = f(X, θ), where θ is the parameter and Y ∈ {−1, 1} denotes the stock movement. Specifically, given the sequential features X = [x_1, x_2, . . . , x_T], where x_i ∈ R^D and D is the variable dimension, we are interested in predicting the movement at a certain future moment, i.e. the movement at x_{T+h}, where h is the desired horizon ahead of the current time stamp. In our experiments, h is set to 1, which means predicting the stock movement on the next day. Moreover, the training samples are fed into the model in batches. We denote a specific training sample i in a batch by X_i and its prediction result by Y_i.
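The setup above can be sketched in code. The following is a minimal, illustrative construction of (X, Y) pairs from a price series; the function name, the two-feature design and the toy data are our own assumptions, not the paper's pipeline.

```python
import numpy as np

def make_windows(prices, feats, T=10, h=1):
    """Build (X, Y) pairs: X is a T-step window of D features,
    Y is the +1/-1 movement h steps after the window."""
    X, Y = [], []
    for t in range(len(feats) - T - h + 1):
        X.append(feats[t:t + T])
        # Movement label: did the price rise h steps after the window?
        up = prices[t + T + h - 1] > prices[t + T - 1]
        Y.append(1 if up else -1)
    return np.array(X), np.array(Y)

prices = np.array([10., 11., 10.5, 11.2, 11.1, 11.5, 11.4, 12.0])
feats = np.stack([prices, np.gradient(prices)], axis=1)  # D = 2
X, Y = make_windows(prices, feats, T=3, h=1)
# X has shape (n_samples, T, D); Y holds the movement labels.
```

Each row of X is one training sample X_i and each entry of Y is the corresponding Y_i.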

Shared base model (SBM)
As illustrated in Figure 1(b and c), model SBM in STLAT is an attention-based LSTM network. It is pre-trained with the source batch from source domain and fine-tuned with the target batch from target domain. Note that the proposed STLAT can integrate different shared base models. The attention-based LSTM network contains feature representation layer, LSTM layer and temporal attention layer.

Feature representation layer (FRL)
The feature representation layer is a feature learning technique which maps sequential features to a latent representation. Formally, we denote the input sequential features of a training sample i in a batch by X_i^d = [x_1^d, x_2^d, . . . , x_T^d], where d ∈ {s, t} represents the source domain or target domain and T is the number of latest time-steps. We employ a fully connected layer to map each x_t^d to a latent representation f_t^d = W_f x_t^d + b_f, where W_f and b_f are parameters to be learned. The main reason is that previous work (Wu et al., 2018) shows that a deeper input gate benefits the modelling of temporal structures by the LSTM in the next section.
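A minimal sketch of this layer follows. The tanh non-linearity and the dimensions D, E, T are our assumptions for illustration; the paper only states that a fully connected layer performs the mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, T = 11, 16, 5                 # input dim, latent dim, time-steps (illustrative)
W_f = rng.normal(size=(E, D)) * 0.1  # parameters W_f, b_f to be learned
b_f = np.zeros(E)

def feature_representation(x_seq):
    """Map each time-step's raw features to a latent vector
    with one fully connected layer (tanh is an assumption)."""
    return np.tanh(x_seq @ W_f.T + b_f)   # (T, D) -> (T, E)

f = feature_representation(rng.normal(size=(T, D)))
```

The resulting sequence f = [f_1^d, …, f_T^d] is what the LSTM layer below consumes.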

LSTM layer
LSTM (Hochreiter & Schmidhuber, 1997) is one of the most representative variations of the recurrent neural network (RNN) architecture. It has been widely used in time series modelling since it can overcome the problem of vanishing gradients and better capture long-term dependencies of time series (Li et al., 2019; Qin et al., 2017). For stock movement prediction, at each time-step, the LSTM is applied to learn a mapping from the latent representation f_t^d and the previous hidden state h_{t−1}^d to a hidden representation denoted by h_t^d = LSTM(f_t^d, h_{t−1}^d).

Temporal attention layer (TAL)
Although the LSTM layer can learn long-term dependencies, it struggles to determine which time instances are more important to the prediction. Recent work (Cinar et al., 2017) shows that an attention-based model can make use of the (relative) positions in the sequential inputs and perform better than a model without the attention mechanism. In order to learn the importance of each time instance for the stock representation, we use a temporal-attention-augmented LSTM that maps the hidden representations h_t^d to the aggregated representation a^d. This process can be expressed as follows: α_t = exp(u_a^⊤ tanh(W_a h_t^d + b_a)) / Σ_{k=1}^{T} exp(u_a^⊤ tanh(W_a h_k^d + b_a)) and a^d = Σ_{t=1}^{T} α_t h_t^d, where W_a, u_a and b_a are parameters to be learned. Motivated by Fama and French (2012), to generate quantile predictions from hidden states, we concatenate a^d with the last hidden representation h_T^d into the final latent representation denoted by e^d = [a^d; h_T^d].
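The aggregation step can be sketched as below. We use random vectors as stand-ins for the LSTM hidden states, and the exact parameterisation (tanh scoring with u_a, W_a, b_a) is one common form of temporal attention that we assume here.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H = 6, 8
h = rng.normal(size=(T, H))          # stand-in for LSTM hidden states h_t^d
W_a = rng.normal(size=(H, H)) * 0.1  # attention parameters to be learned
b_a = np.zeros(H)
u_a = rng.normal(size=H)

def temporal_attention(h):
    """Score each time-step, softmax-normalise, and aggregate."""
    scores = np.tanh(h @ W_a + b_a) @ u_a    # one score per time-step, shape (T,)
    w = np.exp(scores - scores.max())        # numerically stable softmax
    alpha = w / w.sum()                      # attention weights, sum to 1
    return alpha, alpha @ h                  # aggregated representation a^d

alpha, a = temporal_attention(h)
e = np.concatenate([a, h[-1]])               # e^d = [a^d ; h_T^d]
```

The concatenated vector e then plays the role of e^d, the final latent representation fed to the predictor and to adversarial training.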

Predictor
We use a fully connected layer as the predictive function for the classification task of stock movement prediction. Note that the training samples are fed into the model in batches, where the batch size is denoted by n. If the training samples [X_1, X_2, . . . , X_n] are from the source domain, we denote the prediction results by Ŷ^s = [Ŷ_1^s, Ŷ_2^s, . . . , Ŷ_n^s], where Y_i^l is the label of the stock price movement. Therefore, the source predictor loss ℓ_s is the cross-entropy between the predictions Ŷ^s and the labels.

Adversarial training
Adversarial training (Goodfellow et al., 2015) is a regularisation method for classifiers that improves robustness by adding a subtle perturbation to the inputs; it has met with success in image classification (Kurakin et al., 2017) and text classification (Miyato et al., 2017). These methods usually add the perturbation directly to the input image or text embedding to learn a robust model. However, doing so may break the sequential relation of the stock price across different LSTM units, thereby causing uncontrollable input. Considering that the model can absorb the noise by keeping its predictions consistent across different aggregated representations, we add the perturbation r_adv^d to the final aggregated representation e^d of the SBM. Thus the final adversarial example is a_adv^d = e^d + r_adv^d. The adversarial training follows the Kullback-Leibler divergence cost function motivated by Miyato et al. (2017), where ℓ_adv is a vector containing the loss of each training sample, r_adv^d = arg max_{r, ‖r‖ ≤ ε} KL[p(·|e^d; θ̂) ‖ p(·|e^d + r; θ̂)], KL[P‖Q] = Σ_x p(x) log(p(x)/q(x)) denotes the KL divergence between distributions P and Q, and ε is a hyper-parameter to control the scale of the perturbation. To approximate the perturbation, we employ the fast gradient approximation method (Goodfellow et al., 2015) with the L_2 norm constraint in the KL loss: r_adv^d ≈ ε g/‖g‖_2, where g is the gradient of the loss with respect to e^d. The perturbation and parameters can then be updated using back propagation in our proposed SBM.
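The fast gradient step can be illustrated with a stand-in linear predictor. The logistic unit, the cross-entropy gradient in place of the full KL objective, and all numeric values are our assumptions; the sketch only shows how a gradient is rescaled to an L2-ball of radius ε.

```python
import numpy as np

rng = np.random.default_rng(2)
E = 8
w = rng.normal(size=E)              # stand-in linear predictor weights
e = rng.normal(size=E)              # final aggregated representation e^d
y = 1.0                             # true label (up)
eps = 0.05                          # perturbation scale epsilon

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fast gradient approximation (Goodfellow et al., 2015): take the
# gradient of the loss w.r.t. e^d and rescale it to L2 norm epsilon.
# For a logistic unit the cross-entropy gradient is (p - y) * w.
p = sigmoid(w @ e)
g = (p - y) * w
r_adv = eps * g / (np.linalg.norm(g) + 1e-12)
e_adv = e + r_adv                   # adversarial example a^d_adv
```

By construction the perturbation has norm ε and moves the prediction away from the true label, which is exactly the property adversarial training penalises.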

Data selector
After performing the prediction task and the adversarial task on the source domain, we argue that not all source training samples are beneficial for improving the prediction performance of the target stock. Therefore, we propose a data selector to select the beneficial training samples. More specifically, the data selector is a fully connected layer on top of the SBM. Unlike the predictor task, the output of the data selector task is a mask vector m^s = [m_1^s, m_2^s, . . . , m_n^s] containing 1(0)s representing whether to select a training sample from the source batch. Figure 1(b) shows how the data selection module affects the source predictor and adversarial training. Formally, if a training sample X_i^s is selected to update the model SBM, the output m_i^s is 1. Here, we directly apply m to the losses ℓ_s and ℓ_adv by dot product, as ℓ_s · m and ℓ_adv · m, respectively. In other words, if a training sample is selected, its loss is added to the total source loss; otherwise it is excluded. At last, we denote the average of the total source loss by ℓ_se, calculated as ℓ_se = (ℓ_s · m^s + α ℓ_adv · m^s) / Σ_{i=1}^{n} m_i^s, where α is a hyper-parameter; ℓ_se is an average because a different number of source training samples is selected at each training step. The performance of the data selector is evaluated by the performance of the target batch from the target domain: if the target batch performs better after the model is updated with the selected source training samples, this shows the effectiveness of the data selector. Here, we add the target predictor loss to guide the learning of the data selector. The training details are described as follows.
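The masking arithmetic is simple enough to show directly. The mask, loss values and α below are made-up illustrative numbers, and averaging over the number of selected samples is our reading of the paper's description.

```python
import numpy as np

m = np.array([1, 0, 1, 1, 0])                    # selector mask over a source batch (n = 5)
loss_s = np.array([0.7, 0.2, 0.5, 0.9, 0.4])     # per-sample predictor losses (illustrative)
loss_adv = np.array([0.3, 0.1, 0.2, 0.6, 0.2])   # per-sample adversarial losses (illustrative)
alpha = 0.5                                      # hyper-parameter weighting the adversarial term

# Dot products zero out the losses of unselected samples; the total is
# averaged over the number of selected samples, which varies per step.
n_sel = m.sum()
loss_se = (loss_s @ m + alpha * (loss_adv @ m)) / n_sel
```

Only samples 0, 2 and 3 contribute here, so the divisor is 3 rather than the batch size 5.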

Pre-training process
At each training step, the pre-training of the proposed model works as follows: for a target stock, (i) the source batch from the source stocks is fed into the source predictor task, adversarial training task and data selector task, producing the prediction results Ŷ^s, Ŷ_adv and m^s. (ii) We then update the SBM with the loss ℓ_se in Equation (7). (iii) The same source batch and the target batch from the target stock are fed into the source predictor task, adversarial training task, data selector task and target predictor, producing another set of prediction results Ŷ^s, Ŷ_adv, m^s and Ŷ^t, respectively. The reason for steps (ii) and (iii) is that the target predictor loss should be calculated after the SBM has been trained with the selected training samples. (iv) At last, we compute the total objective function ℓ and update all the parameters of the pre-training model. ℓ is calculated as ℓ = α ℓ_se + β ℓ_adv + γ ℓ_t + λ‖θ‖_2^2, where α, β and γ are hyper-parameters that balance the different losses. More specifically, ℓ_se is given in Equation (7) and ℓ_adv in Equation (4). The last term is the L_2 norm constraint we introduce in the objective function.
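The final combination step can be sketched as a single function. The specific weights α, β, γ, the λ coefficient of the L_2 term and the toy parameter tensors are all assumptions; the paper leaves these to hyper-parameter tuning.

```python
import numpy as np

def total_loss(loss_se, loss_adv, loss_t, params,
               alpha=1.0, beta=0.5, gamma=1.0, lam=1e-4):
    """Weighted combination of the source loss, adversarial loss and
    target predictor loss, plus an L2 norm constraint on parameters."""
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return alpha * loss_se + beta * loss_adv + gamma * loss_t + l2

params = [np.ones((2, 2)), np.ones(3)]   # stand-in model parameters
L = total_loss(0.8, 0.3, 0.6, params)
```

In the real framework the gradient of this scalar with respect to all SBM parameters drives the joint update in step (iv).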

Fine-tuning
Fine-tuning straightforwardly reuses the parameters of the SBM, which are initialised after the pre-training process, as depicted by the red arrow in Figure 1. Meanwhile, we build a fully connected layer on top of the SBM as the target model to predict the labels. The training process minimises the cross-entropy loss between Ŷ^t and Y^l on the target stock's training data. Thus we fine-tune all the end-to-end parameters of the target model. Compared to pre-training, fine-tuning is relatively inexpensive, since the parameters are only slightly updated to fit the target stock's training data. Lastly, we use the target model to evaluate the test data from the target stock.

Training process
Our framework under the transfer learning schema is trained by the following procedure. We adopt the leave-one-out strategy to construct the source domain data for pre-training and the target domain data for fine-tuning. First, for an individual target stock in the dataset, all the remaining stocks except the target stock are utilised as source stocks. Second, in the pre-training stage, for each training batch from the source domain stocks, we randomly select a target training batch, because the number of source training batches is much larger than the number of target training batches. Third, in the fine-tuning stage, we construct the target model based on the pre-trained SBM and fine-tune all the end-to-end parameters of the target model. The detailed training process is described in Algorithm 1.

Algorithm 1
The training process of STLAT.
In the preprocessing stage, we create features for modelling the market state for each trading day. Specifically, we utilise the 11 features described in Adv-LSTM (Feng, Chen et al., 2019) and add 10 auxiliary features. These technical features indicate the moving average or the rate of change of the stock market. The details of all these features are elaborated in Table 1.
Finally, at trading day t, the associated example has 21 temporal features that are labelled negative or positive according to the movement percentage p ≤ −0.5% or p ≥ 0.55%, respectively, where p = adj_close_{t+1}/adj_close_t − 1. This leaves the total samples of the three datasets divided as 48.62% and 51.38% between the two classes. Note that adj_close represents the adjusted closing price, which modifies a stock's closing price to accurately reflect the stock's value after accounting for corporate actions such as stock splits and stock dividends.
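The labelling rule can be made concrete with a short function. Returning None for examples between the two thresholds reflects the common practice of dropping ambiguous days; the function name and sample prices are our own.

```python
def label_day(adj_close_t, adj_close_t1, lo=-0.005, hi=0.0055):
    """Label a trading day from the next day's adjusted close:
    negative (0) if p <= -0.5%, positive (1) if p >= 0.55%,
    otherwise the example is dropped (None)."""
    p = adj_close_t1 / adj_close_t - 1.0
    if p <= lo:
        return 0
    if p >= hi:
        return 1
    return None

# Three hypothetical next-day closes against a close of 100.0:
labels = [label_day(100.0, c) for c in (99.2, 100.3, 100.6)]
```

A −0.8% move labels negative, +0.3% falls inside the dead zone and is dropped, and +0.6% labels positive.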

Baselines and metrics
We compare our model with the following baselines:
• RAND: the simplest predictor, making a random guess (up or down) with equal probability for each direction.
• ARIMA (Brown, 2004): Autoregressive Integrated Moving Average, which models historical prices as a nonstationary time series.
• RF (Kumar & Thenmozhi, 2006): a random forest classifier, a popular ensemble model that makes decisions according to features.
• LSTM (Hochreiter & Schmidhuber, 1997): Long Short-Term Memory network, a special RNN model.
• DA-RNN (Qin et al., 2017): a temporal attention mechanism applied to an LSTM-based model; we convert the regression predictor of the model to a classification predictor to suit our problem.
• SFM: a novel State Frequency Memory (SFM) recurrent network that captures the multi-frequency trading patterns from past market data to make long- and short-term predictions over time.
• Adv-ALSTM (Feng, Chen et al., 2019): an attention-based LSTM network which uses adversarial training and performs better than ALSTM.
• MAN-SF (Sawhney et al., 2020): an architecture that combines a potent blend of chaotic temporal signals from financial data, social media, and inter-stock relationships via a graph neural network in a hierarchical temporal fashion; it is the state-of-the-art model.
Following previous work on stock prediction (Ding et al., 2015), two standard metrics, accuracy (Acc) and Matthews Correlation Coefficient (MCC), are adopted for evaluating the prediction performance. Higher Acc and MCC indicate better performance.
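Both metrics follow standard formulas over the binary confusion matrix; the sketch below uses made-up labels for illustration.

```python
import numpy as np

def acc_mcc(y_true, y_pred):
    """Accuracy and Matthews Correlation Coefficient for binary
    up (1) / down (0) predictions, computed from the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

acc, mcc = acc_mcc([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```

Unlike accuracy, MCC stays informative under class imbalance, which is why both are reported.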

Training protocol
Different from the baselines, our framework is under the transfer learning schema. We adopt the leave-one-out strategy to construct the source domain data for pre-training and the target domain data for fine-tuning. First, for an individual stock S in the dataset, all the remaining stocks except S are utilised in the pre-training stage. Second, we fine-tune S based on the pre-trained model and report the performance on the test set of the model with the best validation performance. The final performance on a dataset is the average result over all contained stocks. We train our model's parameters using the gradient-based optimiser Adam, with an initial learning rate of 0.01 and a batch size of 128. We set 200 epochs for the pre-training stage and fine-tune the model for 200 additional epochs. To avoid overfitting, an early stopping strategy is adopted during the training process.

Results
We show the performance of the baselines and our proposed model in Table 2. From the results, we observe that our model achieves both the best Acc and the best MCC on all three datasets. The RAND predictor achieves accuracy around 50%, as expected. LSTM slightly outperforms RF on ACL18 and KDD17. DA-RNN outperforms LSTM, indicating the validity of the attention mechanism. S-TF outperforms LSTM, which shows the effectiveness of data selection even though it uses a simple similarity-based method to select training samples. SFM introduces frequency-aware sequential embeddings, which help it outperform DA-RNN and LSTM. Adv-ALSTM shows better performance because it introduces adversarial training into the prediction model. Significantly, our model outperforms the previous state-of-the-art model MAN-SF by a large margin. The proposed STLAT achieves the best accuracy of 64.56 (6.18%↑) on ACL18, 58.27 (7.60%↑) on KDD17 and 58.33 (9.15%↑) on CN50, even though social media and inter-stock relationships are introduced in MAN-SF, the best baseline. The main reason is that we introduce the transfer learning mechanism and an effective data selector to enhance the prediction. Meanwhile, our model achieves an analogous improvement in MCC. The results demonstrate consistently better performance, which indicates the effectiveness and robustness of our STLAT.
Moreover, to make an elaborate analysis of all the primary components within our proposed framework, we construct some variations of STLAT. The results are shown in Table 2.
Compared with the full model STLAT, we observe that eliminating any of the proposed modules hurts the performance significantly. More specifically, removing the transfer learning (pre-training stage) leads to the largest accuracy decrease on all datasets. This observation shows the advantage of transfer learning in stock movement prediction. Meanwhile, the data selector contributes to the best performance because of its ability to determine which source domain training samples are beneficial for transferring. Furthermore, adversarial training is also a convincing strategy to enhance generalisation, as it constrains the predictions to be consistent under slight perturbations. Note that the model STLAT-TL slightly outperforms Adv-ALSTM. The main reason lies in the adoption of the KL divergence loss instead of the hinge loss, which demonstrates the effectiveness of the designed KL divergence loss. The results reveal the effectiveness of each proposed component and of the elaborate full framework STLAT.
We also calculate other metrics for further comparison with the two best baselines on the ACL18 dataset: specificity, precision, sensitivity and F1 score. The results are shown in Table 3. From the results, we observe that our model achieves the best results on all four metrics among the two best baselines, which demonstrates the effectiveness of our model. Meanwhile, the best baseline MAN-SF does not outperform the Adv-ALSTM model on all metrics. Although the proposed STLAT architecture looks somewhat complicated due to the pre-training and fine-tuning process, the training and prediction time of our model for a stock is close to that of the other baselines.

Impact of parameters
We further discuss the effect of hyper-parameters on our model STLAT. We first investigate the impact of the perturbation control factor, varying its value as ε ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} across different scales. As shown in Figure 2, the optimal value of ε is associated with the best performance around the mid-section of the value range. This result indicates that too small or too large a perturbation in adversarial training cannot benefit the robustness of the model. Moreover, the performance on the three benchmarks peaks at different values (in terms of Acc and MCC), implying that stocks in diverse markets have different noise tolerance values.
To demonstrate the impact of the time-step lag size, we vary its value as ι ∈ {3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30}. As shown in Figure 3, the optimal value of ι is associated with the best performance around 8 ∼ 10 on ACL18 and KDD17. This indicates that too small or too large a time-step lag size cannot benefit the prediction of the model. The main reason is that data from long ago has little impact on current forecasting, while a small time-step size may lose some effective information. Moreover, the performance (Acc and MCC) on the ACL18 and KDD17 datasets peaks at almost the same value, but the performance on CN50 has a different peak value, implying that stocks in different markets (U.S. and China) have different characteristics.

Impact of data selector
To make an elaborate analysis of the impact of the data selector module in our proposed framework, we collect the full stock list of the S&P 500, which includes 500 large companies in the United States and contains 11 sectors, 24 industry groups, 67 industries and 156 sub-industries as of 2019. Furthermore, the S&P 500 is one of the best representations of the U.S. stock market. We compare four strategies based on our STLAT model for testing the effectiveness of the data selector in financial time series forecasting. These strategies are numbered S_1 to S_4. All of them utilise the leave-one-out strategy to construct the source domain data for pre-training and the target domain data for fine-tuning. The selection of source domain data is as follows:
• S_1: all stocks, without selection.
• S_2: selection according to market capitalisation.
• S_3: selection according to sector.
• S_4: selection according to historical similarity.
The first strategy S1 trains STLAT with all stocks in the S&P 500. The second strategy S2 divides all stocks equally into 10 groups by market capitalisation; for instance, for each target stock in the first group, the number of source domain stocks is 49. Similarly, the third strategy S3 divides all stocks into 24 groups according to the aforementioned industry-group partition of the S&P 500. The last strategy S4 selects, for each target stock, the top 10% most similar stocks (computed by cosine similarity). Table 4 compares the strategies with and without the data selector on the S&P 500, where Sk+DS denotes the kth strategy equipped with the data selector. The performance reported in the table is the average over all contained stocks. From Table 4, we have the following observations: (i) All strategies with the data selector achieve better results than those without it. This justifies the effectiveness of the data selector, which might be due to selecting the most relevant time series to enhance prediction. (ii) Comparing S2 with S1 shows that improper selection of source domain stocks may decrease accuracy. (iii) Among the strategies without the data selector, S4 achieves the best performance, mainly because its source domain stocks correlate better with the target stock than those of the other strategies.
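The selection step of strategy S4 can be sketched as below. The tickers and return series are hypothetical, and the paper does not specify which feature series the similarity is computed over, so raw return sequences are assumed here for illustration.

```python
# Sketch of strategy S4: rank candidate source stocks by cosine
# similarity to the target stock's historical series, keep the top 10%.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length series."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def select_top_similar(target_series, candidates, frac=0.10):
    """Keep the `frac` most similar candidates to the target (S4).
    `candidates` maps ticker -> series; ties are broken alphabetically
    by ticker for determinism."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: (-cosine(target_series, kv[1]), kv[0]))
    k = max(1, int(len(ranked) * frac))
    return [ticker for ticker, _ in ranked[:k]]

# Toy demo with hypothetical daily-return series.
target = [0.01, -0.02, 0.03, 0.01]
candidates = {
    "AAA": [0.02, -0.04, 0.06, 0.02],    # scaled copy of the target
    "BBB": [-0.01, 0.02, -0.03, -0.01],  # anti-correlated
    "CCC": [0.00, 0.01, 0.00, -0.01],
}
print(select_top_similar(target, candidates, frac=0.34))
```

The selected tickers would then form the source domain for pre-training, with the target stock held out for fine-tuning, matching the leave-one-out setup described above.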

Discussion
In this section, we conduct an in-depth analysis of our proposed method from three aspects: main findings, limitations and industrial significance. For the main findings, we find that transfer learning is helpful for stock trend prediction, especially when the most relevant training samples are introduced through the data selector module. A plausible explanation is that investors may apply the same strategy to different stocks at different times, so training samples from other stocks can be helpful for a specific stock. In addition, adversarial training improves the generalisation ability of neural network prediction models, which benefits stock movement prediction.
Regarding limitations, introducing relevant training samples from other stocks can indeed improve the accuracy of stock movement prediction, and in this paper we use an efficient data selector for this purpose. However, the selector lacks interpretability and cannot identify which trading patterns are helpful. Interpretability is important in the investment field, and it is a focus of our future work.
Regarding industrial significance, investors can gain or lose a great deal depending on whether they predict stock movements correctly. They usually treat each stock independently and thus miss much useful information that other stocks can provide. Our STLAT model offers investors a new perspective and can help them make better investment decisions.

Conclusion and future work
In this paper, we propose a novel framework (STLAT) with selective transfer learning and adversarial training for stock movement prediction. STLAT applies transfer learning to effectively utilise source domain data, using the data selector to filter out useless source data and prevent negative transfer. Adversarial training is also exploited to improve generalisation against the stochasticity of stocks. Experiments demonstrate that the proposed approach outperforms competitors and achieves state-of-the-art results. Ablation and parameter analyses also confirm the effectiveness of our framework.
In the future, we intend to explore the following research directions: (i) A correct prediction may yield only a small profit while a wrong prediction may incur a large loss, so a forecasting scheme with many successful predictions can still lose money overall. We will therefore explore solutions that directly optimise the investment objective, i.e. selecting the stock with the highest expected revenue. (ii) This paper considers only historical stock prices to predict future movement directions, whereas other factors such as market sentiment and political events can also affect stock movements. We will leverage external data derived from news and from financial and political events for better prediction.