Feature recommendation strategy for graph convolutional network

The Graph Convolutional Network (GCN) is a recent method for extracting, learning, and performing inference on graph data; it builds an embedded representation of a target node by aggregating information from neighbouring nodes. GCNs have proved decisive for node classification and link prediction tasks in recent research. Although existing GCNs perform well, we argue that current designs ignore the potential features of nodes. In addition, the presence of features with low correlation to a node can likewise limit the learning ability of the model. To address these two problems, we propose a Feature Recommendation Strategy (FRS) for Graph Convolutional Networks in this paper. The core of FRS is to employ a principled approach to capture both node-to-node and node-to-feature relationships for encoding, then to recommend the maximum possible features of nodes and replace low-correlation features, and finally to use a GCN to learn from the features. We perform a node classification task on three citation network datasets and experimentally demonstrate that FRS improves learning on challenging tasks relative to state-of-the-art (SOTA) baselines.


Introduction
GCN is a powerful graph tool that can be applied to arbitrarily structured graph data. It has various application scenarios, such as computer vision (X. Zhao et al., 2021), knowledge graphs, traffic prediction (N. C. Zhao et al., 2022), and gait recognition. Wu et al. (2020) proposed combining GCN with Markov Random Fields (MRF) to design a spammer detection model that considers the impact of multiple email senders on social network graphs. Shang et al. (2019) fully exploited the node structure, node attributes, and edge relationship types in the knowledge graph by uniting GCN and ConvE to accomplish accurate node embedding. Zheng et al. (2020) designed a Graph Multi-Attention Network (GMAN) to forecast traffic conditions at various points in a road network graph for the next few time steps. In addition, GCN can also be adopted in computer vision. For example, Huang et al. (2020) explored applying GCN to learn high-level relationships between body parts for skeleton-based action recognition. J. introduced a multi-relational GCN to recognise images of a given dish with zero training samples to accomplish the automatic diet evaluation task. The most usual approach in current research is to take full advantage of label information, node characteristics, and network topology. For example, Qin et al. (2021) used both the given labels and the estimated labels to learn the task. Feng et al. (2021) focussed on cross-features and designed an operator called Cross-feature Graph Convolution, which can model cross-features of arbitrary order. L. Yang et al. (2019) investigated the correlation between node features and network topology and proposed the Topology Optimised Graph Convolutional Network (TO-GCN).
In general, GCN is mapping the original features of nodes into the low-dimensional space by multiple graph convolution layers. As illustrated in Figure 1, a classical GCN model usually has two parts: node aggregation and feature transformation. The former enhances the representation of the target node by fusing the information of the surrounding neighbouring nodes. The latter converts the input features into a better description of the node's feature representation.
Current research focuses on developing various aggregation methods for different connection characteristics. For example, Kipf and Welling (2016) based aggregation on local node similarity, while Donnat et al. (2018) considered structural similarity. The transformation of node features is equally vital, with examples including feature cross-fusion (Feng et al., 2021) and randomly covered features (Zhu et al., 2020). In addition, there are many feature-related works. For example, Liu et al. (2020) proposed the Deep Adaptive Graph Neural Network (DAGNN) by decoupling feature transformation from information propagation. M. developed an extended model of the vanilla GCN that preserves input information with an identity mapping. Liu et al. (2021) captured remote dependencies among node features with non-local aggregation. X. Yang et al. (2021) proposed the self-supervised semantic-aligned convolutional network (SelfSAGCN) to investigate the semantic information in features. We argue that existing GCN approaches ignore the potential features of nodes. Although the existing GCN has an extraordinary ability to capture features, we believe that encoding potential features through a feature recommendation strategy can significantly improve the learning capability of the model. For example, consider the features of two users' electronic purchase records: UserA = {Camera, Phone, Computer, Watch} and UserB = {Phone, Computer, Game machine, Watch, DVD}. After feature recommendation, user A's most likely additional feature is the Game machine, while user B is not likely to have any features besides his own. The features of the two users are then described as UserA = {Camera, Phone, Computer, Watch, Game machine} and UserB = {Phone, Computer, Game machine, Watch, DVD}. As shown in Figure 2, a feature recommendation module can capture user-item interactions and opinions to obtain the potential features of users and make more accurate product recommendations (Fan et al., 2019).
In this paper, we propose FRS, which is inspired by GraphRec (Fan et al., 2019), a social recommender system initially proposed to solve the problem of encoding heterogeneous graphs (the user social graph and the user-item graph) in social recommendation. We regard nodes and features as users and items, respectively, and use a feature recommendation strategy to enhance the learning ability of the GCN model. Two methods are explicitly included in the feature recommendation strategy. The first is Maximum Possible Feature Recommendation (MPFR), which obtains the maximum possible features of nodes with the feature recommendation module. The second is Low-Weight Feature Replacement (LWFR), in which low-weight features are replaced with the maximum possible features after sorting the personalised weights of the features.
We evaluated the performance of GCN+FRS on three citation network datasets for the node classification task, and the experimental results demonstrate that both the MPFR and LWFR methods utilised by FRS can improve the learning ability of GCN.
In summary, our contribution is twofold:
• We propose a feature recommendation strategy, including maximum possible feature recommendation and low-weight feature replacement methods, that can improve the performance of the GCN model.
• Experimentation with node classification as a learning task on three publicly available datasets demonstrates that GCN+FRS significantly outperforms SOTA methods.
Organisation of our paper
The rest of our paper is organised as follows. Section 2 outlines the background of graph convolutional networks and feature recommendation, respectively. Section 3 provides the necessary preliminary knowledge.
Section 4 summarises the overall model framework and describes each module's specifics. Section 5 reports the experimental results and experimental analysis. Section 6 concludes the paper.

Graph convolutional network
Graph convolutional networks have become an essential tool in graph data analysis tasks, and the mainstream methods can be divided into two kinds: spatial convolution and spectral convolution. The spatial convolution-based approach defines the convolution operation on the spatial relationships of nodes, learning and updating each representation from the neighbouring nodes. For example, Gilmer et al. (2017) proposed the Message Passing Neural Network (MPNN), which views graph convolution as a message-passing process in which information can be passed directly from one node to another along an edge. Hamilton et al. (2017) suggested GraphSAGE, an inductive method, in contrast to transductive network representation learning; it utilises both the feature information and the structure information of nodes to compute and store graph embeddings, which makes it more scalable. Atwood and Towsley (2016) designed a diffusion-convolutional neural network that adopts an H-hop matrix representation for each node (edge or graph), where each hop represents the neighbouring information within that range; the network can better capture local information. Monti et al. (2017) developed a hybrid model (MoNet) that generalises the traditional CNN to non-Euclidean spaces (graphs and manifolds) and can learn local, smooth, combinatorial task-specific features. In addition, Veličković et al. (2017) employed an attention mechanism for the weighted summation of the features of neighbouring nodes, which addresses two critical drawbacks. The first is that the features of neighbouring nodes are closely tied to the graph's structure, limiting the model's generalisation ability. The second is that the model assigns the same weights to various neighbouring nodes of the same order.
Unlike spatial convolution-based GCNs, the spectrum-based GCN methods implement the convolution operation on topological graphs through graph theory. In the first place, Bruna et al. (2013) developed a spectral convolutional neural network (Spectral CNN), but the model suffers from computational complexity, non-local connectivity, and too many convolutional kernel parameters to scale to large graphs; in particular, the eigenvalue decomposition of the Laplacian it requires is computationally enormous. To address these challenges, Hammond et al. (2011) showed that spectral filters can be fitted by a Kth-order truncated expansion of Chebyshev polynomials, and Defferrard et al. (2016) built on this result to design ChebNet, which reduces the computational complexity by approximating the filter with Chebyshev polynomials of the diagonal matrix of eigenvalues. Finally, Kipf and Welling (2016) proposed the first-order approximation of ChebNet, a simple and effective layer propagation method obtained by simplifying the computation through a first-order approximation. The method has the advantages of weight sharing and local connectivity, with a receptive field proportional to the number of convolutional layers. In addition, some approximations to the Chebyshev polynomial method have been proposed for performing local polynomial filtering. For example, Levie et al. (2018) suggested employing Cayley polynomial approximation filters, and Liao et al. (2019) proposed multi-scale feature encoding to break the computational bottleneck of existing models.

Feature recommendation
The user-item interaction is a specific kind of graph data in recommendation tasks, and utilising graph convolutional networks to solve recommendation problems is already a relatively common approach. We simplify the user-item interaction to two relationships, i.e. whether or not an item is owned. For recommender systems, we can regard the recommendation task as predicting which new features the nodes are most likely to have, i.e. the recommendation system is transformed into a feature recommendation system. Berg et al. (2017) proposed a graph auto-encoder framework that generates latent representations of users and items by passing differentiable messages on the graph structure to accomplish the link prediction task. Bian et al. (2020) exploited a directed graph with top-down rumour propagation to learn how rumours spread, and designed a diffusion graph GCN with the opposite direction to capture the rumour diffusion structure and learn feature representations for rumour detection. Wu et al. (2020) suggested a novel social spammer detection model that explicitly considers three types of neighbour structures to determine the most likely features of spammers. Furthermore, the feature interactions proposed by Feng et al. (2021) can also be regarded as combining existing features into new features.
Notably, the heterogeneity of relationships is also a focus of research (Chang et al., 2021). X. Wang et al. (2021) proposed a cross-view contrastive mechanism for heterogeneous graph neural networks (HGNNs). Zhang et al. (2019) integrated the heterogeneous structure information and the heterogeneous content information of each node to jointly learn node representations in a heterogeneous graph. X. Wang et al. (2019) suggested an HGNN based on a hierarchical attention mechanism that generates node embeddings hierarchically. There is also other work on heterogeneous relationships (J. Lian & Tang, 2022). Despite the convincing success of these works, little attention has been paid to applying feature recommendation to GCNs. In this paper, we attempt to utilise a feature recommendation strategy to generate new features of nodes, which enhances the representation capability of GCN models.

Preliminary knowledge
We start by introducing some of the symbols utilised in the following sections. We denote vectors and matrices with bold lowercase letters (e.g. x) and bold uppercase letters (e.g. X). Note that all vectors are in column form unless otherwise specified, and X_ij represents the element in the ith row and jth column of matrix X. Finally, we use ⊙ to denote element-by-element multiplication. The terms and symbols commonly used in this paper are given in Table 1.

Graph convolutional network
A graph G usually consists of an adjacency matrix (A ∈ R^{n×n}) and a feature matrix (X = [x_1, x_2, ..., x_n]^T ∈ R^{n×d_0}), where n is the number of nodes and A indicates the connections between nodes: A_ij = 1 means an edge exists between node i and node j; otherwise, A_ij = 0. d_0 denotes the dimensional size of the input features. The basic process of GCN is to map the input graph into a low-dimensional space through the hidden layers, learn the embedding representation of the nodes, and finally connect an output layer that performs the learning task (e.g. node classification). To simplify the notation and representation, we describe the content in terms of a single-layer GCN. As illustrated in Figure 1, the GCN performs a node aggregation operation on the input and then obtains the node representation by feature transformation. The entire procedure can be described as:

h_out = FT(x_agg),  x_agg = AGG({x_j, ∀ j ∈ neighbour(x)}),

where AGG denotes the aggregation function over neighbouring nodes, x_agg ∈ R^{d_1} indicates the potential features of the node resulting from the AGG operation, FT represents the feature transformation function, and h_out ∈ R^{d_2} is the final rich representation of the target node.

Table 1. Symbols used in this paper.

Symbols            Descriptions
X ∈ R^{n×d_0}      feature matrix of a graph with n nodes.
A ∈ R^{n×n}        adjacency matrix of a graph with n nodes.
neighbour(x)       neighbouring nodes of a node x.
N(i)               set of first-order neighbouring nodes of a node.
AGG                node aggregation function.
FT                 feature transformation function.
d_l                dimension of the node representation.
U, V               sets of nodes and features.
O                  set of known association relations.
T                  set of unknown association relations.
C(i)               set of features of a node.
B(j)               set of nodes containing a particular feature.
σ                  nonlinear activation function.
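As a concrete illustration, the aggregate-then-transform procedure of a single layer can be sketched in NumPy; the mean aggregator, ReLU activation, and the weight shape used here are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def gcn_layer(A, X, W, b):
    """One GCN layer: AGG (mean over neighbours incl. self), then FT."""
    A_hat = A + np.eye(A.shape[0])           # add self-connections
    deg = A_hat.sum(axis=1, keepdims=True)   # neighbourhood sizes
    X_agg = (A_hat @ X) / deg                # AGG: mean of neighbouring features
    return np.maximum(0.0, X_agg @ W + b)    # FT: ReLU(x_agg W + b)

# toy graph: 3 nodes in a path, 2 input features, 4 output dimensions
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.ones((3, 2))
H = gcn_layer(A, X, np.ones((2, 4)), np.zeros(4))
print(H.shape)  # (3, 4)
```

Stacking several such layers, each with its own W and b, yields the multi-layer mapping described above.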

Node aggregation
Node aggregation updates the information of the target node by aggregating that of its neighbouring nodes. The basic principle is that the feature information of neighbouring nodes reflects that of the target node. For example, in a citation network, a target paper with cross-citation relationships with multiple papers in a field is likely to belong to the same field. Notice that neighbour(x) can contain self-connections (from x to x), and sampling can fetch either all neighbouring nodes (Kipf & Welling, 2016) or a fixed number of random neighbouring nodes (Hamilton et al., 2017). Furthermore, different AGG methods capture different information about neighbouring nodes (Duan et al., 2021; Hamilton et al., 2017; Kipf & Welling, 2016; Xu et al., 2018): mean pooling over a fixed number of neighbours obtains the common attributes of nodes, while max pooling obtains the most salient features among nodes.
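As a minimal sketch (not the authors' implementation), mean and max aggregation over a node's sampled neighbours can be written as:

```python
import numpy as np

def aggregate(X, neighbours, mode="mean"):
    """Aggregate the feature vectors of a node's (sampled) neighbours."""
    feats = X[neighbours]
    if mode == "mean":          # common attributes of the neighbourhood
        return feats.mean(axis=0)
    if mode == "max":           # most salient feature in each dimension
        return feats.max(axis=0)
    raise ValueError(f"unknown mode: {mode}")

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
print(aggregate(X, [0, 2], "mean"))  # [3. 4.]
print(aggregate(X, [0, 2], "max"))   # [5. 6.]
```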

Feature transformation
Feature transformation is an operation that obtains a potentially rich representation by projecting the target node features into a new latent space. Existing GCNs usually adopt a nonlinear activation mapping:

FT(x) = σ(W x + b),

where W ∈ R^{d_2×d_0} represents the weight matrix, b ∈ R^{d_2} denotes the bias vector, and σ indicates the activation function.

Feature recommendation
We employ the adjacency matrix and the feature matrix as inputs to the feature recommendation module to predict the maximum possible features of the nodes. Let U = {u_1, u_2, ..., u_n} and V = {v_1, v_2, ..., v_m} represent the sets of nodes and features, respectively, where n and m denote the numbers of nodes and features. Assume that R ∈ R^{n×m} denotes the association matrix of nodes and features (i.e. the feature matrix X). Here we define O = {(u_i, v_j) : R_ij = 1} as the set of known association relations and T = {(u_i, v_j) : (u_i, v_j) ∉ O} as the set of unknown association relations. Let N(i) be the set of first-order neighbouring nodes of u_i, C(i) the set of features of u_i, and B(j) the set of nodes containing feature v_j. Furthermore, the adjacency matrix is expressed as A ∈ R^{n×n}: A_ij = 1 if there is a relationship between nodes u_i and u_j; otherwise A_ij = 0. Knowing the feature matrix R and the adjacency matrix A, we aim to predict the potential features of each node and obtain the maximum possible features of the nodes. We also represent node u_i with an embedding vector p_i ∈ R^{d_0} and feature v_j with an embedding vector q_j ∈ R^{d_0}.
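The sets O and T can be derived directly from the association matrix R; the following sketch assumes a binary R and enumerates the pairs explicitly for clarity:

```python
import numpy as np

def association_sets(R):
    """Split all (node, feature) pairs into known associations O (R_ij = 1)
    and unknown associations T (everything else)."""
    n, m = R.shape
    all_pairs = {(i, j) for i in range(n) for j in range(m)}
    O = {(i, j) for (i, j) in all_pairs if R[i, j] == 1}
    return O, all_pairs - O

R = np.array([[1, 0], [0, 1]])
O, T = association_sets(R)
print(sorted(O))  # [(0, 0), (1, 1)]
```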

Feature recommendation strategy for graph convolutional network
This section begins with a general introduction to the FRS framework, followed by detailed descriptions of the feature recommendation module and the low-weight feature replacement module, respectively, and finally explains how the GCN module performs feature learning.

Overview of our proposed framework
The framework of our proposed model is illustrated in Figure 3, which consists of three modules: feature recommendation module, low-weight feature replacement module, and GCN module. The first module is feature recommendation, i.e. recommending the potential features of each node. Since the recommendation module has two different inputs (adjacency matrix and feature matrix), we will learn the representation of nodes from different perspectives. The second module primarily completes the low-weight feature replacement. After the association weights between nodes and features are obtained from the recommendation module, we use the maximum possible recommended features to replace the features of low-weight connections. Finally, there is the GCN module, where we input the original adjacency and the processed feature matrices, learn the representation of the nodes, and complete the final classification task.

Feature recommendation module
The overall structure of the feature recommendation module is illustrated in Figure 4, which consists of three components: node modelling, feature modelling, and relationship prediction. The first component is node modelling, i.e. learning the latent factors of the nodes. Since the input contains two different matrices (the adjacency matrix and the feature matrix), we learn the representation of the nodes from two perspectives, and therefore introduce two aggregation operations to handle the corresponding inputs. One is feature aggregation, which understands nodes through the feature matrix, i.e. through whether an association relationship exists between nodes and features. The other is neighbour aggregation, which captures the relationships among nodes and helps us model a node from the perspective of its neighbour relationships. The second component is feature modelling, which learns the potential factors of features. Having already understood the nodes from the feature perspective, we model the features from the node perspective (which nodes own a given feature) to thoroughly understand the connection between nodes and features. The final component integrates the previous two to learn the model parameters through feature prediction. We describe each component in detail below.

Node modelling
The purpose of node modelling is to learn the potential factors of nodes, and the challenge is to figure out how to combine the adjacency matrix and the feature matrix to represent node u_i as h_i ∈ R^{d_3}. To better learn the node representation, we design two types of aggregation operations: feature aggregation and neighbour aggregation. As shown in the left part of Figure 4, feature aggregation learns the potential representation h^F_i ∈ R^{d_4} of a node through the feature matrix, while neighbour aggregation learns the potential representation h^A_i ∈ R^{d_4} through the adjacency matrix. The potential representation h_i is then obtained by combining the two. The following sections detail feature aggregation, neighbour aggregation, and how the node representation is learned from the adjacency matrix and the feature matrix.

A. Feature aggregation.
We provide a principled approach for learning the latent node factors h^F_i in the node-feature space by capturing the interactions of nodes and features in the feature matrix. The purpose of feature aggregation is to learn h^F_i from the features of node u_i. The following function represents this aggregation:

h^F_i = σ(W · AGG_feature({x_ia, ∀ a ∈ C(i)}) + b),

where C(i) indicates the feature set of node u_i, x_ia denotes the representation vector of the association relationship between node u_i and feature v_a, AGG_feature is the aggregation operation, σ(·) represents the nonlinear activation function, W denotes the trainable weight matrix, and b indicates the trainable bias vector. In the following, we describe how to define the association representation x_ia and the aggregation function AGG_feature. The association relationship between nodes and features is indicated as r. The preference of nodes for features can be captured through the association relationship, which helps model the potential factors in the node-feature space. To model the association relationship, we introduce an embedding vector e_r ∈ R^{d_0} for each of the two association relationships (associated and unassociated). We first combine the feature representation q_a and the association-relation representation e_r for the interaction between node u_i, feature v_a, and association relation r, followed by a multilayer perceptron (MLP) to obtain the interaction representation x_ia:

x_ia = g_v([q_a ⊕ e_r]),

where ⊕ represents the concatenation of two vectors and g_v indicates the fusion function. AGG_feature was initially taken to be mean aggregation, i.e. taking the element-wise mean of all vectors in {x_ia, ∀ a ∈ C(i)}. This aggregator is similar to the first-order linear approximation of a local convolution (Kipf & Welling, 2016) and can be expressed as:

h^F_i = σ(W · (Σ_{a∈C(i)} α_i x_ia) + b),    (5)

where α_i is fixed to 1/|C(i)|.
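A minimal sketch of the fusion function g_v and the uniformly weighted aggregator follows; the tanh activation and single-layer fusion are simplifying assumptions rather than the paper's exact MLP:

```python
import numpy as np

def fuse(q_a, e_r, W_v, b_v):
    """g_v: fuse a feature embedding q_a with its relation embedding e_r."""
    return np.tanh(W_v @ np.concatenate([q_a, e_r]) + b_v)

def mean_feature_aggregation(x_list, W, b):
    """h_i^F with uniform weights alpha_i = 1/|C(i)| (mean aggregator)."""
    x_mean = np.mean(x_list, axis=0)
    return np.tanh(W @ x_mean + b)

d = 4
rng = np.random.default_rng(0)
x_list = [fuse(rng.normal(size=d), rng.normal(size=d),
               rng.normal(size=(d, 2 * d)), np.zeros(d)) for _ in range(3)]
h_F = mean_feature_aggregation(x_list, rng.normal(size=(d, d)), np.zeros(d))
print(h_F.shape)  # (4,)
```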
In this aggregator, all feature interactions contribute equally to node u_i. However, this is not optimal for understanding a node, so we assign a different weight to each interaction to represent its different contribution to the potential factors of the node. We generate the feature attention α_ia with a multilayer neural network, called the attention network, i.e. we assign a personalised weight to each pair (v_a, u_i). Equation (5) can then be rewritten as:

h^F_i = σ(W · (Σ_{a∈C(i)} α_ia x_ia) + b).

For the attention network, the inputs are the interaction representation x_ia and the embedding representation p_i of the node. The attention network can be defined formally as:

α*_ia = w_2^T · σ(W_1 · [x_ia ⊕ p_i] + b_1) + b_2.

Finally, the attention weights are normalised by the softmax function, and the contribution of each association relationship to node u_i in the node-feature space is obtained as:

α_ia = exp(α*_ia) / Σ_{a'∈C(i)} exp(α*_ia').
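The attention network and softmax normalisation can be sketched as follows (the tanh activation and single hidden layer are assumptions for illustration):

```python
import numpy as np

def attention_weights(x_list, p_i, W1, b1, w2, b2):
    """Personalised weights alpha_ia: two-layer attention net + softmax."""
    scores = np.array([w2 @ np.tanh(W1 @ np.concatenate([x, p_i]) + b1) + b2
                       for x in x_list])
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

d = 4
rng = np.random.default_rng(1)
x_list = [rng.normal(size=d) for _ in range(3)]
alpha = attention_weights(x_list, rng.normal(size=d),
                          rng.normal(size=(d, 2 * d)), np.zeros(d),
                          rng.normal(size=d), 0.0)
print(round(float(alpha.sum()), 6))  # 1.0
```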

B. Neighbour aggregation.
A node's feature preferences are similar to those of its neighbouring nodes, so we combine the information of neighbouring nodes to obtain richer potential factors for the node. To understand nodes from their interactions, we employ the adjacency matrix to aggregate the potential factors of neighbouring nodes. In the node-adjacency space, the potential factor h^A_i of node u_i is formed by aggregating over the neighbouring nodes N(i):

h^A_i = σ(W · AGG_neighbours({h^F_o, ∀ o ∈ N(i)}) + b),

where AGG_neighbours represents the aggregation operation over neighbouring nodes.
AGG_neighbours is first taken to be mean aggregation, which averages the vectors in {h^F_o, ∀ o ∈ N(i)}:

h^A_i = σ(W · (Σ_{o∈N(i)} β_i h^F_o) + b),

where β_i is fixed to 1/|N(i)|. The mean aggregator assumes all neighbouring nodes contribute equally to the target node. However, this is likewise not optimal, so we apply the same attention mechanism to generate personalised weights indicating the importance of different neighbouring nodes to u_i:

h^A_i = σ(W · (Σ_{o∈N(i)} β*_io h^F_o) + b),

where β*_io represents the personalised weight of neighbouring node u_o.

C. Learning node latent factor.
To learn the potential representations of nodes more effectively, we need to consider both the potential representation in the node-feature space and the potential representation in the node-neighbour space. We utilise an MLP to combine the two into the final potential representation of the node, where the node-feature-space representation is h^F_i and the node-neighbour-space representation is h^A_i. The potential representation h_i of the node can therefore be defined as:

c_1 = [h^F_i ⊕ h^A_i],
c_l = σ(W_l · c_{l−1} + b_l),
h_i = c_L,

where l indicates the serial number of the hidden layer and L is the number of layers.

Feature modelling
As illustrated on the right side of Figure 4, feature modelling is utilised to obtain a potential representation z_j of feature v_j in the feature-node space by aggregating over nodes. We learn the potential representations of features by capturing the interactions of nodes with features in the feature matrix. For each feature v_j, we need to capture information from the set of nodes interacting with v_j, denoted B(j). Different nodes have different association relations, even for the same feature. We use an MLP, with fusion function denoted g_u, to fuse the combination of the node representation p_t and the association-relation representation e_r:

f_jt = g_u([p_t ⊕ e_r]),

where f_jt denotes the interaction representation. Then we aggregate the interaction information of the nodes in B(j) for feature v_j. The node aggregation function, denoted AGG_nodes, aggregates the interactions {f_jt, ∀ t ∈ B(j)}, and the potential representation z_j of the feature can be defined as:

z_j = σ(W · AGG_nodes({f_jt, ∀ t ∈ B(j)}) + b).

Similarly, we apply the attention method and utilise a multilayer neural network, with f_jt and q_j as inputs, to acquire the personalised weight of each node for the feature:

z_j = σ(W · (Σ_{t∈B(j)} μ_jt f_jt) + b),

where μ_jt denotes the personalised impact of each node on the learned potential representation of the feature.

Feature recommendation module training
After acquiring the potential representations of nodes and features, we use feature prediction to learn the model's parameters. We first concatenate the two potential representations ([h_i ⊕ z_j]) and then feed them into an MLP for the feature prediction task:

g_1 = [h_i ⊕ z_j],
g_l = σ(W_l · g_{l−1} + b_l),
r'_ij = w^T · g_L,

where l denotes the serial number of the hidden layer and r'_ij represents the predicted association between node u_i and feature v_j. Finally, the new association relations form a new feature matrix, each row representing the new features of a node.
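A sketch of this prediction step, assuming tanh hidden layers and a final linear scalar output:

```python
import numpy as np

def predict_association(h_i, z_j, hidden, w_out):
    """r'_ij: pass the concatenation [h_i ⊕ z_j] through an MLP to a scalar."""
    g = np.concatenate([h_i, z_j])
    for W, b in hidden:
        g = np.tanh(W @ g + b)
    return float(w_out @ g)

d = 4
rng = np.random.default_rng(2)
hidden = [(rng.normal(size=(d, 2 * d)), np.zeros(d))]
r_ij = predict_association(rng.normal(size=d), rng.normal(size=d),
                           hidden, rng.normal(size=d))
print(type(r_ij))  # <class 'float'>
```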

Feature prediction
To determine the parameters of the feature recommendation module, we use the squared Euclidean distance over the known associations as the loss function for training:

Loss = (1 / 2|O|) · Σ_{(i,j)∈O} (r'_ij − r_ij)^2,

where |O| represents the number of known association relations between nodes and features, and r_ij is the true association relation between node u_i and feature v_j.
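The loss over the known associations O can be computed as in this sketch:

```python
import numpy as np

def recommendation_loss(R_pred, R_true, O):
    """Squared-error loss averaged over the known associations in O."""
    return sum((R_pred[i, j] - R_true[i, j]) ** 2 for (i, j) in O) / (2 * len(O))

R_true = np.array([[1., 0.], [0., 1.]])
R_pred = np.array([[0.9, 0.2], [0.1, 0.8]])
O = {(0, 0), (1, 1)}
print(round(float(recommendation_loss(R_pred, R_true, O)), 4))  # 0.0125
```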
To optimise the objective function of the feature recommendation module, we employ Adam (Kingma & Ba, 2014) as the optimiser in practice. In the model training phase, we randomly initialise three embedding representations: the embedding representation p_i of nodes, the embedding representation q_j of features, and the embedding representation e_r of association relations. It is worth noting that the size of e_r depends on the complexity of the relationship between features and nodes. For example, when nodes and features are either associated or not, e_r uses two different embedding vectors to represent {0, 1}. In addition, to prevent overfitting, we adopt the dropout strategy (Srivastava et al., 2014).

Low-weight feature replacement module
To obtain a richer feature representation, we model the nodes and features separately with two aggregation operations in the feature recommendation module and finally obtain, through relationship prediction, the maximum possible recommended features that exceed the threshold value. Meanwhile, the personalised weights of the influence of each node on the features are also available from the training process.
With these considerations, we try to use the maximum possible recommended features to replace the low-weight features of the nodes. We divide the process into two steps: the first is to add the maximum possible recommendation features for each node; the second is to remove the same number of low-weight features.
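For a single node, the two steps can be sketched as follows; the function and variable names are hypothetical, and the recommendation scores and personalised weights are assumed to come from the trained recommendation module:

```python
import numpy as np

def lwfr(owned, rec_scores, weights, k):
    """Add the node's top-k recommended new features, then drop its k
    lowest-weight existing features (keeping the feature count unchanged)."""
    new = [int(j) for j in np.argsort(rec_scores)[::-1] if j not in owned][:k]
    keep = sorted(owned, key=lambda f: weights[f])[k:]   # drop k low-weight feats
    return sorted(keep + new)

rec_scores = np.array([0.1, 0.2, 0.3, 0.9, 0.8])  # predicted relation values
weights = np.array([0.9, 0.1, 0.5, 0.0, 0.0])     # personalised feature weights
print(lwfr([0, 1, 2], rec_scores, weights, k=1))  # [0, 2, 3]
```

Here feature 1 (lowest weight 0.1) is removed and feature 3 (highest recommendation score among features the node does not own) is added.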

GCN module and classification learning tasks
We adopt the model proposed by Kipf and Welling (2016) in the GCN module. Let A ∈ R^{n×n} denote the adjacency matrix and X ∈ R^{n×d_0} the feature matrix; the layer-wise propagation rule can be described as:

H^{(l+1)} = σ(D̃^{−1/2} Ã D̃^{−1/2} H^{(l)} W^{(l)}),

where σ(·) denotes a nonlinear activation function, H^{(l)} ∈ R^{n×d_l} represents the (l − 1)th hidden layer's output and the lth hidden layer's input (with H^{(0)} = X), and W^{(l)} ∈ R^{d_l×d_{l+1}} denotes the trainable parameter matrix. The matrix D̃^{−1/2} Ã D̃^{−1/2} denotes the normalisation of the convolution matrix Ã = A + I_n, where I_n denotes the identity matrix and D̃ denotes the degree matrix of Ã. According to Kipf and Welling (2016), the best implementation is obtained by stacking two GCN layers, as shown in Equation (19):

Z = softmax(Â ReLU(Â X W^{(0)}) W^{(1)}),    (19)

where Â = D̃^{−1/2} Ã D̃^{−1/2}, and W^{(0)} ∈ R^{d_0×d_1} and W^{(1)} ∈ R^{d_1×d_2} are the trainable weight matrices of the corresponding hidden layers. ReLU(·) and softmax(·) denote the two activation functions. Furthermore, let the label matrix be Y ∈ R^{n×f}, defined as Y_ij = 1 if the label of node i is j and Y_ij = 0 otherwise. The cross-entropy loss defines the classification error over the training set:

L = − Σ_{i∈Y_L} Σ_{j=1}^{f} Y_ij ln Z_ij,

where Y_L denotes the set of labelled training nodes.
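A NumPy sketch of the renormalised two-layer forward pass (illustrative shapes; the real model is trained with cross-entropy on the labelled nodes):

```python
import numpy as np

def normalise(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2} (renormalisation trick)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_layer_gcn(A, X, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1), softmax applied row-wise."""
    A_hat = normalise(A)
    H = np.maximum(0.0, A_hat @ X @ W0)
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
A = np.array([[0., 1.], [1., 0.]])
Z = two_layer_gcn(A, rng.normal(size=(2, 3)),
                  rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
print(Z.shape)  # (2, 2); each row sums to 1
```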

Experiments
In this section, we first evaluate the effectiveness of the FRS on three publicly available datasets, then analyse the model's performance from different perspectives, and finally demonstrate the portability and generalisation ability of the strategy.

Datasets
Citation networks consist of documents and the relationships between them: nodes represent documents, labels indicate the topics of documents, features are the bags of words contained in the documents' content, and edges denote the cross-references between documents. We tested the proposed strategy on three citation network datasets: Cora (McCallum et al., 2000), Citeseer (Giles et al., 1998), and Pubmed (Sen et al., 2008). The details of the three datasets are as follows:
• Cora contains 2708 nodes and 5429 edges. All nodes are divided into seven classes, and each node has a 1433-dimensional feature vector.
• Citeseer contains 3327 nodes and 4732 edges. All nodes are divided into six classes, and each node has a 3703-dimensional feature vector.
• Pubmed is a relatively large citation network containing 19,717 nodes and 44,338 edges. All nodes are divided into three classes, and each node has a 500-dimensional feature vector.

Baseline
Since we focus only on the feature aspect of modelling, we ignore baselines based on node-aggregation modelling, such as GAT (Veličković et al., 2017), DCNN (Atwood & Towsley, 2016) and DAGNN (Liu et al., 2020). The more advanced baselines are listed below:
• SemiEmb (Weston et al., 2012) utilises a graph learning method based on Laplace regularisation.
• DeepWalk (Perozzi et al., 2014) is a method for learning graph representations with skip-gram techniques, where node representations are learned by performing random walks through the generated node contexts.
• GCN (Kipf & Welling, 2016) performs feature transformation through matrix mapping and aggregates nodes through the pooling function.
• GIN (Xu et al., 2018) is a generalisation of vanilla GCN that performs feature transformation with an MLP in each convolutional layer.
• Cross-GCN (Feng et al., 2021) learns the hidden representation of nodes by modelling cross-features.
• GCN+RCF (Zhu et al., 2020) improves the feature learning capability of GCN by adopting a strategy of randomly covering features.
• SelfSAGCN (X. Yang et al., 2021) learns nodes with the same label from both semantic and graph-structure perspectives, and aligns node features via class-centre similarity.

Setup
Our implementation of GCN+FRS and SelfSAGCN+FRS uses PyTorch and adopts the public code of GCN and SelfSAGCN. To illustrate the effectiveness of both the MPFR and LWFR methods in the recommendation strategy, we employed the best GCN parameters reported in the original paper (Kipf & Welling, 2016).
In the GCN module, we set the dropout rate to 0.5 and the weight decay to 5e−4. In the feature recommendation module, we dedicate 90% of the node features to training, and the remaining 10% are used to test the model. In addition, we set the batch size to 128, and the relationship between nodes and features is represented with 16-dimensional embeddings. Notably, the threshold value is 0.9, i.e. a predicted relationship value between a node and a feature above 0.9 indicates that the node has this feature. Because the features are sparse, we group the positive samples with an equal number of randomly sampled negative samples into a sample_set, which keeps negative and positive samples balanced during the training procedure. The model works best when the sample_set value is 12. Adam (Kingma & Ba, 2014) optimises both the feature recommendation and GCN modules, with initial learning rates set to 0.01.
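The balanced sampling described above can be sketched as follows (a minimal Python sketch; the function name `build_sample_set` and the `num_neg_per_pos` knob are our own illustration, and the paper's actual sampler may group samples differently):

```python
import random

def build_sample_set(feature_matrix, num_neg_per_pos=1, seed=0):
    """Pair every observed (node, feature) positive with randomly drawn
    unobserved features of the same node, keeping positives and negatives
    balanced during training. Assumes each node has at least one zero entry."""
    rng = random.Random(seed)
    n_feats = len(feature_matrix[0])
    samples = []
    for i, row in enumerate(feature_matrix):
        for j, v in enumerate(row):
            if v != 1:
                continue
            samples.append((i, j, 1))          # positive: node i has feature j
            for _ in range(num_neg_per_pos):
                k = rng.randrange(n_feats)
                while row[k] == 1:             # resample until unobserved
                    k = rng.randrange(n_feats)
                samples.append((i, k, 0))      # balancing negative sample
    rng.shuffle(samples)
    return samples
```

Each training batch then draws from this list, so the recommendation module never sees a heavily imbalanced ratio of observed to unobserved node-feature pairs.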

Experimental analysis
We first test the effect of the MPFR and LWFR methods on the model, then compare the model's performance with SOTA methods, and finally analyse the effect of different parameters on model performance. In addition, to illustrate the model's effectiveness, we run our method over 20 random trials and report the average performance and margin of error.

Impact of the two methods
To test the performance improvement brought by the MPFR and LWFR methods, we evaluated two scenarios on the Cora dataset: using MPFR alone, and using a mixture of the two methods. Table 2 summarises the results of GCN performing the node classification task after applying the different recommendation strategies on the Cora dataset. We observe that both MPFR alone and the mixture of the two methods (MPFR & LWFR) improve the performance of GCN. Comparing the maximum and average classification results, we conclude that the recommendation strategy mixing the two methods yields better performance than using MPFR alone.
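Although the paper gives no pseudocode for MPFR and LWFR, the two methods can be sketched as thresholded updates of the feature matrix, assuming the recommendation module outputs a relationship score in [0, 1] for every node-feature pair (the 0.9 threshold follows the Setup section; the `low_weight` cutoff is a hypothetical parameter of ours):

```python
import numpy as np

def update_features(X, scores, threshold=0.9, low_weight=0.1):
    """Sketch of the two recommendation methods on a binary feature matrix X,
    given predicted node-feature relationship scores in [0, 1].
    MPFR: add missing features whose predicted score clears the threshold.
    LWFR: replace original features whose predicted score is very low."""
    X_new = X.copy()
    X_new[(X == 0) & (scores >= threshold)] = 1   # MPFR: recommend likely features
    X_new[(X == 1) & (scores <= low_weight)] = 0  # LWFR: drop low-correlation features
    return X_new
```

For instance, a feature a node lacks but that scores 0.95 would be recommended, while an original feature scoring 0.05 would be replaced; the updated matrix is then fed to the GCN.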

Performance comparison
After applying the proposed FRS to GCN and SelfSAGCN, Table 3 reports the models' performance on the three citation network datasets. It is clear from Table 3 that our proposed recommendation strategy improves the performance of GCN by about 1.6%, 0.4%, and 0.4% on the three datasets, respectively. In addition, we apply FRS to the latest GCN method, SelfSAGCN, and SelfSAGCN+FRS achieves the most advanced performance compared with the other feature-improvement methods.

Effect of embedding size
We test the impact of different model complexities on the recommendation effect by using embedding representations of different sizes; Figure 5 shows the results of the performance tests on the Cora dataset. The embedding representation size of the recommendation module is set to E = [8, 16, 32, 64, 128], while the maximum number of features to be updated is kept at around 500.

Table 3. Node classification accuracy (%) on the three datasets.

Method                              Cora          Pubmed        Citeseer
DeepWalk (Perozzi et al., 2014)     67.2          65.3          43.2
GCN+RCF (Zhu et al., 2020)          82.7          –             –
GIN (Xu et al., 2018)               78.5 ± 1.9    78.7 ± 1.6    68.9 ± 2.0
GCN (Kipf & Welling, 2016)          79.1 ± 1.8    77.6 ± 2.0    69.7 ± 2.0
Cross-GCN (Feng et al., 2021)       78.9 ± 1.6    79.3 ± 1.8    71.3 ± 1.7
SelfSAGCN (X. Yang et al., 2021)    83.8 ± 0.5    80.7 ± 1.5    73.

From the experimental results, we can draw the following conclusions:
• In all cases, the performance of the model with the recommendation strategy is significantly better than that of GCN alone, which further illustrates the effectiveness of our proposed recommendation strategy.
• GCN+FRS (GCN+FRS2) with E = 16 exhibits the best performance, consistently outperforming embedding representations of other sizes.

Different number of sample_set
This subsection tests the effect of different sample_set sizes on model performance. Figure 6 shows the classification task on the Cora dataset with different numbers of sample sets. From the observed results, we draw the following conclusions:
• Overall, better performance is obtained with the recommendation strategy than without it, which illustrates the effectiveness of the strategy from this perspective.
• The model performs best on the Cora dataset when the sample_set value equals 12.
• The classification performance generally shows an increasing trend as the sample_set value increases from 1 to 12, but beyond 12 (and below 19) the performance starts to decrease.
We tested the prediction of the feature matrix using the recommendation module alone, obtaining an accuracy of 74.1%; the results show that the predicted features contain noise. When the correct number of features is recommended, the positive impact of the recommendation strategy on the classification results outweighs the negative impact; conversely, when the number of recommended features exceeds a certain value, the noise dominates and the result is the opposite.

Hyperparametric analysis
We also investigated the effect of hyperparameters on the recommendation effectiveness of the recommendation module. We chose the dropout parameter as an example to illustrate that this parameter affects the results of the classification task by influencing the recommended features. Figure 7 shows the effect of this parameter on the model's performance when increasing the dropout from 0.1 to 0.7 on the Cora dataset.

Figure 7. Hyperparametric analysis. GCN+FRS1 denotes the MPFR method alone operating on GCN, whereas GCN+FRS2 represents the combination of the MPFR and LWFR methods working on GCN.

Table 4. Performance (%) of GAT with the feature recommendation strategy.

Model                            Cora          Citeseer      Pubmed
GAT (Veličković et al., 2017)    83.0 ± 0.7    72.5 ± 0.7    79.0 ± 0.3
GAT+FRS1                         83.5 ± 0.6    72.7 ± 0.9    79.3 ± 0.6
GAT+FRS2                         83.6 ± 0.7    72.9 ± 1.1    79.2 ± 0.9

From the results, we have the following analysis:
• When the dropout value is greater than 0.5, the recommendation module fails to obtain the maximum possible features of the node.
• The best performance of the GCN+FRS model (GCN+FRS2) is obtained when the dropout value equals 0.5. This result is reasonable, since the largest variety of randomly generated network structures is available under this condition.
• GCN+FRS1 performs more consistently than GCN+FRS2. This is because GCN+FRS1 solely employs the MPFR method, which keeps each node's original features, whereas GCN+FRS2 combines the MPFR and LWFR methods and replaces original features with predicted ones; when too many features are replaced, the noise affects the classification results.

Portability and generalisability
We experiment on another representative model, GAT, to test the portability and generalisation ability of our proposed recommendation strategy. The experimental results are shown in Table 4. As can be seen, the feature recommendation strategy also improves the performance of the GAT model, demonstrating the portability and generalisation ability of the strategy.

Limitations
In this section, we describe several limitations of the proposed model.
Rich information on node and feature interactions
The current model treats the interaction between a node and a feature as binary {0, 1}, i.e. whether or not the node possesses the feature. However, the interaction relationships in real datasets carry richer correlations. For example, in the citation network datasets, each paper is represented by the bag of words it contains, ignoring other information such as word frequency.
Feature-to-feature interaction
The current model focuses on only three interactions: node-to-node, node-to-feature, and feature-to-node; however, feature-to-feature interactions should behave similarly to those between nodes. Modelling feature-to-feature interaction could achieve feature recommendation with better performance.

Conclusion
In this paper, we proposed a feature recommendation strategy (FRS) for graph convolutional networks. It provides two feature recommendation methods, maximum possible feature recommendation (MPFR) and low-weight feature replacement (LWFR), for updating node features and improving the performance of the GCN model. Experiments are conducted on three datasets, and the results demonstrate the effectiveness, portability, and generalisability of the recommendation strategy. In the future, we will investigate how to integrate feature recommendation methods into different heterogeneous graph neural frameworks to achieve better message passing and generate more effective node representations.