Multi-Semantic Alignment Graph Convolutional Network

ABSTRACT Graph Convolutional Networks (GCNs) are a powerful emerging deep learning technique for learning from graph data. However, some challenges remain for GCN. For example, the model must stay shallow, and performance degrades when labelled nodes are severely scarce. In this paper, we propose a Multi-Semantic Aligned Graph Convolutional Network (MSAGCN), which contains two fundamental operations, multi-angle aggregation and semantic alignment, to resolve both challenges simultaneously. The core of MSAGCN is to aggregate nodes that belong to the same class from three perspectives, nodes, features, and graph structure, and to map the resulting node features close to one another. Specifically, multi-angle aggregation extracts features of the labelled nodes from three angles, and semantic alignment aligns the semantics of the extracted features to reinforce the content that the different angles share. In this way, the over-smoothing and over-fitting problems of GCN can be alleviated. We perform the node classification task on three citation datasets, and the experimental results demonstrate that our method outperforms the state-of-the-art (SOTA) baselines.

The basic process of GCN is to aggregate node information iteratively from local graph neighbourhoods and propagate it through the graph after feature transformation. GCN methods are usually grouped into two families: spatial convolution and spectral convolution. The spatial convolution method defines the convolution operation on the spatial relations of the nodes, learning and updating each representation from its neighbourhood. For example, Hamilton et al. (2017) proposed a generic inductive framework, GraphSAGE, which exploits node feature information to generate node embeddings for unseen data. Atwood and Towsley (2016) designed diffusion convolutional neural networks (DCNN) that learn diffusion-based representations from graph-structured data and utilise them for node classification. Monti et al. (2017) suggested a unified framework called MoNet that generalises convolutional neural network architectures to non-Euclidean domains. In addition, Veličković et al. (2017) introduced the graph attention network (GAT), which assigns personalised weights to neighbouring nodes by stacking layers in which nodes attend over their neighbourhoods' features.
On the other hand, the spectral convolution method implements the convolution operation on the graph topology using spectral graph theory. For example, Bruna et al. (2013) first proposed a CNN extended to graphs based on spectral graph theory. Then, Defferrard et al. (2016) suggested the ChebNet model, which reduces the computational complexity by defining the filter as a Chebyshev polynomial of the diagonal matrix of eigenvalues of the graph Laplacian. Based on this work, Kipf and Welling (2016) introduced a simple and effective layer-wise propagation rule that uses a first-order approximation to simplify the calculation. In addition, Wu et al. (2019) designed a simple graph convolution (SGC) that captures higher-order information in a graph with a single linear transformation.
Despite its remarkable success, GCN has two significant limitations. The first limitation is that GCN performance weakens when there are few labelled nodes in each class, which can easily lead to over-fitting. GCN propagates features through the graph structure, and labels cannot propagate across the whole graph when labelled nodes are sparse. The second limitation is that most existing GCNs are shallow; Kipf and Welling (2016) showed that optimal performance was obtained with a 2-layer model. Q. Li et al. (2018) explained that the essence of GCN is to make a linear combination of each node's own features and its neighbours' features. When too many layers are stacked, node features aggregate too many neighbouring features, which leads to the over-smoothing problem, i.e. the nodes become similar to each other and indistinguishable across classes. M. Liu et al. (2020) demonstrated that GCN can suffer from indistinguishable node features while expanding the receptive field by stacking multiple hidden layers.
Graph structure-based propagation delivers the labelled nodes' features to the unlabelled nodes, making the features of nodes in the same class similar. Unfortunately, GCN's graph propagation is insufficient and often leads to over-fitting and over-smoothing. In recent research, the semantic information of nodes has attracted particular interest. For example, M. Liu et al. (2020) proposed the Deep Adaptive Graph Neural Network (DAGNN), which decouples representation transformation from propagation to self-adaptively merge semantic information from large receptive fields. Chen et al. (2020) suggested an extension of the vanilla GCN model called GCNII, which employs identity mapping to store the input information directly. Pei et al. (2020) designed a geometric aggregation scheme (Geom-GCN), which maps nodes to vectors in a continuous space, finds and aggregates neighbours there, and captures long-range dependencies in the graph. M. Liu et al. (2021) introduced a non-local aggregation method for capturing remote dependencies from node features. In addition, X. Yang et al. (2021) proposed a self-supervised semantic alignment graph convolutional network (SelfSAGCN) that extracts semantic information from labelled nodes and overcomes the over-smoothing problem by aligning the node features obtained from various angles. Inspired by this work, we attempt to explore feature semantics from three perspectives, node, feature, and graph structure, alleviating the over-fitting and over-smoothing problems.
In this paper, we propose the Multi-Semantic Aligned Graph Convolutional Network (MSAGCN), which contains two fundamental operations: multi-angle aggregation and semantic alignment. Specifically, we first aggregate and capture semantic information layer by layer for labelled nodes from three perspectives, node, feature, and graph structure, and then a semantic alignment operation makes the learned semantic mappings similar. Multi-angle aggregation can obtain different semantics from multiple contextual settings, and semantic alignment reinforces what the different semantics share. In this way, the problem of over-smoothing can be effectively mitigated. Notably, when labelled nodes are severely scarce, semantic alignment can transfer the semantics learned from labelled nodes to unlabelled nodes, further improving the model's performance.
In summary, the main contributions of our work are two-fold: • We propose a Multi-Semantic Alignment Graph Convolutional Network model that uses multi-angle aggregation and semantic alignment techniques to mitigate the over-fitting and over-smoothing problems. • We evaluate MSAGCN on three citation network datasets, and the experimental results demonstrate that it outperforms SOTA methods on the classification task.

Related work
Usually, GCN maps the features of nodes into a low-dimensional space through multiple graph convolution layers. The mapping consists of two primary operators: the node aggregation operator and the feature transformation operator. The node aggregation operator enhances the representation of the target node by fusing the features of neighbouring nodes. For example, different aggregation methods have been proposed for different connection behaviours, such as local node similarity (Kipf & Welling, 2016) and structural similarity (Donnat et al., 2018). In addition, the sampled content can be all neighbouring nodes (Kipf & Welling, 2016) or a fixed number of neighbouring nodes (Hamilton et al., 2017). It is worth noting that different aggregation methods obtain different information. For example, the average pooling operation obtains common attributes (Kipf & Welling, 2016), whereas the maximum pooling operation obtains the most salient features (Hamilton et al., 2017). The feature transformation operator maps the input features into a low-dimensional space, and its transformation method is one of the topics of current research, such as feature cross-fusion (Feng et al., 2021) and random feature overlay (Zhu et al., 2020). GCN's current primary research approach is to take full advantage of the graph's topology, node features, and label information. For example, Qin et al. (2021) jointly enhanced model performance with given and estimated labels. Feng et al. (2021) designed a cross-feature graph convolution operator for arbitrary-order feature cross-modelling. L. Yang et al. (2019) exploited the relationship between topology and features to leverage latent information by jointly optimising the network topology and learning the parameters of a fully connected network. Qin et al. (2022) utilised a feature recommendation strategy to optimise node features and improve model performance. In addition, research on the semantic information of nodes has received increasing attention.
For example, Pei et al. (2020) and M. Liu et al. (2021) proposed non-local aggregators to capture the remote dependencies of node features. Lin et al. (2020) incorporated metric learning into a graph-based semi-supervised learning paradigm. X. Yang et al. (2021) extracted semantic information from labelled nodes and then assisted label propagation through alignment operations. In addition, many graph representation learning methods utilise node attributes to enhance node feature representations to address the problem of sparsely labelled data (J. H. Li et al., 2021; Pan et al., 2021).
Recently, some work has attempted to mitigate or solve the over-smoothing problem so that model depth is not limited to shallow architectures. Q. Li et al. (2018) proved that GCN processing is a particular symmetric form of Laplacian smoothing, whose essence is to make a linear combination of each node's own features and its neighbours' features. When layers are over-stacked, node features aggregate too many neighbouring features, resulting in the over-smoothing problem, in which nodes become similar. Xu et al. (2018) utilised layer aggregation to enable the final representation of nodes to adaptively fuse information from different layers. Rong et al. (2019) removed edges from the graph at random to alleviate the effects of over-smoothing. Klicpera et al. (2018) relieved the over-smoothing phenomenon by employing the Personalised PageRank matrix. Klicpera et al. (2019) generalised Personalised PageRank to arbitrary graph diffusion processes. In addition, X. Yang et al. (2021) designed an identity aggregation method to capture semantic information from nodes with true labels and employed a semantic alignment operation to guide graph propagation. Based on the work of X. Yang et al. (2021), we extract semantic information from three perspectives, nodes, features, and graph structure, to assist the propagation of labels in the graph, further alleviating the problems of over-smoothing and over-fitting. It is worth noting that aggregating features from different angles and aligning the semantics can strengthen the similar parts and obtain a better representation of node embeddings.

Preliminary knowledge
Before describing the details of the model, we initially present some of the notations used in each section. Bold uppercase X and lowercase x denote matrices and vectors, respectively, and all vectors are in column form. We provide a list of commonly used terms and symbols in Table 1. Table 1. Symbols commonly used in our paper.

Symbols          Descriptions
d_m              The representation dimension of the nodes at the mth layer.
X ∈ R^(n×d_0)    The feature matrix of the graph.
A ∈ R^(n×n)      The adjacency matrix of the graph.
W^(m)            The trainable weight matrix.
I_n              The identity matrix.
D ∈ R^(n×n)      The degree matrix.
Y ∈ R^(n×c)      The label matrix.
U                The excess matrix.
C_j(·)           The class-centre of class j.
sed(·)           The function that calculates the squared Euclidean distance between matrices.
Then, we briefly review the basics of GCN (Kipf & Welling, 2016). A graph G can usually be expressed as G = (X, A), where X = [x_1, x_2, . . . , x_n]^T ∈ R^(n×d_0) denotes the feature matrix and A ∈ R^(n×n) the adjacency matrix. A encodes whether there is a connection between node i and node j: A_ij = 1 if there is a connection, and A_ij = 0 otherwise. The basic process of GCN is to map the input into a low-dimensional space by stacking several hidden layers and then connecting an output layer that performs the learning task. Given X and A, the propagation process of GCN can be described as

H^(m+1) = σ( D̃^(−1/2) Ã D̃^(−1/2) H^(m) W^(m) ),    (1)

where H^(m+1) ∈ R^(n×d_(m+1)) represents the input of the (m+1)th hidden layer, which is the output of the mth hidden layer transformed by the nonlinear activation function σ(·), and W^(m) ∈ R^(d_m×d_(m+1)) indicates the trainable weight matrix of the mth layer. In addition, Ã = A + I_n and D̃_ii = Σ_j Ã_ij, where I_n and D̃ indicate the identity matrix and the degree matrix, respectively. For ease of description, let Â = D̃^(−1/2) Ã D̃^(−1/2); the propagation rule of GCN can then be written as

H^(m+1) = σ( Â H^(m) W^(m) ).    (2)

The feature representation H^(M) of the nodes is obtained after stacking multiple hidden layers, and the final output for the node classification task is then obtained through the softmax activation function. Let H_out = softmax(H^(M)) ∈ R^(n×c), where c is the number of classes in the specific task. In addition, let the label matrix be Y ∈ R^(n×c), where Y_ik = 1 means the label of node i is k, and Y_ik = 0 otherwise. In order to train the weight matrices W^(m), the loss function can be defined as the cross-entropy of the classification task:

L = − Σ_{i∈R} Σ_{k=1}^{c} Y_ik ln H_out,ik,    (3)

where R denotes the set of nodes with actual labels.
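As a concrete illustration, the renormalised propagation rule above can be sketched in a few lines of NumPy. This is a minimal sketch on a toy graph; the graph, feature dimensions, and random weights are illustrative, not from the paper.

```python
import numpy as np

def normalize_adj(A):
    # A_hat = D~^{-1/2} (A + I_n) D~^{-1/2}, the renormalised adjacency
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W):
    # One propagation step: H^{(m+1)} = ReLU(A_hat H^{(m)} W^{(m)})
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy graph: a 4-node path, 3-dimensional features, 2 hidden units
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

A_hat = normalize_adj(A)
H1 = gcn_layer(A_hat, X, W)  # node representations after one GCN layer
```

Note that A_hat is symmetric by construction, so the propagation is a symmetric smoothing of the node features, consistent with the Laplacian-smoothing view discussed later.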

Multi-semantic aligned graph convolutional network
In this section, we propose the Multi-Semantic Aligned Graph Convolutional Network (MSAGCN). The single-layer processing flow of the model is illustrated in Figure 1. Semantic information is first obtained by aggregation from three perspectives, nodes, features, and graph structure, and is then aligned via class-centred similarity; the two operations are jointly utilised to address the over-smoothing and over-fitting problems of GCN.

Multi-angle aggregation
MSAGCN has three different inputs to implement the multi-angle aggregation operations. The first perspective is graph-structure aggregation, i.e. aggregation of features from the graph-structure perspective. Given the adjacency matrix A and the feature matrix X, the propagation rule can be described as

H^(m+1) = σ( Â H^(m) W^(m) ), with H^(0) = X.    (4)

The second perspective is feature aggregation, which aggregates features from the feature perspective only, without graph propagation. Given the feature matrix X, its propagation rule can be described as

F^(m+1) = σ( F^(m) W^(m) ), with F^(0) = X.    (5)

The third perspective is node aggregation, i.e. the nodes are aggregated from the feature perspective, and the node features for the current layer are obtained after processing by the CM (conversion) module. Given the feature matrix X, its propagation rule can be described as

G^(m+1) = σ( I_g G^(m) Q^(m) ), with G^(0) = X^T,    (6)

where I_g ∈ R^(d_0×d_0) denotes the identity matrix (d_0 indicates the dimensionality of the node features) and Q^(m) ∈ R^(l_m×l_(m+1)) represents the trainable parameter matrix. X^T means transposing the feature matrix, and G^(m)_cm is the conversion, through the CM module with the excess matrix U ∈ R^(d_0×d_0), of the feature representation of the corresponding layer into a node representation. In addition, when m ≠ 0, d_m = l_m; when m = 0, d_0 and l_0 represent the feature dimension and the number of nodes, respectively.
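The three branches can be sketched together as follows. This is a hedged NumPy sketch of our reading of Equations (4)-(6): the toy shapes, the identity initialisation of the excess matrix U, and the exact form of the CM conversion step back to node space are illustrative assumptions, not the paper's verified implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0, hidden = 4, 3, 2            # nodes, feature dimension, hidden width
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(n, d0))

def relu(Z):
    return np.maximum(Z, 0.0)

# Renormalised adjacency, as in vanilla GCN
A_tilde = A + np.eye(n)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

W = rng.normal(size=(d0, hidden))  # shared by the H and F branches
Q = rng.normal(size=(n, hidden))   # branch-specific weights for G (l_0 = n)
U = np.eye(d0)                     # "excess" matrix; identity init is an assumption

H1 = relu(A_hat @ X @ W)   # graph-structure aggregation, Eq. (4): shape (n, hidden)
F1 = relu(X @ W)           # feature aggregation without propagation, Eq. (5): (n, hidden)
G1 = relu(X.T @ Q)         # node aggregation on transposed features, Eq. (6): (d0, hidden)
G1_cm = X @ U @ G1         # assumed CM step: back to a per-node representation (n, hidden)
```

The key structural point is visible in the shapes: the H and F branches both produce per-node features, while the G branch operates on the transposed feature matrix and needs the CM conversion to return to node space.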
In summary, the outputs of MSAGCN are the features H^(M), F^(M), and G^(M), and the final outputs obtained with the softmax activation are H_out = softmax(H^(M)) ∈ R^(n×c), F_out = softmax(F^(M)) ∈ R^(n×c), and G_out = softmax(G^(M)) ∈ R^(n×c). Finally, we train the parameters of the model by minimising the cross-entropy loss:

L_Semi = − Σ_{i∈R} Σ_{k=1}^{c} Y_ik ( ln H_out,ik + ln F_out,ik + ln G_out,ik ),    (7)

where R represents the set of nodes with actual labels, and Y_ik indicates the relationship between node i and label k: Y_ik = 1 if the label of node i is k, and Y_ik = 0 otherwise. As the number of stacked layers increases, the features of nodes become similar or identical after transformation. The second and third terms in L_Semi can effectively alleviate this situation, since they come from aggregations over different perspectives.
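A minimal sketch of the three-term cross-entropy L_Semi over the labelled set R (hedged: the logits, labels, and the labelled index set here are toy placeholders, not values from the paper):

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(P, Y, R):
    # -sum_{i in R} sum_k Y_ik ln P_ik over the labelled nodes R
    return -np.sum(Y[R] * np.log(P[R] + 1e-12))

rng = np.random.default_rng(0)
n, c = 6, 3
Y = np.eye(c)[rng.integers(0, c, size=n)]   # one-hot labels (toy)
R = np.array([0, 2, 4])                     # indices of labelled nodes (toy)

# Logits of the three branches after the final layer (toy placeholders)
H_logits, F_logits, G_logits = (rng.normal(size=(n, c)) for _ in range(3))

# L_Semi sums one cross-entropy term per branch
L_semi = (cross_entropy(softmax(H_logits), Y, R)
          + cross_entropy(softmax(F_logits), Y, R)
          + cross_entropy(softmax(G_logits), Y, R))
```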

Semantic alignment
Aggregation is performed from three perspectives, graph structure, features, and nodes, with H^(m), F^(m), and G^(m)_cm denoting the node features after convolution, respectively. Since the convolutions of the three angles operate on the same feature matrix, we assume that, in the ideal case, the features of nodes in the same class should be similar after aggregation. In addition, we utilise the same network parameters in the last layer of the model to enable G^(M) to obtain guidance from H^(M) and F^(M). Since neither F^(m) nor G^(m) suffers from over-smoothing when more layers are stacked, we adopt F^(m) and G^(m)_cm as the guiding semantics for H^(m). Meanwhile, let F^(m)_R and G^(m)_cm,R denote the semantic information extracted from R, and utilise the semantic alignment operation to map the three-perspective semantics of nodes in the same class close to one another.
In semi-supervised learning, since the labels of most nodes are unknown, we utilise pseudo-labels to achieve semantic alignment during model training. We adopt the corresponding actual labels for F^(m)_R, G^(m)_cm,R, and the nodes with actual labels in H^(m) as the class information. In contrast, for the unlabelled nodes in H^(m), the MSAGCN model assigns pseudo-labels and employs class-centred similarity to alleviate their negative influence. The class-centre of class j can be described as

C_j(H^(m)) = (1/n_j) Σ_{i: y_i = j} h_i^(m),    (8)

where n_j is the number of nodes whose (pseudo-)label is j and h_i^(m) is the ith row of H^(m); C_j(F^(m)_R), C_j(H^(m)), and C_j(G^(m)_cm,R) denote the class-centres of class j belonging to F^(m)_R, H^(m), and G^(m)_cm,R, respectively. The sed(·) function calculates the squared Euclidean distance between matrices. As in X. Yang et al. (2021), we adopt class-centred alignment to mitigate the noise of H^(m).
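The class-centres and the sed(·) distance can be computed as follows (a small sketch; how the pseudo-labels are obtained, e.g. by taking the argmax of H_out, is our assumption about the implementation):

```python
import numpy as np

def class_centres(H, labels, c):
    # C_j = mean of the feature rows whose (pseudo-)label is j
    return np.stack([H[labels == j].mean(axis=0) for j in range(c)])

def sed(C1, C2):
    # Squared Euclidean distance between two centre matrices
    return float(np.sum((C1 - C2) ** 2))

# Toy features and (pseudo-)labels for two classes
H = np.array([[1.0, 0.0],
              [3.0, 0.0],
              [0.0, 2.0],
              [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])
C = class_centres(H, labels, 2)  # class 0 centre [2, 0], class 1 centre [0, 3]
```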
The overall framework of our proposed MSAGCN is illustrated in Figure 2. It utilises three semantic alignments per layer to mitigate the noise during graph propagation and thereby alleviate the over-smoothing problem. In addition, the class-centred similarity for labelled and unlabelled nodes provides supervised information for the classification task and improves the model's classification performance. The loss function for semantic alignment can be described as

L_align = Σ_m Σ_{j=1}^{c} [ sed( C_j(F^(m)_R), C_j(H^(m)) ) + sed( C_j(G^(m)_cm,R), C_j(H^(m)) ) + sed( C_j(F^(m)_R), C_j(G^(m)_cm,R) ) ].    (9)

Combining the losses of both classification and semantic alignment, the total objective function is

L = L_Semi + λ L_align,    (10)

where λ is a balance weight. In order to improve the stability of the class-centres constructed from pseudo-labels in the classification task, we first compute the respective class-centres C_j(F^(m)_R), C_j(G^(m)_cm,R), and C_j(H^(m)) from the current layer features F^(m)_R, G^(m)_cm,R, and H^(m) during each iteration. Then, the class-centre of the last iteration is added, with weighting, to the class-centre of the current iteration to ensure stability:

C_j ← α C_j^(last) + (1 − α) C_j^(current),    (11)

where α ∈ [0, 1) denotes the weighting factor. Finally, the MSAGCN algorithm is shown in Algorithm 1.
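The stabilising update of the class-centres can be sketched as a moving average (the weighting direction, old centre weighted by α, is our assumption; α = 0.7 matches the value used in the experiments):

```python
import numpy as np

alpha = 0.7  # class-centred weighting factor, as in the experimental setup

def update_centres(C_prev, C_now, alpha):
    # Blend the previous iteration's centres with the current ones for stability
    return alpha * C_prev + (1.0 - alpha) * C_now

C_prev = np.array([[2.0, 0.0], [0.0, 3.0]])  # centres from the last iteration (toy)
C_now = np.array([[4.0, 0.0], [0.0, 1.0]])   # centres from the current features (toy)
C = update_centres(C_prev, C_now, alpha)     # [[2.6, 0.0], [0.0, 2.4]]
```

The high α keeps the centres from jumping when noisy pseudo-labels flip between iterations.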

Further analysis
The features of a node in an ideal environment can be utilised to determine its class. For the semi-supervised classification task, graph structure-based propagation can pass semantic information from labelled to unlabelled nodes because we assume that neighbouring nodes usually belong to the same class. However, M. Liu et al. (2020) proved that the class of a node is determined by its features rather than its topology. Therefore, based on X. Yang et al. (2021), we add the filtering of nodes from the perspective of features and then obtain the node features from another perspective by transformation. Finally, we adopt the semantic alignment operation to constrain feature learning during graph propagation.

Figure 2. The overall framework of MSAGCN. We obtain node features by aggregation from three perspectives: graph structure, features, and nodes, and then achieve similar features from the above perspectives layer by layer with semantic alignment.

Algorithm 1 Multi-Semantic Alignment for Graph Convolutional Network.
Input: the adjacency matrix, A; the feature matrix, X; the labelled nodes, Y_R; the number of layers and classes, M and c; the hyperparameters, λ and α;
while not converged do
    Calculate H^(m) by Equation (4);
    Calculate F^(m) by Equation (5);
    Calculate G^(m) by Equation (6);
    Calculate the class-centre of each class by Equation (11);
    Update all the parameters by minimising Equation (10);
end while
Output: the final output for the node classification task, H_out;
Moreover, we utilise semantic alignment operations to ensure that all obtained features are similar. In summary, we extract node features from different perspectives and drive all feature distributions to converge, which provides more constraints on node feature learning and improves performance.

Experiments
In this section, we will evaluate the performance of the proposed model by conducting extensive experiments on three benchmark citation datasets. In addition, we will provide some visualisations to help illustrate the results.

Datasets
We evaluate the proposed method on three benchmark citation datasets: Cora (McCallum et al., 2000), Citeseer (Giles et al., 1998), and Pubmed (Sen et al., 2008). A citation network represents the citation relationships between documents. Nodes and labels represent documents and their topics, the features of nodes are bags of words from the document content, and edges represent the mutual references between documents.
• Cora has 5429 edges and 2708 nodes. Seven class labels exist for nodes, with a 1433-dimensional feature vector for each node.
• Citeseer has 4732 edges and 3327 nodes. Six class labels exist for nodes, with a 3707-dimensional feature vector for each node.
• Pubmed is a more extensive citation network containing 44,338 edges and 19,717 nodes. Three class labels exist for nodes, with a 400-dimensional feature vector for each node.

Baseline
We evaluate the model's performance in two aspects, so the baselines are divided into two categories. The first category is that the number of labelled nodes is severely scarce, and the second category is that stacking multiple layers leads to an over-smoothing phenomenon. Category 1: the number of labelled nodes is severely scarce.
• DropEdge (Rong et al., 2019) randomly removes edges from the graph to slow the convergence of over-smoothing.
• ResGCN (G. Li et al., 2019) borrows from CNNs by introducing residual/dense connectivity and dilated convolutions to build deep graph convolutional networks.
• JKNet (Xu et al., 2018) proposes jumping knowledge (JK) networks, allowing each node to flexibly exploit different neighbourhood ranges to achieve a better structure-aware representation.
• IncepGCN (Rong et al., 2019) extends graph convolutional networks by decomposing convolution and regularisation.
• GCNII (Chen et al., 2020) is an extension of the vanilla GCN that utilises identity mapping and initial residuals to mitigate the over-smoothing problem.
• SelfSAGCN (X. Yang et al., 2021) learns the features of nodes from the same class in terms of both graph structure and semantics and maps them nearby.

Setup
Our implementation of MSAGCN uses PyTorch (Note 1) and adopts the public code of SelfSAGCN (Note 2). Our parameter settings are identical to those of SelfSAGCN (X. Yang et al., 2021) to ensure a fair comparison of the experimental results. We employ the Adam optimiser, with the learning rate set to 0.01, the dropout rate set to 0.5, the weight decay set to 5e−4, and the class-centred weighting factor α set to 0.7. The balance weight is scaled as λ ← λ(2/(1 + e^(−10p)) − 1) to mitigate the noise of pseudo-labels in the early stage of training, where p denotes the training progress.
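The λ ramp-up can be written as follows (assuming p is the normalised training progress in [0, 1] and lam_max is the target weight; the function name is ours):

```python
import math

def lambda_schedule(lam_max, p):
    # lambda <- lam_max * (2 / (1 + exp(-10 p)) - 1):
    # 0 at p = 0, monotonically approaching lam_max as p -> 1
    return lam_max * (2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0)
```

This is the same sigmoid ramp commonly used in gradient-reversal training: the alignment loss contributes almost nothing at the start, when pseudo-labels are noisiest, and reaches nearly full weight by mid-training.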

Performance comparison
We evaluate the performance of MSAGCN in the usual setting (i.e. 20 labelled nodes per class). As shown in Table 2, our model performs relatively well in this setting.

Model performance with different numbers of labelled nodes
For the completeness of the experiment, we evaluate the model's performance with various numbers of labelled nodes. Figure 4 illustrates the results in more detail.
From Figure 4, we have the following observations: • The performance increases significantly for both Cora and Pubmed datasets as the number of labelled nodes increases.  • The model performance improves dramatically as the number of labelled nodes per class increases from 1 to 2 in the Citeseer dataset. However, the performance increase is relatively small when the number of labelled nodes per class increases from 2 to 20.
From these observations, it can be concluded that MSAGCN performs well when labelled nodes are severely scarce. In addition, the model has relatively poor performance improvement on the Citeseer dataset due to its sizeable feature-to-node ratio, i.e. each feature corresponds to relatively few nodes.

Parameter analysis
In this section, we analyse the parameters in MSAGCN. Figure 5(a,b) shows the influence of the two hyperparameters on model performance. From Figure 5, we make the following observations: • Our model is not sensitive to α within the range [0.3, 0.9], based on investigating different balance weights α in Figure 5(a). • Once λ exceeds 0.55 in a two-layer model, it has little effect on model performance, as shown in Figure 5(b). • As shown in Figure 5, our model's performance is insensitive to the two hyperparameters when there are more labelled nodes per class.

Model performance with multiple layers
In this section, we evaluate the performance of the model with different numbers of hidden layers. Table 4 shows the model's performance with different depths; following the SelfSAGCN setup, we apply 20 labelled nodes per class for training. According to Table 4, our algorithm performs best with the 4-layer model. The results illustrate that our proposed model can strengthen the similar parts by aggregating features from different perspectives and aligning the semantics, which partly alleviates the over-smoothing problem.

Limitations
In this section, we will discuss some of the limitations that exist in the model.
Rich information contained in the connections
The current model extracts features from multiple angles but considers only whether a feature is present, ignoring other important information such as the frequency of feature occurrence. In addition, extracting the semantic information of a node from the feature perspective considers only whether the node possesses the feature, ignoring the structural information between nodes, i.e. the potential nodes contained in the feature.
Feature-to-feature interaction
The model considers the cross-reference relationships between nodes but ignores possible relationships between features, for example, features that indicate similar or opposite meanings, or the range a feature covers.

Conclusion
In this paper, we propose a Multi-Semantic Alignment Graph Convolutional Network (MSAGCN) to address the issues of over-fitting and over-smoothing. The method extracts semantic information from labelled nodes layer by layer with a multi-angle aggregation operation and then processes the obtained node features with a semantic alignment operation, effectively alleviating the over-fitting and over-smoothing of GCN. MSAGCN also contributes to the study and application of network topology. In addition, we build class-centres for unlabelled nodes with assigned pseudo-labels and gradually revise them to mitigate noise. As a result, the node features extracted from different angles are forced to converge, improving node classification. We evaluate the model on three benchmark datasets, demonstrating that our method outperforms SOTA methods on classification tasks. In the future, we will apply MSAGCN to heterogeneous graphs, where multi-angle aggregation can more easily identify vital features in heterogeneous relationships.

Notes
1. https://pytorch.org
2. https://github.com/xdxuyang/SelfSAGCN

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX21_0484).