A method to improve the computational performance of the nonlinear all-optical diffractive deep neural network model

Abstract To further improve the computational performance of the diffractive deep neural network (D2NN) model, we use the ReLU function to limit the phase parameters, which effectively mitigates the vanishing gradient problem that occurs in the model. We add various commonly used nonlinear activation functions to the hidden layer of the model and establish the ReLU phase-limit nonlinear diffractive deep neural network (ReLU phase-limit N-D2NN) model. We evaluate the model by comparing the performance of the various nonlinear activation functions, using the confusion matrix and accuracy as evaluation methods. The numerical simulation results show that the model achieves good classification performance on the MNIST and Fashion-MNIST datasets. In particular, the highest classification accuracies, 98.38% and 90.14% respectively, are obtained by the ReLU phase-limit N-D2NN model whose hidden layer uses PReLU. This paper provides a theoretical basis for applying nonlinear D2NN systems to natural scenes.


Introduction
The diffractive deep neural network (D2NN) [1] is a relatively new approach to optical neural networks (ONN), first proposed in a 2018 publication by Lin et al. The team used a terahertz light source as input and built an all-optical diffractive deep neural network (D2NN) model using Rayleigh-Sommerfeld diffraction theory. They optimized the model parameters using a stochastic gradient descent algorithm, then fabricated the diffractive elements using 3D printing technology and constructed a D2NN system that can classify images of handwritten digits and fashion products. D2NN has the following advantages: (1) the whole system adopts passive devices, so its power consumption is low; (2) the system adopts the principle of optical diffraction, fully connecting hundreds of millions of neurons through hundreds of billions of connections, so the system runs at the speed of light. Since the original D2NN was proposed, many scholars have carried out extensive research on improving its computing performance and application scenarios. For example, D2NN can be applied in scenarios such as beam conversion, [2] frequency control, [3] and mode differentiation. [4] In particular, we describe in detail in Section 2 a series of related works on improving D2NN computational performance, including adding activation functions to the hidden layer of the model and using nonlinear optical materials to make diffraction gratings for the D2NN framework. With the help of the D2NN framework, researchers can address tasks such as image analysis, feature detection, and object classification.
In previous studies, the original D2NN model used the Sigmoid function to constrain the phase, which suffers from the vanishing gradient problem during error backpropagation. To solve this problem, we propose a ReLU phase-limit N-D2NN model in this paper. We use the ReLU function to limit the phase factor in the model and verify the effectiveness of the ReLU phase-limit N-D2NN model by numerical simulation, in which the hidden layer incorporates various common activation functions. Because ReLU has a constant gradient of 1 in the positive interval, in contrast to Sigmoid, it can better update the parameters of the D2NN model. Furthermore, the ReLU function is not exponential, so considerable derivative computation is saved during backpropagation of the error. We detail the advantages of choosing the ReLU function to limit the phase factor in Section 3.1. Notably, the proposed method further improves the computational performance of D2NN: the ReLU phase-limit N-D2NN achieves 98.38% accuracy on the MNIST dataset and 90.14% accuracy on the Fashion-MNIST dataset, which is the highest accuracy known for the D2NN framework. In summary, this study proposes an approach to alleviate vanishing gradient in the N-D2NN model and evaluates the inference performance of the ReLU phase-limit N-D2NN model, where various nonlinear activation functions are added to the hidden layer of the model.
The rest of this paper is organized as follows. In Section 2, we report on work related to D2NN. In Section 3, we describe the principle of the method for alleviating vanishing gradient, the ReLU phase-limit N-D2NN model, and the nonlinear activation functions of the model. In Section 4, we present the numerical experiment results of the ReLU phase-limit N-D2NN model with a nonlinear activation function, including the introduction of the datasets, the evaluation method, and the selection of hyperparameters, and compare the computational performance with other nonlinear D2NN models. In Section 5, we state the conclusion.

Nonlinear study of D2NN
The study of optical nonlinearity has been an important part of D2NN. A core unit of neural network design is the nonlinear activation function, which endows neurons with the ability to self-adapt. The nonlinearity of D2NN has been discussed in a series of previous studies. [5,6] Later, many researchers continued to explore and apply the nonlinearity of D2NN. In July 2019, Yan et al. introduced the Fourier-space diffractive deep neural network (F-D2NN), which applies diffractive-layer modulation through the Fourier transform of the optical system; the nonlinearity is implemented with a ferroelectric thin film, and the method is effective for high-precision visual saliency detection and target classification. [7] In June 2020, Zhou et al. proposed an optical error backpropagation framework for in situ training of linear and nonlinear D2NN, which can accelerate training and improve energy efficiency. [8] In February 2021, our research team proposed a miniaturized nonlinear all-optical diffractive deep neural network (N-D2NN) model using ReLU-family functions as nonlinear activation functions. [9] In January 2021, Kulce et al. investigated the dimensions of the all-optical solution space covered by D2NN designs and showed that increasing the number of trainable neurons can enhance the inference ability of D2NN. [10] In July 2021, Luo et al. proposed a metasurface-based diffractive neural network (MDNN) framework based on polarization multiplexing, which can extend the channels of the neural network. [11] In January 2022, our research team proposed an all-optical diffractive neural network based on nonlinear optical materials (DNN-NOM) model, in which the optical limiting effect is taken as the nonlinear activation function of the neural network. [12] At the same time, Ighodalo et al. proposed an algorithm for the inverse design of D2NN using the (DNA)2 optimization program. [13] These studies show the potential of D2NN in all-optical imaging and signal processing and provide new insights for further research.

Other improvements to D2NN
In addition to the use of optical nonlinear properties, other research approaches are currently available to improve D2NN performance. In December 2019, Chen et al. proposed a multi-frequency channel D2NN to improve classification accuracy by combining light waves of different frequencies. [14] In January 2020, Mengu et al. improved D2NN by changing the loss function and using a complex-valued neural network to modulate phase and amplitude. [15] In May 2020, Dou et al. proposed the residual D2NN (Res-D2NN) to solve the vanishing gradient problem by constructing diffractive residual learning blocks. [16] In July 2020, Mengu et al. proposed the vaccinated D2NN (V-D2NN), which significantly improves the tolerance margin of D2NN by modeling undesirable system changes as continuous random variables. [17] In December 2020, Mengu et al. proposed a method to quantify sensitivity in order to overcome D2NN's sensitivity to spatial scaling, translation, and rotation of input objects. [18] Meanwhile, Su et al. proposed the multiwave optical diffractive network (MWDN), which performs different tasks using optical diffraction and different laser wavelengths. These methods significantly improved the robustness of D2NN and its ability to adapt to changes in undesired target fields. [19] In December 2020, Shi et al. proposed the strong-robustness neural network (SRNN) model based on D2NN to design an intelligent imaging detector array; the model reduces the impact of structural error and optical-wave frequency shift on the D2NN output. [20] In the training process, they simulated the error distribution of the solid phase mask by adding Gaussian noise to the weights. In January 2021, Rahman et al. trained 1252 D2NNs with different designs of passive input filters and proposed an iterative pruning algorithm to optimize D2NN, which improved the classification accuracy and generalization ability of the model. [21] In March 2021, Li et al. proposed a D2NN system for optical classification using a single-pixel spectral detector, which uses an array of diffractive modulation layers to encode spatial information into the power spectrum of the diffracted light. [22] In October 2021, Xiao et al. proposed the concept of optical random phase difference to improve the generalization ability of the D2NN model by using complex conjugate factors and compatibility conditions. [23] In July 2021, Shi et al. proposed a multiple-view D2NN array (MDA) system that combines the information of multiple views of 3D objects with a hybrid optoelectronic structure to improve the classification accuracy of 3D objects. [24] In January 2022, Panda et al. proposed a fault-tolerant optical neural network with anti-noise performance using regularization in simulation. [25] In January 2023, Dong et al. proposed an OReLU function using optoelectronic devices to increase sparsity and classification accuracy; by optimizing the threshold factor with genetic algorithms, they achieved high accuracy on the MNIST and Fashion-MNIST datasets. [26] In February 2023, Zhou et al. proposed a novel framework by incorporating a highway network and a wavelet-like phase modulation pattern (WPMP) technique into D2NN; the researchers demonstrated that WPMP can significantly reduce the parameters of the network layer by modulating the phase of the incident light. [27]

Methods

Methods to mitigate vanishing gradient
In N-D2NN, error backpropagation is used to update the network parameters: the error between the network output and the true value is computed and propagated backwards through the network. The gradient is a very critical concept in the error backpropagation process; it represents the rate of change of the error with respect to the N-D2NN parameters and is the basis of the optimization algorithm. However, when the number of N-D2NN layers is large, the error gradually becomes smaller after several steps of backpropagation, so the gradient also shrinks or even disappears. This is the vanishing gradient problem.
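To see why the gradient shrinks, note that the backpropagated gradient through a chain of layers is a product of local derivatives; because the derivative of the Sigmoid never exceeds 0.25, the product decays geometrically with depth. The following toy Python sketch (illustrative only, not the actual N-D2NN training code) makes this concrete:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def chain_gradient(depth, x=2.0):
    """Product of local Sigmoid derivatives along a chain of `depth` layers.

    sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) <= 0.25, so the backpropagated
    gradient decays geometrically as the chain gets longer.
    """
    g = 1.0
    for _ in range(depth):
        s = sigmoid(x)
        g *= s * (1.0 - s)
    return g

for depth in (1, 5, 10):
    print(depth, chain_gradient(depth))
```

Already at ten layers the gradient is smaller than 1e-9 for this input, which is the regime in which the parameters effectively stop updating.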
Vanishing gradient can cause the network parameters to fail to update, thus affecting the training of N-D2NN. In the original N-D2NN model, each neuron contains a complex neuron modulation function t_i^l, composed of an amplitude term a_i^l and a phase term φ_i^l. In previous studies of N-D2NN, the phase φ_i^l used in network training is represented through a latent variable d, namely: [15]

φ_i^l = 2π · Sigmoid(d_i^l).   (1)

In Equation (1), d acts as an auxiliary variable rather than as information flowing through the network, and the Sigmoid function limits φ_i^l to the interval (0, 2π). However, using the Sigmoid function to limit the phase has the following shortcomings. First, the Sigmoid function saturates when its input takes very large positive or negative values, which means the function becomes insensitive to small changes in the input; moreover, during backpropagation in the original N-D2NN model, the weights are barely updated when the gradient is close to 0, the gradient easily vanishes, and the training of N-D2NN cannot proceed effectively. Second, the output of the Sigmoid function is not zero-centered, so non-zero-centered signals are fed into the neurons of later layers, which affects the gradient. Third, because the Sigmoid function is exponential in form, its computational complexity is high. To solve these problems, Equation (2) is used instead:

φ_i^l = ReLU(d_i^l).   (2)

Equation (2) shows that the phase φ_i^l of each neuron becomes unbounded. However, because the term exp(jφ_i^l) is periodic and bounded with respect to φ_i^l, the error backpropagation algorithm can still handle it. Compared with the Sigmoid function, the ReLU function sets some of the neurons to zero, which introduces sparsity into the network, reduces the interdependence of the parameters, and alleviates overfitting. Therefore, in this paper we use the ReLU function to limit the phase, so as to alleviate the vanishing gradient problem during ReLU phase-limit N-D2NN training.
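The contrast between the two phase parameterizations can be sketched numerically. The snippet below (a simplified NumPy illustration, not the actual model code; the function and variable names are ours) compares the gradients of the Sigmoid-limited and ReLU-limited phases with respect to the latent variable d:

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

# Sigmoid parameterization: phase limited to (0, 2*pi) by a latent variable d.
def phase_sigmoid(d):
    return 2.0 * np.pi * sigmoid(d)

def grad_phase_sigmoid(d):
    s = sigmoid(d)
    return 2.0 * np.pi * s * (1.0 - s)   # saturates: nearly 0 for |d| large

# ReLU parameterization: unbounded phase; exp(j*phi) is 2*pi-periodic, so the
# physical modulation is unaffected, while the gradient stays 1 for d > 0.
def phase_relu(d):
    return np.maximum(d, 0.0)

def grad_phase_relu(d):
    return (d > 0.0).astype(float)

d = np.array([-10.0, 0.5, 10.0])
print(grad_phase_sigmoid(d))  # nearly 0 at both extremes: vanishing gradient
print(grad_phase_relu(d))     # constant 1 on the positive side
```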

ReLU phase-limit N-D2NN model principle
The ReLU phase-limit N-D2NN model builds on our previous work, which includes the establishment of N-D2NN [9] and the complex neural network. [28] Figure 1 shows the schematic diagram of the ReLU phase-limit N-D2NN structure. Each neuron in each layer of the ReLU phase-limit N-D2NN model can be regarded as a secondary wave source, described as follows:

w_i^l(x, y, z) = ((z − z_i)/r²) · (1/(2πr) + 1/(jλ)) · exp(j2πr/λ),   (3)

where l denotes the l-th layer of the network, i denotes the i-th neuron of the l-th layer, r = √((x − x_i)² + (y − y_i)² + (z − z_i)²) is the Euclidean distance between the i-th neuron of the l-th layer and a neuron of the (l+1)-th layer, j = √−1, and λ is the incident wavelength. Taking the input plane as the 0-th layer, the output field of the l-th layer can be expressed as:

n_i^l(x, y, z) = w_i^l(x, y, z) · t_i^l(x_i, y_i, z_i) · Σ_k n_k^{l−1}(x_i, y_i, z_i) = w_i^l(x, y, z) · |A| · e^{jΔθ},   (4)

where n_i^l(x, y, z) is the output at (x, y, z) of the i-th neuron at layer l, |A| is the relative amplitude of the secondary wave, Δθ is the phase delay added on each neuron by the input wave Σ_k n_k^{l−1}(x_i, y_i, z_i) and the complex neuron modulation function t_i^l(x_i, y_i, z_i), and t_i^l combines amplitude and phase modulation, namely:

t_i^l(x_i, y_i, z_i) = a_i^l(x_i, y_i, z_i) · exp(jφ_i^l(x_i, y_i, z_i)).   (5)

To simplify the representation of Equations (3), (4), and (5), the forward propagation model can be rewritten as:

n_p^{l+1}(x_p, y_p, z_p) = w_p^{l+1} · t_p^{l+1} · g(Σ_i n_i^l(x_p, y_p, z_p)),   (6)

where i denotes the i-th neuron in the l-th layer, p denotes the p-th neuron in the (l+1)-th layer, and the neurons between adjacent layers are fully connected through diffraction. All the output neurons of layer l−1 are summed at the i-th neuron (x_i^l, y_i^l, z_i^l) of layer l; the result is modulated and then transmitted through the nonlinear unit to the p-th neuron (x_p^{l+1}, y_p^{l+1}, z_p^{l+1}) of layer l+1. Each neuron in each layer propagates as described above. g denotes the nonlinear activation function of the ReLU phase-limit N-D2NN; its role is to pass the modulated secondary wave through a nonlinear unit to the next layer. A single-layer perceptron can only solve linearly separable problems, so its learning ability is minimal. Deep feedforward neural networks introduce activation functions into hidden-layer neurons, enabling them to learn more complex functional relationships, thus improving the learning ability and accuracy of the model.
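As an illustration of Equations (3) to (6), the following simplified NumPy sketch propagates a field through one diffractive layer. It assumes a convolutional (shift-invariant) form of the Rayleigh-Sommerfeld kernel evaluated on a pixel grid, which is a standard efficient equivalent of the fully connected diffraction sum; all function names are ours, not from the original implementation:

```python
import numpy as np

def rs_kernel(n, pitch, dz, wavelength):
    """Rayleigh-Sommerfeld secondary-wave weights w of Eq. (3) on an n x n grid."""
    coords = (np.arange(n) - n // 2) * pitch
    xx, yy = np.meshgrid(coords, coords, indexing="ij")
    r = np.sqrt(xx**2 + yy**2 + dz**2)
    return (dz / r**2) * (1.0 / (2.0 * np.pi * r) + 1.0 / (1j * wavelength)) \
        * np.exp(2j * np.pi * r / wavelength)

def diffract(field, kernel):
    """Sum over all source neurons (Eq. 4) as an FFT-based convolution."""
    return np.fft.ifft2(np.fft.fft2(field) * np.fft.fft2(np.fft.ifftshift(kernel)))

def layer(field, d, kernel, g=None):
    """One layer: diffract, apply t = exp(j * ReLU(d)) (Eqs. 2 and 5), then g (Eq. 6)."""
    phi = np.maximum(d, 0.0)                      # ReLU-limited phase
    out = diffract(field, kernel) * np.exp(1j * phi)
    return g(out) if g is not None else out

# Point source through one layer; parameters follow Table 2
# (wavelength 10.6 um, pixel 5 um, layer spacing 30 * wavelength).
wl = 10.6e-6
k = rs_kernel(64, 5e-6, 30 * wl, wl)
field = np.zeros((64, 64), dtype=complex)
field[32, 32] = 1.0
out = layer(field, np.zeros((64, 64)), k)
```

A hidden-layer nonlinearity such as PReLU can be supplied through g, applied, for example, to the field amplitude while preserving the phase.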
In our previous work, we used ReLU-family functions as the nonlinear activation functions of the hidden layer in the original N-D2NN model. To make this work more detailed and informative, in this paper we use the Tanh, Softsign, Swish, [29] Mish, [30] ReLU, [31,32] Leaky-ReLU, Parametric ReLU (PReLU), [33] Randomized ReLU (RReLU), ELU, [34] SELU (the scaled exponential linear unit of self-normalizing neural networks), [35] and ReLU6 functions as activation functions of the hidden layer in the ReLU phase-limit N-D2NN model. The mathematical models of the above activation functions are shown in Figure 2.
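For reference, real-valued forms of these activation functions can be written compactly as follows (a sketch only; the slope and shape constants shown are common defaults, PReLU's slope is trained rather than fixed, and RReLU samples its slope randomly during training, so a fixed value stands in for it here):

```python
import numpy as np

# Real-valued forms of the hidden-layer activations compared in this paper.
def tanh(x):               return np.tanh(x)
def softsign(x):           return x / (1.0 + np.abs(x))
def swish(x, b=1.0):       return x * (1.0 / (1.0 + np.exp(-b * x)))
def mish(x):               return x * np.tanh(np.log1p(np.exp(x)))
def relu(x):               return np.maximum(x, 0.0)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def prelu(x, a=0.25):      return np.where(x > 0, x, a * x)  # a is learned
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def relu6(x):              return np.minimum(np.maximum(x, 0.0), 6.0)

def selu(x):
    # Fixed constants from the self-normalizing neural networks paper.
    lam, a = 1.0507009873554805, 1.6732632423543772
    return lam * np.where(x > 0, x, a * (np.exp(x) - 1.0))
```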

Computational performance evaluation indicators of the model
In this paper, the confusion matrix and classification accuracy are used as the evaluation methods and indexes of the ReLU phase-limit N-D2NN model. The confusion matrix is a standard format for precision evaluation; Table 1 shows the confusion matrix for ten categories. [38] The accuracy rate represents the percentage of the total samples for which the predicted results are correct. In this paper, we calculate the confusion matrix C_i (i = 0-9). For each single class, the evaluation is defined by TP_i:

TP_i = v_ii,   (7)

and therefore the accuracy rate of the classifier can be expressed as:

Accuracy = (Σ_{i=0}^{9} TP_i) / N,   (8)

where TP_i = v_ii is the number of samples of class C_i correctly predicted by the model (the i-th diagonal element of the confusion matrix), and N is the total number of samples per test.
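These definitions translate directly into code; the following sketch (our own minimal example, not the evaluation script used in the paper) builds a confusion matrix v and computes accuracy as the ratio of its trace to the sample count N:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """v[i, j] counts samples whose true class is i and predicted class is j."""
    v = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        v[t, p] += 1
    return v

def accuracy(v):
    """Accuracy = (sum of diagonal TP_i = v_ii) / N, with N the total samples."""
    return np.trace(v) / v.sum()

# Tiny hypothetical example with three classes and five samples.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
v = confusion_matrix(y_true, y_pred, n_classes=3)
print(accuracy(v))  # 0.8
```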

Model hyperparameter selection
The hyperparameters of the ReLU phase-limit N-D2NN model include the physical parameters of the diffraction grating and the training parameters of the neural network, which are listed in Tables 2 and 3, respectively. In this paper, the ReLU phase-limit N-D2NNs were simulated using Python (v3.6.8) and the TensorFlow (v1.14.0, Google Inc.) framework. The ReLU phase-limit N-D2NNs were trained for 50 epochs on a desktop computer with a GeForce GTX TITAN V graphics processing unit (GPU), an Intel(R) Core(TM) i7-8700K CPU @ 3.37 GHz, and 64 GB of RAM, running the Windows 10 operating system (Microsoft).

Simulation experiment results
In the ReLU phase-limit N-D2NN model based on a 10.6 μm wavelength, we train with each activation function from Section 3.2 as the activation function of the hidden layer. Figure 3 shows the correspondence between epoch and accuracy during training for the different ReLU phase-limit N-D2NN models. From the training curves in Figure 3, it can be seen that the models whose hidden layer uses a ReLU-family activation function still show better inference performance, especially PReLU.

Table 2. Physical parameters of the grating in the ReLU phase-limit N-D2NN.

Grating parameter | Numerical value
Wavelength λ | 10.6 μm
Pixel size | 5 μm
Spacing of diffraction gratings | 30λ

Figure 4 shows the classification performance obtained by the different ReLU phase-limit N-D2NN models on the two datasets. As can be seen from Figure 4, the ReLU phase-limit N-D2NN whose hidden layer is the PReLU function obtained the highest accuracies of 98.38% and 90.14% on the MNIST and Fashion-MNIST datasets, respectively. The recognition accuracy on the MNIST and Fashion-MNIST datasets is increased by 0.57% and 0.82%, respectively, which theoretically demonstrates the correctness of the model. However, when Sigmoid or Softplus is chosen as the activation function of the hidden layer, the ReLU phase-limit N-D2NN suffers from the vanishing gradient problem. This may be because the Sigmoid function changes very little as its input tends to infinity, which hinders the feedback transmission of the ReLU phase-limit N-D2NN model and easily causes the gradient to vanish. During training of the ReLU phase-limit N-D2NN model, the Softplus function is also prone to vanishing gradient, or some neurons of the network may become "necrotic" due to an excessive learning rate. Figures S1 and S2 show the confusion matrix images of the ReLU phase-limit N-D2NN model on the MNIST and Fashion-MNIST datasets, respectively, where different activation functions are chosen for the hidden layer. As can be seen from the confusion matrix images in Figure S1, except for the model whose hidden layer is the RReLU function, the remaining models achieve a classification accuracy of more than 70% on each label of the MNIST dataset; the models with Leaky-ReLU and PReLU hidden layers even exceed 94%. Among them, the ReLU phase-limit N-D2NN model obtained the highest classification accuracy on the MNIST dataset for label 0 (number 0) and label 1 (number 1); in particular, when PReLU is chosen for the hidden layer, the classification accuracy of the model is 100% for both labels. According to Figure S2, the ReLU phase-limit N-D2NN obtains the highest classification accuracy for label 1 (trousers) and label 8 (bag) in the Fashion-MNIST dataset, across the various hidden-layer activation functions. The models with Leaky-ReLU and PReLU hidden layers achieve 97% and 97% on label 1, and 98% and 97% on label 8, respectively.

Calculation performance comparison with other D2NN models
Figure 5 shows the classification accuracy of the Sigmoid phase-limit [9] and ReLU phase-limit N-D2NN models on the MNIST and Fashion-MNIST datasets, respectively. It can be seen from Figure 5(a) that the ReLU phase-limit N-D2NN has higher accuracy than the Sigmoid phase-limit N-D2NN on the MNIST and Fashion-MNIST datasets. As shown in Figure 5(b), except for the model with the Leaky-ReLU function in the hidden layer, the recognition accuracy of the ReLU phase-limit N-D2NN model is higher than that of the Sigmoid phase-limit model. This demonstrates that using the ReLU function to limit the phase can improve the computational performance of N-D2NN. Table 4 shows the computational performance of different D2NN models.

Model | MNIST (%) | Fashion-MNIST (%)
F-D2NN [7] | 98.1 | /
In situ optical training of DONN [8] | 97.48 | /
N-D2NN (Sigmoid phase-limit) [9] | 97.86 | 89.28
DNN-NOM [12] | 97.00 | 88.26
Nonlinear-TAPLM1 [13] | 94.64 | /
Nonlinear-TAPLM2 [13] | 95.82 | /
Nonlinear-OASLM [13] | 94.85 | /
D2NN reducing the influence of vanishing gradient (complex-valued) [15] | 97.81 | 89.32
D2NN reducing the influence of vanishing gradient (phase-only) [15] | 97.18 | 89.13
Res-D2NN [16] | 98.4 | 88.4
D2NN with OReLU function based on genetic algorithm [26] | 97.97 | 87.85
Visible light D2NN [39] | 91.57 | /
ReLU phase-limit N-D2NN (ours) | 98.38 | 90.14

It can be seen from Table 4 that the ReLU phase-limit N-D2NN has good computational performance. Apart from the Res-D2NN model (98.40%), [16] our model has the highest accuracy (98.38%) on the MNIST classification task. However, compared to the 20 hidden layers used in the Res-D2NN model, our proposed model requires only 6 hidden layers and obtains nearly the same accuracy on the MNIST dataset, while being less computationally intensive and taking less time for training and testing. In addition, among the compared models, our model has the highest accuracy (90.14%) on the Fashion-MNIST dataset, which is the highest accuracy obtained by any known D2NN-derived model on that dataset.

Conclusions
This paper proposes a method to improve the computational performance of the N-D2NN model. First, we introduce the role of error backpropagation in the N-D2NN model and the use of the ReLU function to mitigate the vanishing gradient problem in the model. Next, we build the ReLU phase-limit model using a deep feedforward neural network, the Rayleigh-Sommerfeld diffraction equation, and the ReLU phase-limiting factor. Then, we judge the classification performance of the model on the MNIST and Fashion-MNIST datasets by confusion matrix and accuracy. The ReLU phase-limit N-D2NN achieves higher accuracy: with PReLU in the hidden layer, it reaches the highest accuracy of 98.38% (90.14%) on the MNIST (Fashion-MNIST) dataset. Finally, compared with other D2NN-derived models, the ReLU phase-limit N-D2NN model offers among the highest computational performance. The performance of our model on the Fashion-MNIST dataset is higher than that of all currently proposed D2NN models, and the accuracy obtained by our proposed model on the MNIST dataset is lower only than that of the Res-D2NN model; however, we obtain this performance with fewer diffraction layers. This study provides a theoretical basis and methodological guarantee for the realization of physical systems and devices based on a 10.6 μm wavelength nonlinear all-optical diffractive neural network.

Figure 1. Schematic diagram of the ReLU phase-limit N-D2NN structure. (a) Using MNIST as the dataset to train the model. (b) Using Fashion-MNIST as the dataset to train the model.

Figure 3. Correspondence between epoch and accuracy during training for different ReLU phase-limit N-D2NN models. (a) Using MNIST as the dataset to train the model. (b) Using Fashion-MNIST as the dataset to train the model.

Figure 4. Classification performance obtained by the different ReLU phase-limit N-D2NN models on the MNIST and Fashion-MNIST datasets.

Figure 5. Classification accuracy on the datasets of N-D2NN with the phase limited by the two kinds of functions. (a) MNIST. (b) Fashion-MNIST.
Additionally, activation functions normalize the transmitted data, limit its expansion, and prevent the overflow risks caused by excessively large values. A neural network with activation functions can perform nonlinear transformations of the data, going beyond the expressive power of a linear model to solve classification problems. In previous studies, Wei et al. and Mengu et al. discussed the nonlinearity of D2NN.

Table 1. The confusion matrix of the ten classes.

Table 3. Training parameters in the ReLU phase-limit N-D2NN.

Table 4. Computational performance of different D2NN models.