The unified effect of data encoding, ansatz expressibility and entanglement on the trainability of HQNNs

Recent advances in quantum computing and machine learning have brought about a promising intersection of these two fields, leading to the emergence of quantum machine learning (QML). However, the integration of quantum computing and machine learning poses several challenges. One of the prominent challenges lies in the presence of barren plateaus (BP) in QML algorithms, particularly in quantum neural networks (QNNs). Recent studies have successfully identified the fundamental causes underlying the existence of BP in QNNs. This paper presents a framework designed to explore the interplay of multiple factors contributing to the BP problem in quantum neural networks (QNNs), which poses a critical challenge for the practical applications of QML. We focus on the combined influence of data encoding, qubit entanglement, and ansatz expressibility in hybrid quantum neural networks (HQNNs) for multi-class classification tasks. Our framework aims to empirically analyze the joint impact of these factors on the training landscape of HQNNs. Our results show that the occurrence of the BP problem in HQNNs is contingent upon the expressibility of the underlying ansatz and the type of the adopted data encoding technique. Additionally, we observe that qubit entanglement also plays a role in exacerbating the BP problem. Leveraging various evaluation metrics for classification tasks, we systematically evaluate the performance of HQNNs and provide recommendations tailored to different constraint scenarios. Our findings emphasize the significance of our framework in addressing the practical success of QNNs.


Introduction
The quest to build practical quantum computers has intensified over the last few years. Several quantum devices, with around one hundred qubits, have already been developed. These devices are known as noisy intermediate-scale quantum (NISQ) devices [1]. Although these devices are limited and susceptible to errors, they have demonstrated a clear advantage over the best existing classical computers for specific applications [2][3][4][5]. As NISQ devices require quantum routines of shallow depth and robustness against noise, a hybrid design space integrating classical and quantum processing has become a leading approach to realize the potential of quantum computing for a wide range of applications [6].
In a hybrid design space context, variational quantum algorithms (VQAs) are the most popular class of algorithms. These algorithms utilize NISQ devices for evaluating the objective function through parameterized quantum circuits (PQCs) and classical devices for optimizing that function with respect to the target application. VQAs have been studied for a wide range of applications, including quantum chemistry [7], state diagonalization [8], factorization [9], quantum optimization [10], and quantum field theory simulation [11,12]. Furthermore, these algorithms have also been studied in the context of noise resilience [13], trainability [14][15][16] and computational complexity [17,18]. VQAs closely resemble machine learning (ML) algorithms, as they also train a computer to learn patterns [19]. Therefore, VQAs have been proposed as quantum analogs of various ML algorithms [10][20][21][22][23][24]. Consequently, the new field of quantum machine learning (QML) has emerged by merging quantum computation and ML.
In recent years, a number of PQCs have been proposed for QML applications [25]. Amongst them are quantum neural networks (QNNs), which have been extensively explored [26][27][28][29][30][31][32][33] as quantum extensions of classical deep neural networks (DNNs). The basic building block of QNNs is the quantum perceptron, which has been proposed in different ways [34][35][36][37][38][39][40]. To this end, PQCs became a promising quantum analog of artificial neurons [41][42][43]. Given the limitations of the NISQ era, hybrid quantum neural networks (HQNNs) are commonly used to analyze the potential quantum advantage in QNNs. HQNNs replicate the QNN architecture by enclosing a typical QNN in classical input preprocessing and output postprocessing. The input preprocessing typically aims to downsize the input to cope with the limitations of NISQ devices (mainly the number of qubits), and is done via a classical neuron layer with fewer neurons or a dimensionality reduction algorithm. The output postprocessing interprets the output of the enclosed QNN in a meaningful way. It is usually done via a classical neuron layer at the end, which also allows the familiar non-linear activation functions to be applied to obtain the final output/prediction. Although QNNs are being extensively explored for various applications, the literature still lacks solid and concrete statements about their quantum advantage [44][45][46].

Barren Plateaus
One of the most challenging problems in HQNNs is the phenomenon of barren plateaus (BP) [14][15][16][47]. In a BP, the cost function landscape during HQNN training becomes exponentially flat as the system size increases. In other words, the gradients of the parameters subject to optimization vanish exponentially as a function of the number of qubits. The existence of BP in the training landscape of an HQNN therefore affects its trainability, resulting in significant performance degradation and consequently limiting the HQNN's applicability in practice.
It has been demonstrated in [14] that a sufficiently random ansatz (a quantum subroutine consisting of a sequence of gates applied to specific wires) will experience BP if the distribution of its unitaries matches the uniform (Haar) distribution up to the second moment, i.e., if it forms a unitary 2-design. Therefore, the choice of ansatz is central to the success of hybrid quantum-classical algorithms. Some frequently explored ansatzes in this regard are the Hamiltonian variational ansatz [48], the coupled cluster ansatz [49][50][51], the quantum alternating operator ansatz [10,52] and the hardware-efficient ansatz [53]. Ideally, the ansatz is required to be both trainable and expressible to reach an optimal solution. Expressibility is desired so that the ansatz can provide an accurate approximation to the solution. Simultaneously, the training landscape needs to have sufficiently accessible features for the solution to be found. The expressibility of a quantum ansatz describes how uniformly it can explore the unitary space [54]. The authors in [54] extend the original work reporting BP [14] to ansatz expressibility. They propose the idea of problem-inspired ansatzes rather than hardware-efficient ansatzes, which are of significant importance in the NISQ era. The expressibility of the ansatz plays a significant role in the occurrence of BP: ansatzes with greater expressibility are more susceptible to BP, and vice versa. Similarly, deep ansatzes are considered to be more expressible [14]. Moreover, the required ansatz expressibility can be derived directly from the nature of the target problem (a more expressible ansatz for complex problems, and vice versa).
In addition to ansatz expressibility, the problem of BP in HQNNs may also arise due to the type of entanglement used in PQCs [55], the noise levels in high-depth quantum circuits [1,47][56][57][58] and how the data is encoded into the PQCs [43]. Entanglement is a fundamental property of quantum mechanics and is key to constructing expressible quantum circuits. However, it can also be a potential source of BP, as discussed in [55]. Data encoding is also a crucial step in HQNNs and is often considered to be the performance bottleneck; it can also affect the trainability and expressibility of HQNNs [59].

Research Gap
As discussed in Section 1.1, data encoding, ansatz expressibility, and entanglement are simultaneously used in HQNNs. While the dependence of BP on each of these concepts has been explored separately in different studies [43,54,55,59], to the best of our knowledge, their joint (holistic) effect (with respect to each other) on the trainability of QNNs from the aspect of BP has not been investigated yet. Furthermore, existing work focuses on the standalone mathematical implementation (formulation or modeling) of QNNs, whereas HQNNs are more relevant for practical applications on NISQ devices and allow experimentation with real-world datasets. Therefore, a framework for HQNNs that allows the simultaneous analysis of the effect of data encoding, ansatz expressibility, and entanglement between qubits on the trainability of HQNNs is needed.

Proposed Framework
In this paper, we propose a framework to perform an empirical analysis (based on data obtained from experiments) of the joint effect of data encoding, ansatz expressibility, and entanglement in HQNNs with respect to one another. Specifically, we investigate the effects of the aforementioned concepts in feed-forward HQNNs with a hardware-efficient periodic ansatz structure for a real-world application (multi-class classification). An abstract illustration of our analysis is depicted in Figure 1.
The classical-to-quantum feature mapping is achieved via two frequently used data encoding strategies, called amplitude encoding and angle encoding. For the PQC, we consider two similar ansatz structures (to be used as hidden quantum layers in the HQNN) differentiated only by the inclusion/removal of entanglement, which we name the entangled ansatz (shown as Nearest Neighbour Entanglement in Figure 1) and the unentangled ansatz (shown as No Entanglement in Figure 1). Both ansatz structures are evaluated separately for each encoding scheme. We consider different widths (n) of the underlying quantum layers for the analysis of BP in HQNNs. Moreover, different depths (m) of quantum layers are considered for all widths to analyze how ansatz expressibility plays a role in the trainability of HQNNs for both ansatzes.
We benchmark accuracy and loss convergence as evaluation metrics for the models used in this article. The HQNNs (with certain n and m) achieving higher accuracy with faster and smoother convergence to the optimal solution are considered better than others. It typically implies that these models are not (yet) fully exposed to BP, and hence yield better performance. In addition to accuracy and loss convergence, we also evaluate the HQNNs on some additional performance parameters: precision, recall and F1-score.
Figure 1: An overview of the proposed methodology.

Contribution
We perform an extensive set of experiments to study the combined effect of data encoding, ansatz expressibility and entanglement between qubits on the trainability of HQNNs on a real-world dataset. Specifically, we use the handwritten digits dataset from sklearn [60]. We select this dataset over similar well-known datasets such as MNIST [61] because of its smaller image size, which is more suitable for NISQ devices. Based on the achieved results, we observe that the problem of BP also arises in HQNNs as the number of qubits increases, resulting in performance degradation. Furthermore, the occurrence of BP is dependent on the ansatz depth (expressibility) irrespective of the data encoding strategy. Consequently, we perform the trainability and expressibility analysis for both data encodings, first for the entangled and then for the unentangled ansatz. This analysis provides an idea of the appropriate depth of quantum layers for a given width when working with real-world applications. Moreover, the obtained results also indicate which encoding strategy is slightly more advantageous (from the BP and trainability aspect) than the other. In addition, we observe that the entanglement between the qubits in quantum layers also plays a role in the trainability of HQNNs. However, its impact (positive or negative) on the overall performance of the underlying model depends on how the data is encoded. Moreover, we also evaluate the HQNNs in terms of different evaluation metrics for classification applications (precision, recall and F1-score), which signifies the relevance of HQNNs in real-world applications. Lastly, we illustrate the significance of our proposed framework by considering different constraint scenarios on the primary components of HQNNs (data encoding, ansatz expressibility and entanglement inclusion/removal), and provide recommendations for the optimal set of parameters.
It is important to note that mathematical analyses of the individual performance parameters (such as data encoding, ansatz expressibility, and entanglement) exist in the literature [43,54,55,59]. Therefore, the mathematical formulation of the various performance parameters in terms of trainability is out of the scope of this paper. However, the combined effect of all these non-trivial components (performance parameters) of HQNNs for a practical application has not yet been explored. One possible reason behind the lack of such a unified analysis is that it is challenging to theoretically analyze the effect of all these concepts (with respect to each other) simultaneously in a single framework. Experimental investigation, however, makes such a unified analysis possible. Therefore, we attempt to experimentally/empirically analyze their joint (holistic or simultaneous) effect from a practical viewpoint.

Organization
The rest of the paper is organized as follows: Section 2 provides the necessary background on HQNNs along with their main components. The state-of-the-art on potential solutions and analysis of the BP problem in HQNNs is discussed in Section 3. The detailed methodology for the development of the HQNN analysis framework is presented in Section 4. The experimentation details, including the list of experiments performed, are discussed in Section 5. The analysis of the proposed framework based on the obtained results is discussed in Section 6. Section 7 illustrates the significance of our proposed framework for various design constraints of HQNNs. Finally, Section 8 concludes the paper.

Hybrid Quantum Neural Networks
In this section, HQNNs are discussed in analogy with classical deep neural networks. Furthermore, the main components of HQNNs as used in this paper, i.e., data encoding, and observable measurement and cost function details, are discussed in Section 2.1 and Section 2.2, respectively. Lastly, ansatz trainability and expressibility are discussed in Section 2.3.
In a typical setting of classical DNNs (Figure 2a), the first step is to map the input data to the feature space via feature embedding layers $F_x(\cdot)$. The embedded data is then passed through fully connected neuron layers $W_l(\cdot)$, indexed by the layer $l$, which are trained to learn the inherent relationships between the input and output of a particular dataset. The number of layers and the number of neurons in each layer can be customized according to the complexity of the target application.
The quantum counterpart of DNNs (i.e., QNNs) has recently attracted attention due to the tremendous success of DNNs and the improved computational power that QNNs may offer [62]. QNNs have a structure similar to DNNs, as shown in Figure 2b. Like other QML algorithms, QNNs exploit PQCs, which are quantum circuits that depend on classically optimizable parameters. Combining both architectures results in an HQNN.
The HQNNs work in five steps [63]:
(1) Input downscaling: The input is first downscaled to cope with the limitations of NISQ devices, mainly the number of qubits.
(2) Feature mapping: Classical data points x are mapped to an n-qubit quantum state $|\psi\rangle$ represented by Equation 1, where S(x) is the mapping function.
(3) Training: The prepared quantum state is processed and trained using an ansatz U via a series of single- and multi-qubit unitaries, as shown in Equation 2. The rotation angles in the ansatz are parameterized by the vector θ.
(4) Measurement: The quantum state is then measured and the corresponding eigenvalue of the measurement observable O is obtained, as shown in Equation 3.
(5) Classical postprocessing of the observable: The quantum state measurement results in a classical value, which can hence be processed by a classical device. The postprocessing of measurement results is commonly done via a classical neuron layer. It also allows the application of familiar non-linear activation functions such as SoftMax, and of an optimization routine to minimize the cost function.
Steps 2-4 above form the typical architecture of a QNN, as shown in Figure 2b. Hence, a typical HQNN architecture completely replicates the QNN, making the analysis of the quantum part of HQNNs (steps 2-4 above) valid for QNNs as well.
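The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work. Equations 1-3 are assumed to take the standard forms $|\psi(x)\rangle = S(x)|0\rangle^{\otimes n}$, $|\psi(x,\theta)\rangle = U(\theta)|\psi(x)\rangle$ and $\langle\psi(x,\theta)|O|\psi(x,\theta)\rangle$. Average pooling stands in for the downscaling step, the ansatz is a single layer of $R_y$ rotations, and a per-qubit $\sigma_z$ expectation is read out for simplicity. The names `downscale`, `apply_ry`, `softmax` and `hqnn_forward` are hypothetical.

```python
import math

def apply_ry(state, qubit, theta, n):
    """Apply R_y(theta) to one qubit of an n-qubit real statevector (qubit 0 is the leftmost bit)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    bit = 1 << (n - 1 - qubit)
    new = state[:]
    for i in range(len(state)):
        if not i & bit:                       # i has the qubit in |0>, j has it in |1>
            j = i | bit
            new[i] = c * state[i] - s * state[j]
            new[j] = s * state[i] + c * state[j]
    return new

def downscale(x, n):
    """Step 1 (assumed choice): average-pool the input into n features."""
    k = len(x) // n
    return [sum(x[i * k:(i + 1) * k]) / k for i in range(n)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hqnn_forward(x, thetas, weights):
    n = len(thetas)
    feats = downscale(x, n)                               # (1) input downscaling
    state = [1.0] + [0.0] * (2 ** n - 1)                  # initial state |0...0>
    for q, f in enumerate(feats):                         # (2) feature map S(x)
        state = apply_ry(state, q, 2 * f, n)              #     cos(f)|0> + sin(f)|1>
    for q, t in enumerate(thetas):                        # (3) ansatz U(theta)
        state = apply_ry(state, q, t, n)
    expvals = []                                          # (4) <Z> on each qubit
    for q in range(n):
        bit = 1 << (n - 1 - q)
        expvals.append(sum((-a * a if i & bit else a * a) for i, a in enumerate(state)))
    logits = [sum(w * e for w, e in zip(row, expvals)) for row in weights]
    return softmax(logits)                                # (5) classical postprocessing

# Two qubits, two output classes, an all-zero input of four features.
probs = hqnn_forward([0.0, 0.0, 0.0, 0.0], thetas=[math.pi, 0.0], weights=[[1, 0], [0, 1]])
```

In this toy call the first qubit is rotated to $|1\rangle$ and the second stays in $|0\rangle$, so the classical layer receives the expectation values (−1, +1) and the second class receives the larger probability.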
Generally, in QML, the underlying PQCs are executed repeatedly for an input x and a parameter vector θ to approximate the expectation value, because of the probabilistic nature of quantum computation. In QNNs, this expectation value is usually considered as the output of the network [63]. In the following subsections, we provide a brief overview of the different parts of HQNNs in correspondence with the approaches we use in this work.

Data Encoding
In QNNs, feeding the data in a form that a quantum circuit can process has been a pressing challenge, and is often termed data encoding. Data encoding can be considered a feature map that maps an input feature x to the Hilbert space of the quantum system, thus creating a quantum state $|\psi_x\rangle$ [64]. The encoding process is a crucial step in designing quantum algorithms and can significantly affect their performance [27][64][65][66]. In practice, the transformation ($x \rightarrow |\psi_x\rangle$) is typically achieved through a unitary transformation ($S_x$), implemented using a variational circuit whose parameters depend on the input data being encoded [27,64,67].
The circuit ($S_x$) then acts on the initial state $|\psi\rangle$, which is usually the ground state, i.e., $|\psi\rangle = |0\rangle^{\otimes n}$. The encoding is then realized as in Equation 4:
$$|\psi_x\rangle = S_x |0\rangle^{\otimes n} \tag{4}$$
However, the transformation circuit $S_x$ is required to be hardware-efficient to accommodate the limitations imposed by the NISQ regime. Several encoding strategies have recently been proposed [64,66,67]. In HQNNs, the most frequently used ones are amplitude and angle encoding.

Amplitude Encoding
In amplitude encoding, data is encoded into the amplitudes of a quantum state. For $x \in \mathbb{R}^N$, amplitude encoding maps $x \rightarrow E(x)$ into the amplitudes of an n-qubit quantum state, as shown in the Equation below:
$$|\psi_x\rangle = \sum_{i=1}^{N} x_i |i\rangle$$
where $N = 2^n$, $x_i$ is the $i$-th element of $x$ and $|i\rangle$ is the $i$-th computational basis state. Consider a classical dataset $D$ with $M$ examples and $N$ features, as shown in the Equation below:
$$D = \{x^{(1)}, \ldots, x^{(m)}, \ldots, x^{(M)}\}$$
where $x^{(m)}$ is an $N$-dimensional feature vector for $m = 1, \ldots, M$. The amplitude vector can then be understood as the concatenation of all $x^{(m)}$ into a single vector as follows:
$$\alpha = C_{norm}\{x_1^{(1)}, \ldots, x_N^{(1)}, x_1^{(2)}, \ldots, x_N^{(2)}, \ldots, x_1^{(M)}, \ldots, x_N^{(M)}\}$$
The factor $C_{norm}$ is the normalization constant, chosen such that $|\alpha|^2 = 1$. The input can then be represented in the computational basis using the following Equation:
$$|D\rangle = \sum_{i=1}^{2^n} \alpha_i |i\rangle$$
where $\alpha_i$ are the elements of the amplitude vector $\alpha$ and $|i\rangle$ are the computational basis states. The total number of amplitudes to be encoded is $N \times M$. A system of $n$ qubits can encode $2^n$ features. Therefore, amplitude encoding requires $n \geq \log_2(NM)$. Constants can be padded if the total number of features being encoded is smaller than $2^n$ [65].
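As a concrete illustration of the normalization, padding and qubit count described above, the following minimal sketch computes the amplitude vector for a single real feature vector. The function name `amplitude_encode` and the parameter `pad_value` are hypothetical, and padding with zeros is an assumed choice of padding constant.

```python
import math

def amplitude_encode(x, pad_value=0.0):
    """Return (amplitudes, n_qubits) for a real feature vector x.

    The vector is padded up to the next power of two and then normalized
    so that the squared amplitudes sum to one.
    """
    n_qubits = max(1, math.ceil(math.log2(len(x))))
    padded = list(x) + [pad_value] * (2 ** n_qubits - len(x))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [v / norm for v in padded], n_qubits

amps, n = amplitude_encode([3.0, 4.0])            # one qubit, amplitudes 0.6 and 0.8
amps64, n64 = amplitude_encode(range(1, 65))      # 64 features fit into 6 qubits
amps3, n3 = amplitude_encode([1.0, 1.0, 1.0])     # 3 features are padded to 4 amplitudes
```

A feature vector with 64 entries, such as a flattened 8 × 8 image, therefore needs only six qubits under this encoding.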

Angle Encoding
Angle encoding, also called qubit encoding, encodes the input data features into the rotation angles of qubits and has been used in various QML algorithms [65,68,69]. For a feature vector $x = [x_1, x_2, \ldots, x_N]^T \in X^N$, the following Equation typically represents the angle encoding:
$$|x\rangle = \bigotimes_{i=1}^{N} \left(\cos(x_i)|0\rangle + \sin(x_i)|1\rangle\right)$$
Angle encoding encodes $N$ features into an $n$-qubit system with one feature per qubit, i.e., $n = N$. The state preparation unitary for a single feature per qubit in qubit encoding is represented by the following Equation:
$$S_x = \bigotimes_{i=1}^{N} U_i, \qquad U_i = \begin{bmatrix} \cos(x_i) & -\sin(x_i) \\ \sin(x_i) & \cos(x_i) \end{bmatrix}$$
The angle encoding approach can also be generalized to encode two features in a single qubit; this is called dense qubit encoding [67]. However, in this work, we adopt simple qubit encoding.
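To make the qubit requirement of angle encoding tangible, the sketch below builds the encoded product state explicitly. It assumes the commonly used form in which feature $x_i$ prepares $\cos(x_i)|0\rangle + \sin(x_i)|1\rangle$ on qubit $i$; the function name `angle_encode` is hypothetical.

```python
import math

def angle_encode(x):
    """Return the 2^N real amplitudes of the product state encoding the N features in x.

    The first feature is mapped to the leftmost (most significant) qubit.
    """
    state = [1.0]
    for xi in x:
        qubit = (math.cos(xi), math.sin(xi))
        state = [a * b for a in state for b in qubit]   # Kronecker product with the new qubit
    return state

state = angle_encode([0.0, math.pi / 2])   # qubit 0 stays in |0>, qubit 1 is rotated to |1>
wide = angle_encode([0.1, 0.2, 0.3, 0.4])  # 4 features need 4 qubits, i.e. 16 amplitudes
```

Each additional feature doubles the length of the statevector, which reflects the one-qubit-per-feature cost of this encoding.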

Observable Measurement and Cost Function
Qubits can be measured in different measurement bases. In this work, we use the eigenbasis of $\sigma_z$ to obtain the expectation value of our PQC. For an n-qubit system, the observable measured in the $\sigma_z$ basis can be described as the tensor product of n Pauli-Z matrices, i.e., $O = \sigma_z^{\otimes n}$. This observable returns −1 for quantum states of odd parity and +1 for states of even parity, keeping the overall expectation value of the PQC in the range [−1, 1].
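The parity behaviour of this observable can be checked with a short sketch. Since $\sigma_z^{\otimes n}$ is diagonal in the computational basis, its expectation value is the sum of the basis-state probabilities weighted by +1 (even number of ones) or −1 (odd number of ones). The function name `parity_expectation` is hypothetical.

```python
import math

def parity_expectation(state):
    """Expectation value of the n-fold tensor product of Pauli-Z for a statevector."""
    return sum((-1) ** bin(i).count("1") * abs(a) ** 2 for i, a in enumerate(state))

even = parity_expectation([1, 0, 0, 0])           # |00>: even parity
odd = parity_expectation([0, 1, 0, 0])            # |01>: odd parity
bell = parity_expectation([1 / math.sqrt(2), 0, 0, 1 / math.sqrt(2)])  # (|00> + |11>)/sqrt(2)
mixed = parity_expectation([0.5, 0.5, 0.5, 0.5])  # equal weight on both parities
```

The four calls evaluate to +1, −1, +1 and 0 respectively, all within the stated range [−1, 1].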
The training of the quantum layers in HQNNs amounts to finding the parameter vector θ that minimizes the loss over the training iterations. Since we consider a multi-class classification problem, we use sparse categorical cross entropy as the cost function, as shown in the Equation below:
$$\mathcal{L}(y, \hat{y}) = -\sum_{i} y_i \log(\hat{y}_i)$$
where $y_i$ are the true labels and $\hat{y}_i$ are the predictions. Classical optimization techniques, such as gradient descent, can be used to optimize the cost function; they take the partial derivative of the cost with respect to each parameter and use it to decide the direction of the next update. However, the output of the PQC, i.e., the expectation value of the measurement observable, needs to be differentiated with respect to every parameter in the variational ansatz. The expectation value of a PQC can be differentiated with respect to each parameter using the parameter shift rule (Equation 12), first introduced for QML algorithms in [31] and extended in [70].
In Equation 12, $s$ is the macroscopic shift, which is determined by the eigenvalues of the corresponding gate parameterized by $\theta_i$.
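The parameter shift rule can be illustrated on a single qubit. The sketch below assumes a commonly used form of the rule for gates generated by Pauli operators, $\partial f / \partial\theta_i = [f(\theta_i + s) - f(\theta_i - s)] / (2\sin s)$, with $s = \pi/2$ as the usual choice; the exact form of Equation 12 may differ. The names `expval_z` and `parameter_shift` are hypothetical.

```python
import math

def expval_z(theta):
    """<Z> for the state R_y(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>."""
    a0, a1 = math.cos(theta / 2), math.sin(theta / 2)
    return a0 ** 2 - a1 ** 2

def parameter_shift(f, theta, s=math.pi / 2):
    """Gradient of f at theta from two evaluations shifted by +s and -s (assumed form)."""
    return (f(theta + s) - f(theta - s)) / (2 * math.sin(s))

theta = 0.3
grad = parameter_shift(expval_z, theta)   # analytic value: d/dtheta cos(theta) = -sin(theta)
```

Unlike a finite-difference estimate, the shift here is not small: the two circuit evaluations are far apart, and the result is exact for this class of gates.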

Ansatz Trainability and Expressibility
The ansatz can be thought of as a PQC consisting of single-qubit parameterized gates. It may or may not contain multi-qubit (entangling) gates, depending on the target problem. The parameterized gates depend on adjustable parameters. In the context of HQNNs (or QML in general), these parameters are trained in data-driven tasks, analogous to the case of classical NNs. The network is said to be trainable as long as the optimization algorithm is able to reduce the loss (as per the defined cost function) in every training iteration. Trainability is compromised when the gradients of the parameters are no longer accessible for further optimization and the optimization algorithm cannot reach the optimal solution. A PQC can be considered expressive if it can be exploited to uniformly explore the unitary group $\mathcal{U}(d)$. Thus, the expressibility of a PQC that generates an ensemble of unitaries $\mathbb{U}$ can be defined in terms of the following super-operator [54]:
$$\mathcal{A}^{(t)}_{\mathbb{U}}(\cdot) := \int_{\mathcal{U}(d)} d\mu(V)\, V^{\otimes t} (\cdot) (V^{\dagger})^{\otimes t} - \int_{\mathbb{U}} dU\, U^{\otimes t} (\cdot) (U^{\dagger})^{\otimes t}$$
where $d\mu(V)$ is the volume element of the Haar measure and $dU$ is the volume element corresponding to the uniform distribution over $\mathbb{U}$. If $\mathcal{A}^{(t)}_{\mathbb{U}}(X) = 0$ for all operators $X$, then averaging over the elements of $\mathbb{U}$ agrees with averaging over the Haar distribution up to the $t$-th moment. In this case, $\mathbb{U}$ forms a $t$-design.

Related Work
In this section, we provide details of recent state-of-the-art solutions (mainly inspired by the BP problem in classical NNs) to potentially overcome the issue of BP in QNNs, aiming to enhance their trainability. Additionally, some potential sources of BP in the training landscapes of QNNs are also discussed, where the BP problem is analyzed from the aspect of the underlying ansatz.
The BP phenomenon in QNNs was first studied in [14], where random PQCs were initialized and the variance of the partial derivatives of the cost function $C$ was calculated, i.e., $\mathrm{Var}[\partial C] = \langle (\partial C)^2 \rangle - \langle \partial C \rangle^2$. The authors then show that for deep circuits of depth poly(n), the variance vanishes exponentially with the number of qubits $n$, i.e., $\mathrm{Var}[\partial C] \propto \frac{1}{2^n}$, because the circuit forms a 2-design (it samples unitaries from the whole Hilbert space).
Unlike the classical case, where a gradient-based backpropagation algorithm solves this trainability problem, its quantum analog is challenging to implement [71]. In HQNNs, the PQC is run on a quantum device, whereas its optimization is performed classically. The composition of two fundamentally different computation approaches makes it challenging to implement backpropagation algorithms in QNNs. Recently, the problem of BP in QNNs has attracted significant attention. While some works propose solutions to potentially overcome the BP problem, others analyze the fundamental structure of QNNs to determine the parameters which give rise to the BP problem. In the following sections, we present some recent state-of-the-art work focusing on tackling and analyzing the issue of BP in QNNs.

Potential Solutions of BP
Several solutions were recently proposed to potentially overcome the BP problem in QNNs. In [68], a small portion of the quantum circuit is initialized randomly while the remaining parameters are carefully chosen to implement the identity operation as a whole. This approach avoids initialization on a plateau only for the first training step, and the learnability is still affected during the subsequent training iterations. Similarly, in [72], a strategy to avoid BP was introduced by enforcing the assignment of multiple parameters in the circuit to reduce the total number of parameters subject to training. However, this approach limits the optimization process to a particular set of parameters and eventually increases the circuit depth required for convergence.
Inspired by the layer-wise training of classical NNs, which was shown to potentially prevent the BP caused by random initialization in classical NNs [73], a similar approach has also recently been used in QNNs [71]. The layer-wise training approach in QNNs focuses on training a small subset of parameters in each training iteration while incrementally increasing the circuit depth, which results in larger gradient magnitudes than training the complete circuit because fewer parameters are involved.
A novel approach for mitigating BP in QNNs was recently proposed in [74], where the residual approach from classical NNs is exploited in QNNs. The study demonstrates that incorporating the residual approach in QNNs leads to a significant enhancement in their training performance.
Since BP is fundamentally the problem of vanishing gradients, it may seem that gradient-free optimization approaches can help overcome this issue. However, it has recently been proved that even gradient-free optimization cannot escape the BP issue in QNNs [75]. In fact, it was shown that the cost function differences (the deciding factor for optimization decisions in gradient-free approaches) are exponentially suppressed in BP.

Analysis of BP
BP is a property of the cost function landscape, in which the partial derivatives with respect to the parameters vanish exponentially with the system size. As mentioned earlier, the initial study discusses the occurrence of BP for deep ansatzes, which are usually believed to be more expressible.
A recent study comparing the existence of BP in shallow and deep circuits is presented in [76]. The authors show that, by making BP dependent on the cost function, the phenomenon can be extended to shallow circuits. To this end, they study two cost functions: a local cost function (measuring a single qubit in a multi-qubit system) and a global cost function (measuring all qubits in a multi-qubit system). The authors conclude that for the global cost function, QNNs experience BP irrespective of the depth of the underlying PQC. On the other hand, for the local cost function, the gradients vanish at worst polynomially, and the QNNs are therefore trainable up to a depth of order $O(\log(n))$. BP starts appearing at depths of order $O(\mathrm{poly}(n))$ for the local cost function, and between these regimes there is a transition region where the gradient decay changes from polynomial to exponential. A follow-up investigation on the globality and locality of cost functions in real-world applications was conducted in [77]. The study posits that, in the context of multi-class classification, the use of a global cost function results in significantly improved performance as compared to local cost functions. However, for binary classification, both global and local cost functions demonstrate similar levels of effectiveness. These findings suggest that the choice of cost function must be carefully considered when designing classification models for multi-class scenarios, whereas in the case of binary classification, either approach may be equally viable.
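The contrast between global and local cost functions can be reproduced in a toy setting. The sketch below is not the layered ansatz analyzed in [76]; it assumes a single layer of $R_y(\theta_i)$ rotations on $|0\rangle^{\otimes n}$, for which the global cost $\langle\sigma_z^{\otimes n}\rangle$ equals $\prod_i \cos\theta_i$ and the local cost $\langle\sigma_z\rangle$ on the first qubit equals $\cos\theta_1$. For angles drawn uniformly from $[0, 2\pi)$, the variance of the gradient with respect to $\theta_1$ is then $2^{-n}$ for the global cost and $1/2$ for the local cost. All function names are hypothetical.

```python
import math
import random

def global_cost(thetas):
    """<Z x ... x Z> after one R_y(theta_i) per qubit on |0...0>: product of cos(theta_i)."""
    return math.prod(math.cos(t) for t in thetas)

def local_cost(thetas):
    """<Z> on the first qubit only: cos(theta_0)."""
    return math.cos(thetas[0])

def shift_gradient(cost, thetas, index=0):
    """Parameter-shift gradient of `cost` with respect to thetas[index]."""
    plus, minus = list(thetas), list(thetas)
    plus[index] += math.pi / 2
    minus[index] -= math.pi / 2
    return (cost(plus) - cost(minus)) / 2

def gradient_variance(cost, n_qubits, samples=20000, seed=0):
    """Monte Carlo estimate of the gradient variance over uniformly random angles."""
    rng = random.Random(seed)
    grads = [
        shift_gradient(cost, [rng.uniform(0, 2 * math.pi) for _ in range(n_qubits)])
        for _ in range(samples)
    ]
    mean = sum(grads) / samples
    return sum((g - mean) ** 2 for g in grads) / samples

for n in (2, 4, 6, 8):
    print(n, gradient_variance(global_cost, n), gradient_variance(local_cost, n))
```

Even without any entangling gates, the global cost in this toy model shows an exponentially shrinking gradient variance, while the local cost does not, which is consistent with the depth-independent behaviour reported for global cost functions.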
Recent work [54] related ansatz expressibility to gradient magnitudes and showed that the more expressible the ansatz is, the more likely it is to have a BP. On the other hand, an ansatz with relatively lower expressibility would have a delayed BP up to a certain depth, but in principle, expressible ansatzes are still favorable because they might provide solutions for multiple problems, whereas less expressible ansatzes would be problem specific. Therefore, the greater expressibility of deep ansatzes makes them more susceptible to experiencing BP during training.
Excess entanglement in the hidden layers of QNNs can also cause BP and hinder the learning process [55]. Exploiting the volume law from quantum thermodynamics, the authors of [55] show the existence of BP in the cost function landscape for both gradient-based and gradient-free optimization approaches. This observation was made for both feedforward QNNs and quantum Boltzmann machines. Moreover, the way data is encoded into the QNN can also give rise to BP [43].
In this paper, we do not intend to propose a solution for the BP problem. Instead, we focus on the analysis of various components of HQNNs, in line with Section 3.2.

Methodology
This section presents the detailed methodology used to obtain the relevant results for the proposed analysis. We start by providing the details of the dataset used for training the HQNNs and how it is preprocessed to cope with the limitations of NISQ devices. Afterwards, the details of QNN construction, from qubit initialization to final measurement, are presented. Finally, the discussion of the methodology is concluded by providing the details of the classical postprocessing of QNN results.
The typical architecture of the HQNN used in this paper for the unified analysis of the joint effect of data encoding, ansatz expressibility, and entanglement on trainability is depicted in Figure 3. The HQNN architecture completely replicates the QNN architecture from Figure 2b. We call the architecture used in this paper hybrid for two reasons. First, the optimization is performed classically, which is also the case for general standalone QNNs (in the NISQ regime). Second, we add a classical neuron layer at the end of the quantum layer(s). The advantage of the classical layer is that it allows the application of familiar non-linear activation functions from traditional ML. Furthermore, it enables experimentation on real-world datasets. The HQNN used in this paper has three primary ingredients, namely data preparation, QNN construction, and classical postprocessing. Here, we present the typical workflow of the HQNN used in this paper, covering every step from input to output.
Figure 3: Proposed methodology for the unified analysis of HQNNs.

Data Preparation
The first step in the workflow of the HQNN used in this paper is to prepare the data such that it can be efficiently encoded into quantum states for further processing. The data preparation here is in the context of the limitations of NISQ devices (mainly the number of qubits). Depending upon the size of the input data and the choice of encoding scheme, it is sometimes required to reduce the dimension of the input features. In such a case, the input (image(s) in our case) is first passed to a dimensionality reduction algorithm before encoding. We also use dimensionality reduction for one of the encoding schemes to cope with the restrictions of the NISQ era, details of which are presented in the following section, where we explain the data encoding used in this paper as part of QNN construction.

QNN Construction
Once the data is prepared, the next important step in the HQNN workflow is the construction of the QNN, which has four ingredients, namely qubit initialization, data encoding, unitary evolution and qubit measurement.

Qubit Initialization
Qubit initialization is the first step of QNN construction. It defines the total number of qubits or, in other words, the width of the quantum layers. A qubit can be initialized in an arbitrary state. However, we initialize all qubits in the default ground state, i.e., $|\psi\rangle^{\otimes n} = |0\rangle^{\otimes n}$, which is common practice.

Data Encoding
Once the qubits are defined, the next step is data encoding (classical-to-quantum feature mapping), so that the subsequent quantum layer(s) can process the input. We use two data encoding strategies that are frequently used in QNNs: amplitude encoding and angle encoding. As discussed above, the limited number of qubits in NISQ devices can force a reduction of the input feature dimension, depending on the input feature size and on how the data is encoded into quantum states. The input feature dimension of the dataset we use is 64 (details in section 5). We reduce the input feature dimension whenever angle encoding is used, because angle encoding needs n qubits to encode n features (see section 2.1). For amplitude encoding, no dimensionality reduction is performed, because amplitude encoding can encode $2^n$ features in n qubits (see section 2.1).
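To make the qubit requirements of the two encodings concrete, the following minimal Python sketch computes the number of qubits each scheme needs and builds a normalized amplitude vector. The function names are illustrative and not taken from the paper, and zero-padding to a power-of-two length is an assumption for feature counts that are not already powers of two.

```python
import math


def qubits_for_amplitude(num_features: int) -> int:
    """Smallest n with 2**n >= num_features (at least one qubit)."""
    return max(1, (num_features - 1).bit_length())


def qubits_for_angle(num_features: int) -> int:
    """Angle encoding uses one rotation, hence one qubit, per feature."""
    return num_features


def amplitude_encode(features):
    """Zero-pad to length 2**n and L2-normalise so the squares sum to 1.

    Zero-padding is an assumption of this sketch, not a detail from the paper.
    """
    n = qubits_for_amplitude(len(features))
    padded = list(features) + [0.0] * (2**n - len(features))
    norm = math.sqrt(sum(x * x for x in padded))
    if norm == 0.0:
        raise ValueError("cannot encode an all-zero feature vector")
    return [x / norm for x in padded]
```

For the 64-feature digit images this gives 6 qubits for amplitude encoding, whereas angle encoding of the same data without PCA would need 64 qubits.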

Unitary Evolution
The unitary evolution primarily consists of single-qubit parameterized (parameter-dependent) gates and multi-qubit gates (for qubit entanglement). There is no hard-and-fast rule for choosing the combination of gates, and different gate combinations are used heuristically for a given problem. We use two similar ansatz structures (an entangled and an unentangled ansatz) that differ only in whether the entangling gates are included. The complete details of the quantum layers used in this paper, from data encoding through unitary evolution to measurement, are presented below.

Entangled Ansatz
The entangled ansatz consists of single-qubit parameterized unitaries ($R_y(\theta)$) and two-qubit unitaries (CNOT) for qubit entanglement. Taking into account the limitations of NISQ devices, we only consider nearest-neighbor entanglement, where the last qubit in the system is considered a neighbor of the first qubit, as shown in Figure 4.

Unentangled Ansatz
Entanglement is an important property in quantum mechanics with an anticipated potential to enhance various applications, including quantum machine learning. We now modify our ansatz so that it contains only the single-qubit parameterized unitaries, with no entanglement, as shown in Figure 5. The motivation for excluding entanglement is that the only trainable parameters in QNNs are those of the single-qubit unitaries, which are optimized during training.
We attempt to answer the following question: can the unentangled ansatz delay the onset of BP and thereby enhance the trainability of HQNNs, given that entanglement has also been identified as a potential source of BP? We show that whether entanglement is beneficial in HQNNs depends on the type of data encoding. For angle encoding, entanglement does not lead to better performance; in fact, the overall performance improves without any entanglement. For amplitude encoding, on the other hand, the results show the opposite.

For qubit measurement, different measurement bases can be used; the most frequently used is the default computational basis, i.e., the eigenbasis of $\sigma_z$. We also use the computational basis to obtain the output of the QNN.
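The two ansatz structures can be illustrated with a small state-vector simulation. The sketch below is written in plain Python for readability and is not the authors' pennylane implementation; the function names, the qubit ordering (qubit q is bit q of the basis-state index), and the choice to apply the ring of CNOTs after the rotations in every layer are assumptions made for illustration.

```python
import math


def apply_ry(state, qubit, theta):
    """Apply R_y(theta) to one qubit of a real state vector, in place."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    mask = 1 << qubit
    for i in range(len(state)):
        if not i & mask:  # i has the qubit in |0>, j has it in |1>
            j = i | mask
            a0, a1 = state[i], state[j]
            state[i] = c * a0 - s * a1
            state[j] = s * a0 + c * a1


def apply_cnot(state, control, target):
    """Flip the target bit wherever the control bit is 1, in place."""
    cmask, tmask = 1 << control, 1 << target
    for i in range(len(state)):
        if i & cmask and not i & tmask:
            j = i | tmask
            state[i], state[j] = state[j], state[i]


def run_ansatz(thetas, entangled=True):
    """thetas[l][q] is the R_y angle of qubit q in layer l.

    Starts from |0...0> and returns the sigma_z expectation of every qubit.
    """
    n = len(thetas[0])
    state = [0.0] * (2**n)
    state[0] = 1.0
    for layer in thetas:
        for q, theta in enumerate(layer):
            apply_ry(state, q, theta)
        if entangled and n > 1:
            for q in range(n):  # ring: the last qubit couples to the first
                apply_cnot(state, q, (q + 1) % n)
    return [
        sum(a * a * (-1 if (i >> q) & 1 else 1) for i, a in enumerate(state))
        for q in range(n)
    ]
```

Calling `run_ansatz(thetas, entangled=False)` removes only the CNOT ring, which mirrors how the two ansatz structures differ in the text.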

Classical Post Processing
The last step in the HQNN workflow is classical post-processing of the QNN output, which is possible because the result of qubit measurement is a classical value rather than a quantum state. To post-process the QNN results, we use a classical dense layer in which every neuron is connected to the measurement output of every qubit of the PQC. The cost function is then defined, and the optimizer uses it to search for a solution to the given problem. The optimizer updates the trainable parameters in every training iteration, and the PQC is then run again with the updated parameters. The process is repeated until an optimal solution is found.
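The post-processing step can be sketched as a dense layer acting on the per-qubit measurement values, followed by the softmax activation and categorical cross-entropy loss that are specified later in the hyperparameter section. This is a plain-Python illustration under those assumptions, not the keras layer used by the authors; the function names and the `weights[c][q]` layout are hypothetical.

```python
import math


def dense_softmax(z_expvals, weights, biases):
    """Map per-qubit measurement values to class probabilities.

    weights[c][q] is the weight from qubit q's output to class neuron c.
    """
    logits = [
        sum(w * z for w, z in zip(row, z_expvals)) + b
        for row, b in zip(weights, biases)
    ]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_class):
    """Loss of a single sample whose label is an integer class index."""
    return -math.log(max(probs[true_class], 1e-12))
```

With 10 output neurons and all weights and biases set to zero, every class receives probability 0.1 and the loss equals ln 10, which is a convenient sanity check for an untrained network.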
In this paper, the classical specifications of the HQNN (the input, the final neuron layer with its non-linear function, and the classical optimization) are fixed, and the primary focus is on experimenting with the enclosed QNN (quantum layers). This completes the description of the ingredients used to construct the HQNN. The following section presents the experimental details, including the classical components used in the HQNN.

Experimental Setup
The architectural details of the HQNN used in this paper were presented in the previous section (section 4). This section describes how the proposed HQNN architecture is trained and evaluated. In particular, it presents the dataset used for training, the classical hyperparameters of the HQNN, and the list of experiments performed to obtain the results for the proposed analysis.

Data Preprocessing
We consider a multi-class image classification problem. The sklearn digits dataset, which contains images of handwritten digits [78], is used for training and evaluating the HQNN. We use this dataset rather than similar, more popular datasets such as MNIST because of its smaller input feature size, which is more suitable in the NISQ era. The dataset has 10 classes with approximately 180 samples per class, resulting in a total of 1797 samples. Each data point is an 8 × 8 image, resulting in a feature dimension of 64. We use 75% of the samples for training and the remaining 25% for testing the HQNN. The data is encoded using amplitude and angle encoding separately for each experiment. As discussed in section 4, for angle encoding we reduce the input feature dimension before passing the data to the PQC, because the number of input features must equal the total number of qubits. For dimensionality reduction, we use principal component analysis (PCA) from the sklearn library (a popular ML library for the Python programming language), which projects the data onto a smaller number of components while retaining the most important information in the dataset.

Hyperparameters Specifications
Given the nature of the target problem (multi-class classification), we use categorical cross-entropy as the cost function. For classical optimization of the cost function, we use the Adam optimizer [79] with an initial learning rate of 0.01. Since the dataset has 10 classes, the final classical neuron layer in the HQNN has 10 neurons. The non-linear activation function is softmax, which is the usual choice in multi-class classification problems. The models are trained for a maximum of 100 training iterations (epochs). To avoid overfitting, we schedule the learning rate as training progresses and apply early stopping, using the keras library (an open-source software library that provides a Python interface for artificial neural networks). The validation loss is monitored, and if it does not improve for three consecutive iterations, the learning rate is reduced by a factor of 0.1, i.e., new learning rate = previous learning rate × 0.1. If the validation loss does not improve for four consecutive training iterations, training is stopped to avoid overfitting. The training data for all experiments is passed in batches of 16. Finally, we use pennylane (a cross-platform Python library for differentiable programming of quantum computers) [6] for training the HQNNs.
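The learning-rate reduction and stopping rule described above can be sketched as a small simulation over a sequence of validation losses. This is a simplified reading of the rule, not the exact logic of any keras callback; in keras, comparable behavior is typically configured with the `ReduceLROnPlateau` and `EarlyStopping` callbacks, but whether the authors used these exact callbacks is not stated and is an assumption here. The function name and its parameters are hypothetical.

```python
def simulate_schedule(val_losses, lr=0.01, factor=0.1,
                      lr_patience=3, stop_patience=4):
    """Return (epochs_run, final_lr) for per-epoch validation losses.

    The learning rate is multiplied by `factor` after `lr_patience`
    consecutive epochs without improvement, and training stops after
    `stop_patience` such epochs.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
            continue
        stale += 1
        if stale == lr_patience:
            lr *= factor  # new learning rate = previous learning rate * 0.1
        if stale >= stop_patience:
            return epoch, lr  # early stop
    return len(val_losses), lr
```

With the values from the text, a run that stalls receives one reduced-learning-rate epoch before it is stopped, because the two patience values differ by only one iteration.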

List of Experiments
The list of training experiments for our proposed analysis is shown in Figure 6. Since the structure and size of the quantum layers are the primary focus of our analysis, we experiment with different widths (n) and depths (m) of the quantum layers. Here, n denotes the number of qubits, and m denotes the number of repetitions of the quantum layer before the measurement. We restrict the maximum width to n = 14 and the maximum depth to m = 10 for two reasons: 1) the overall accuracy starts to decline for larger n and m, and we speculate that it would decline further as n and m increase because of BP, and 2) we use classical machines to simulate the qubit systems, and even for the simple dataset considered in this paper, training takes around 70-80 hours for the maximum width (n = 14) with the maximum depth (m = 10). Since BP is known to exist in standalone QNNs for sufficiently random and expressible PQCs [14], we first empirically analyze the existence of BP in HQNNs.
In particular, we analyze whether the addition of a classical neuron layer can mitigate BP to any extent. For that purpose, we train the HQNN for different widths (n) and depths (m) of the underlying PQC. Larger values of n and m result in a greater number of trainable parameters, and the resulting ansatz is generally considered more expressible, and vice versa. Varying n and m of the PQC helps to analyze the existence of BP in HQNNs. BP is analyzed for both encodings separately, with both ansatz structures (entangled and unentangled). This provides a general idea of which encoding strategy works well with which ansatz. Having observed the dependence of BP on ansatz depth (expressibility), we then perform a trainability vs. expressibility analysis, again for both encodings individually and for each ansatz structure. This analysis shows how ansatz expressibility contributes to the occurrence of BP and how deep an ansatz can be for a given width before the performance starts declining. It also helps to identify whether there is any relationship between n and m that yields relatively better overall performance. We then compare the performance of both ansatz structures for both encodings to understand the role of entanglement in HQNNs. We conclude our analysis by providing a brief application mapping of HQNNs using other important evaluation metrics for classification tasks, i.e., precision, recall, and F1-score.
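The experiment grid described in this section can be enumerated programmatically. The sketch below reconstructs the grid from the widths and depths mentioned in the text (Figure 6 itself is not reproduced here, so the exact set of runs is an assumption), and it counts the quantum parameters as one $R_y$ angle per qubit per layer, which is also an assumption. All names are illustrative.

```python
from itertools import product

DEPTHS = (2, 4, 6, 8, 10)
WIDTHS = {
    ("amplitude", "entangled"): (6, 8, 10, 12, 14),
    ("amplitude", "unentangled"): (6, 8, 10, 12),  # n = 14 is skipped in the text
    ("angle", "entangled"): (8, 10, 12, 14),
    ("angle", "unentangled"): (8, 10, 12, 14),
}


def experiment_grid():
    """Yield one configuration dict per (encoding, ansatz, n, m) training run."""
    for (encoding, ansatz), widths in WIDTHS.items():
        for n, m in product(widths, DEPTHS):
            yield {
                "encoding": encoding,
                "ansatz": ansatz,
                "n_qubits": n,
                "depth": m,
                # angle encoding needs PCA down to n features; amplitude does not
                "pca_components": n if encoding == "angle" else None,
                # assumes one trainable R_y angle per qubit per layer
                "quantum_params": n * m,
            }


runs = list(experiment_grid())
```

Under these assumptions the grid contains 85 runs, and the largest circuit (n = 14, m = 10) has 140 trainable rotation angles in its quantum layers.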

Results and Discussion
In this section, we present our experimental analysis. We start by demonstrating the existence of BP in HQNNs for both ansatz structures individually, with both encodings used in this paper. Based on our analysis, we conclude that BP depends on the expressibility of the quantum layers in HQNNs. We then perform the trainability vs. expressibility analysis of HQNNs, i.e., how the depth and width of the quantum layer(s) are related to BP and how they affect the network's overall performance. The trainability vs. expressibility analysis reveals that entanglement in the underlying ansatz also plays a role in the overall performance of HQNNs. We then briefly compare both ansatz structures with both encodings to understand the role of entanglement. Finally, to diversify our analysis, we evaluate the HQNNs in terms of other application-oriented evaluation metrics for classification tasks.
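For reference, the additional classification metrics named in the experiment list (precision, recall, and F1-score) can be computed per class from the true and predicted labels. The sketch below is a plain-Python illustration; the macro (unweighted) averaging over classes is an assumption, since the averaging scheme is not specified in the text, and the function names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    metrics = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 over all classes."""
    k = len(metrics)
    return tuple(sum(m[i] for m in metrics) / k for i in range(3))
```

Macro averaging treats every digit class equally, which is a reasonable default for the nearly balanced digits dataset; a weighted average would be an equally valid choice.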

Demonstration of BP Existence in HQNNs
In this section, we analyze the existence of BP in the training landscapes of HQNNs, using the mean accuracy of the HQNNs as the benchmark. The BP analysis is performed for both encodings individually, for both the entangled and unentangled ansatz structures. The existence of BP hinders the learning process of HQNNs, eventually resulting in lower accuracy. Consequently, in our analysis, higher accuracy is taken to indicate that the model is not yet fully exposed to BP, and vice versa.

BP Demonstration for Entangled Ansatz with Amplitude Encoding
The entangled ansatz contains single-qubit unitaries and nearest-neighbor qubit entanglement, as shown in Figure 4. The original input feature size to be encoded into quantum states is 64, as discussed in section 5. Hence, with amplitude encoding, the minimum number of qubits required to encode these features is n = 6, which is a reasonable width for the quantum layer(s) in the NISQ era. Therefore, when using amplitude encoding, we do not apply PCA for feature reduction, and all input features are encoded directly. We then vary n and m according to Figure 6 to analyze the occurrence of BP. The mean accuracy for all experiments is shown in Figure 7. To analyze the effect of width (n) irrespective of depth (m), the mean accuracy of the training experiments is plotted for fixed m and variable n in Figure 7a. The general accuracy trend in Figure 7a is declining. This decline in performance as the number of qubits (n) increases indicates the possible presence of BP in the training landscape. Conversely, to observe the effect of depth (m) irrespective of width (n), the mean accuracy is plotted for fixed n and variable m in Figure 7b. We observe that there is a trade-off between the depth and width of the quantum layers in achieving relatively better performance: for smaller n, the allowable m is relatively greater, and vice versa. One possible reason behind the trade-off is overparameterization, which in the HQNNs used in this paper results from an increase in n and m. Overparameterization is vital in classical deep learning because of the complexity of modern applications, and it often helps in learning intricate relationships in the input data. However, HQNNs do not benefit from overparameterization in the same way, and an optimal number of parameters (optimal n and m) must be determined to achieve relatively better performance.

BP Demonstration for Entangled Ansatz with Angle Encoding
In angle encoding, the number of qubits, i.e., the width of the quantum layer (n), must equal the input feature dimension for successful classical-to-quantum feature mapping, as discussed in section 2.1. Since the input feature size of our data is 64, we would need at least 64 qubits to encode the data directly into quantum states. Such a large number of qubits is not suitable for the NISQ era. Therefore, we apply PCA to reduce the input feature dimension when experimenting with angle encoding. The reduced dimension matches the width of the quantum layers (n) from Figure 6. However, considering both the NISQ limitations and the need to avoid losing too much information during dimensionality reduction, the minimum width for angle encoding is n = 8. As with amplitude encoding, experiments are performed for different n and m. The mean accuracy for all experiments is shown in Figure 8. In classical ML, it is commonly conjectured that more input features improve the overall performance, given a sufficiently complex underlying model. In angle encoding, the number of input features equals n, so every increase in n results in a greater input feature dimension. Hence, following traditional ML, wider quantum layers should yield better performance, since they are fed with more features than narrower quantum layers. However, this is not the case in HQNNs, because an increase in n reduces the accuracy beyond a certain m, as shown in Figure 8. To analyze the effect of quantum layer width, the mean accuracy is plotted as a function of n in Figure 8a. As n increases, the accuracy tends to decrease, particularly for larger depths (m = 8 and 10). For relatively smaller depths (m = 4, 6), the accuracy improvement as n increases is negligible, indicating the presence of BP. To analyze the effect of quantum layer depth, the mean accuracy is plotted as a function of m in Figure 8b. In general, for a given n, the accuracy tends to improve up to a certain m and then starts declining. Hence, for angle encoding too, there is a trade-off between n and m in achieving better overall performance.

BP Demonstration for Unentangled Ansatz with Amplitude Encoding
For amplitude encoding, the input feature size is unchanged, i.e., 64. We start with the minimum required width of the quantum layers (n = 6) and perform the experiments from Figure 6. For the sake of experimental simplicity, we skip n = 14 because BP is already observed going from n = 6 to n = 12. The mean accuracy for all training experiments is shown in Figure 9. The mean accuracy is plotted as a function of n in Figure 9a to analyze how an increase in the number of qubits affects the performance of the HQNN when no entanglement is included in the quantum layers. As with the entangled ansatz, the accuracy starts declining as n increases, irrespective of m. The accuracy decline is a clear indication of the occurrence of BP. Furthermore, for the unentangled ansatz the accuracy decline with increasing n is quite pronounced, leading to the conclusion that the absence of entanglement in the quantum layers makes HQNNs more susceptible to BP. Similarly, the mean accuracy is plotted as a function of m (Figure 9b), which helps in analyzing the role of m in HQNNs for a fixed n. Unlike the entangled ansatz with amplitude encoding, where the allowed circuit depth m varies with n, for the unentangled ansatz an increase in m results in a further reduction (or a negligible improvement) in accuracy for all n. The performance decline with increasing m is more evident for larger n, leading to the conclusion that, for the unentangled ansatz, smaller n with smaller m is more appropriate when constructing quantum layers. Larger n and m make unentangled quantum layers more prone to BP.

BP Demonstration for Unentangled Ansatz with Angle Encoding
As in the BP analysis of the entangled ansatz with angle encoding, the input feature dimension is reduced to match a reasonable width of the quantum layer(s). We start with a minimum width of n = 8. The mean accuracy for all experiments is shown in Figure 10. Analyzing the effect of n from the BP perspective, the accuracy improves in general as n increases, as shown in Figure 10a. This is the opposite of the results for the entangled ansatz (Figure 8a), where the accuracy starts decreasing after a certain m despite the increase in the number of input features (with an increase in n). This leads to the conclusion that the unentangled ansatz is less susceptible to BP than the entangled ansatz when the data is encoded in qubit rotation angles. The results in Figure 10b help in analyzing the effect of m on HQNN performance when the data is encoded via angle encoding. The accuracy generally deteriorates with an increase in m. The smallest depth (m = 2) performs better than all other m for all n except n = 12. However, even in that case (n = 12), the accuracy improvement from m = 2 to m = 10 is not significant.

Trainability Vs Expressibility of HQNNs
A more careful analysis of the results in the previous section shows that the decline in performance does not follow a consistent relationship (quadratic, exponential, etc.) but depends on the depth m of the quantum layer(s). Consequently, a more detailed analysis of how deep the quantum layer(s) can be for a given width before experiencing BP, and the resulting performance decline, is important. In this section, we conduct an empirical analysis of trainability vs. expressibility of HQNNs, again for both ansatz structures individually with both encodings. We use the model's overall accuracy and loss convergence as evaluation metrics for the trainability vs. expressibility analysis.

Trainability Vs. Expressibility of Entangled Ansatz with Amplitude Encoding
The trainability vs. expressibility analysis of HQNNs gives insight into how expressible a given ansatz can be (before the gradients start vanishing) for a particular width while retaining better, or at least equivalent, performance. We perform training experiments for different n and m, as shown in Figure 6. The model is evaluated based on overall accuracy and convergence. The results are shown in Figure 11. In Figure 11, for n = 6 the deeper quantum layers, i.e., m = 8 and m = 10, have better accuracy; however, convergence is better for 10 layers, as shown in Figure 12a. When we increase n to 8 and 10 qubits, circuit depths of m = 6, 8, and 10 have relatively better accuracy. However, in both cases (n = 8 and n = 10) the model converges relatively faster when m = 6, as shown in Figures 12b and 12c, respectively, which reduces the appropriate depth of the quantum layers when more qubits are used. When n is increased to 12 qubits, the appropriate circuit depth is further reduced to m = 4, both in terms of accuracy and convergence, as shown in Figures 11 and 12d, respectively. The reduction in appropriate depth becomes more evident when we further increase n to 14 qubits, as the model clearly converges faster for m = 4, as shown in Figure 12e.
Considering the results shown in Figures 11 and 12, we conclude that the BP phenomenon in HQNNs is not only a function of the number of qubits (gradients vanish exponentially with the number of qubits, and the network becomes untrainable as qubits are added) but also depends on how expressible the quantum layer(s) are. If we analyze the accuracy in Figure 11 as a whole, we observe that the performance starts deteriorating as we increase n. Although we observe a trade-off between n and m (larger n tends to reduce m and vice versa), a careful analysis of the individual accuracies reveals that smaller n and relatively larger m are more appropriate for achieving better performance, because the achieved accuracies are higher and skewed toward the higher side. For instance, for n = 6 and m = 8, almost 75% of the accuracies are higher than the highest accuracy achieved for n = 14 and m = 4. Furthermore, the accuracies for larger n have more variance (more spread) than those for smaller n, indicating non-robust learning in the case of wider quantum layers.

Trainability Vs. Expressibility of Entangled Ansatz with Angle Encoding
We now analyze how a different data encoding strategy, i.e., angle encoding, where the input features are encoded in qubit rotation angles, affects the trainability of QNNs for given n and m. We encode the data in RY rotations, and the encoded state is then passed to the quantum layer(s). For angle encoding, as explained in Section 2.1, the number of features being encoded must equal the number of qubits. Our inputs are images of size 8 × 8 and hence have 64 features in total. We would need 64 qubits to encode 64 features, which is not suitable in the NISQ era. Therefore, we apply PCA and reduce the input feature dimension to 8, 10, 12, and 14 features. The width of the quantum layer(s) (n) equals the dimension of the input features, and we experiment with quantum layers of variable depth (m = 2, 4, 6, 8, 10), as in the analysis with amplitude encoding and as shown in Figure 6, except for n = 6, because we keep n larger when applying PCA so that not much information is lost. The accuracy and loss for all experiments are shown in Figures 13 and 14, respectively. Based on the results, we observe that for n = 8, a circuit depth of m = 8 is better in terms of both accuracy (Figure 13) and convergence (Figure 14a), whereas the smallest depth (m = 2) is the worst among all values of m. When the circuit width is increased to n = 10, the circuit depth with the best performance, in terms of both accuracy and convergence, is m = 6, reduced from m = 8. Even though the input feature dimension is also increased (from 8 to 10), the appropriate circuit depth tends to decrease, mainly because the gradient magnitudes get smaller. For n = 12, almost all depths except m = 2 have comparable accuracy, but m = 6 has a slightly better convergence rate than the others. However, the individual accuracies drop relative to n = 8 and n = 10. Similarly, for n = 14, a circuit depth of m = 14 is better in terms of both accuracy and convergence rate. Unlike amplitude encoding, the allowable circuit depth is greater for larger n when the data is encoded in qubit rotation angles, because a greater number of input features is available for training compared to smaller n.
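For reference, angle encoding with RY rotations maps the n PCA-reduced features to a product state. The expression below is the standard textbook form of this mapping; the symbol $x_k$ for the k-th (suitably rescaled) feature is notation introduced here and does not appear in the original text.

```latex
|x\rangle \;=\; \bigotimes_{k=1}^{n} R_y(x_k)\,|0\rangle
          \;=\; \bigotimes_{k=1}^{n} \left( \cos\frac{x_k}{2}\,|0\rangle + \sin\frac{x_k}{2}\,|1\rangle \right)
```

Each feature therefore occupies exactly one qubit, which is why the number of PCA components must equal the width n of the quantum layer.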

Trainability Vs. Expressibility of Unentangled Ansatz with Amplitude Encoding
We now present the trainability vs. expressibility analysis for the unentangled ansatz, with the input data encoded in the qubit state vector. As discussed previously, dimensionality reduction is not required for amplitude encoding with our dataset, and PCA is therefore not applied. The experiments are the same as for the entangled ansatz, i.e., varying the width and depth of the quantum layers as shown in Figure 6.
Based on the obtained results, we observe a consistent decline in performance as n increases; we therefore skip the experiments for n = 14 and all corresponding m. The accuracy and loss trends for the unentangled ansatz are presented in Figures 15 and 16, respectively.
Figure 15.: Unentangled Ansatz: Accuracy trends for all n and m for amplitude encoding.
We observe that, unlike amplitude encoding with the entangled ansatz, where the allowed circuit depth m is relatively greater for smaller width n, with the unentangled ansatz, i.e., when entanglement is removed, the allowed circuit depth tends to be lower even for smaller n. For n = 6, almost all depths m yield roughly the same performance in terms of both accuracy and loss convergence, as shown in Figures 15 and 16a, respectively. However, the smallest depth (m = 2) is slightly better than the other m, in contrast to the entangled ansatz, where m = 8 and 10 turned out to be better for n = 6 in terms of both performance metrics (Figures 11 and 12a). Similarly, for n = 8 the relatively shallow quantum layers, typically m = 4 and 6, are slightly better than other depths with respect to both accuracy and convergence, as shown in Figures 15 and 16b, respectively. This is again in contrast to the entangled ansatz, where n = 8 has the best performance with the maximum depth (m = 10) and the worst with the smallest depth (m = 2), as presented in Figures 11 and 12b. When n is further increased to 10 and 12, the overall performance deteriorates (Figure 15); however, analogous to the results for the entangled ansatz at larger n, relatively smaller depths, typically m = 2 and 4, are better than greater depths.
Furthermore, as n increases, the individual performance for all m in both ansatz structures becomes more inconsistent, as shown in Figures 15 and 11; this inconsistency is more prominent for the unentangled ansatz with amplitude encoding. Also, without entanglement in the quantum layers, for larger n almost all circuit depths follow a fluctuating path toward a minimum of the cost function landscape, clearly exhibiting the existence of BP (the optimizer is unable to determine the cost-minimizing direction). It is also worth mentioning that for all n and m with the unentangled ansatz, the HQNNs get stuck in local minima and fail to converge even after 100 training iterations, and they are insensitive to the regularization technique (we use early stopping in this work to avoid overfitting). With the entangled ansatz, in contrast, the model is more robust and follows a relatively better and smoother path to a minimum of the cost function landscape, exhibiting less insensitivity to regularization.

Trainability Vs. Expressibility of Unentangled Ansatz with Angle Encoding
We observe that the removal of entanglement in the case of angle encoding affects the trainability vs. expressibility of HQNNs. The loss convergence for all n and m (Figure 18) is not clearly distinguishable, and we therefore use accuracy (Figure 17) to analyze the effect of n and m on trainability. Unlike angle encoding with the entangled ansatz, where the expressibility tends to reduce slightly for larger n even with an increase in input features, for the unentangled ansatz the allowed circuit depth is greater, leading to more trainable parameters in the hidden quantum layers and to better overall performance for relatively larger n and m. Moreover, for the entangled ansatz, narrower quantum layers (n = 8) have better accuracy with relatively deeper quantum layers (m = 8), as shown in Figure 13, whereas for the unentangled ansatz, the HQNN has comparable performance for almost all circuit depths (Figure 17). For wider quantum layers (n = 14) in the unentangled ansatz, again all depths have comparable performance. However, a smaller depth (m = 2) is more feasible, because it not only has slightly better performance but also requires less training time than deeper layers. In conventional ML, it is commonly conjectured that, for a reasonably complex model, increasing the input data yields better performance. In HQNNs, when the data is encoded in rotation angles, the input feature dimension must be increased whenever the width of the quantum layers is increased.
From the experimental results presented in this paper, we conclude that for the entangled ansatz, HQNNs do not follow the conventional ML belief of better performance with more input data (Figure 13), because of BP caused by the increase in the width of the hidden quantum layers. However, for the unentangled ansatz, we observe that this presumption (more data = better accuracy) holds, and increasing the input feature dimension does in fact lead to better overall performance. Based on these observations, we can state that the removal of entanglement in the case of angle encoding can potentially avoid (or delay) BP. Furthermore, unlike amplitude encoding with the unentangled ansatz, where the model struggles to find the cost-minimizing direction as n increases for almost all corresponding m (Figure 16), with angle encoding the path to a minimum of the cost function landscape is smoother, as shown in Figure 18. Lastly, although the learning process of the unentangled ansatz is more reliable with angle encoding than with amplitude encoding, it also exhibits some insensitivity to the regularization technique, specifically for larger n, as can be seen in Figure 18c. However, the insensitivity to regularization is more prominent for amplitude encoding (Figure 16). Based on the trainability vs. expressibility analysis of both the entangled and unentangled ansatzes, we conclude that entanglement does affect the training of HQNNs, and eventually their performance, depending on the underlying encoding strategy. This calls for a direct comparison of the ansatzes (with and without entanglement) to understand the role of entanglement in HQNNs. The comparison is presented in the following section.

Effect of Entanglement in HQNNs
The previous sections presented the analysis of BP existence in HQNNs along with their trainability vs. expressibility. However, that analysis does not explicitly highlight the importance of including or removing entanglement in the underlying quantum layers. Entanglement is a fundamental property of quantum mechanics and is key to constructing expressible PQCs in HQNNs. Consequently, it is vital to understand the role of entanglement in HQNNs for real-world applications. In this section, we compare the results obtained for both ansatzes and analyze whether entanglement plays any role in the overall performance of HQNNs. We observe that entanglement does affect the HQNN's performance. However, whether its effect is a performance enhancement or a degradation depends on how the data is encoded. Therefore, to understand the role of entanglement in HQNNs, we compare both ansatz structures first with amplitude encoding and then with angle encoding.

Effect of Entanglement in HQNNs -Amplitude Encoding
When the classical data is encoded in the qubit state vector and then processed by a PQC, the problem of vanishing gradients for the unentangled ansatz quantum layers is quite prominent, as discussed in section 6.1.3, resulting in a significant reduction in the performance of HQNNs. A brief comparison of the performance of both ansatzes is shown in Figure 19. We observe that, for all n and corresponding m, the performance of the underlying HQNN with entangled ansatz quantum layers is significantly better than with the unentangled ansatz. It can also be observed that the performance degrades as the number of qubits is increased. Based on the results shown and discussed, we conclude that the inclusion of entanglement in the quantum layers of HQNNs is beneficial when the data is encoded in qubit amplitudes. Consequently, it is recommended to include entangling gates, rather than single-qubit parameterized unitaries only, when using the amplitude embedding approach, for better performance.

Effect of Entanglement in HQNNs - Angle Encoding
Entanglement also comes into play when the classical data is encoded into qubit rotation angles and then processed by parameterized quantum layers. However, unlike amplitude encoding, the unentangled ansatz quantum layers result in performance enhancement in the case of angle encoding. As shown before in Figures 10a and 10b, without any entanglement in the quantum layers of the underlying HQNNs, the allowed circuit depth is greater than that of quantum layers with entanglement (Figures 8a and 8b). The performance comparison of both ansatzes when the data is encoded in qubit rotation angles is shown in Figure 20. Unlike amplitude encoding, here we observe that, in general, the model performs better when no entanglement is included in the quantum layers, for all n and m. The performance enhancement becomes more prominent as n increases.
Based on the results, we conclude that removing entanglement (unentangled ansatz) enhances the model's performance when the data is encoded using angle encoding. Moreover, not only is the allowed depth of the quantum layer(s) greater than that of the entangled ansatz quantum layers, but the training time is also reduced.
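To make the difference between the two encodings concrete, the following is a minimal pure-Python sketch of the state each encoding prepares. It is an illustration under stated assumptions, not the implementation used in this work: the function names `amplitude_encode` and `angle_encode` are hypothetical, zero-padding of short feature vectors is an assumption, and the angle encoding uses R_y rotations as named in the caption of Figure 4.

```python
import math

def amplitude_encode(features, n_qubits):
    """Return the 2**n_qubits real amplitudes of the encoded state.

    Assumption: features shorter than 2**n_qubits are zero-padded,
    then the vector is L2-normalized so the squared amplitudes sum to 1.
    """
    dim = 2 ** n_qubits
    if len(features) > dim:
        raise ValueError("too many features for this number of qubits")
    padded = list(features) + [0.0] * (dim - len(features))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("the all-zero vector cannot be normalized")
    return [v / norm for v in padded]

def angle_encode(features):
    """Return the product state of RY(x_i)|0> over all qubits.

    One feature per qubit; qubit 0 is the most significant bit
    of the basis-state index.
    """
    state = [1.0]
    for x in features:
        qubit = [math.cos(x / 2.0), math.sin(x / 2.0)]  # RY(x)|0>
        state = [a * b for a in state for b in qubit]   # tensor product
    return state

amp = amplitude_encode([3.0, 4.0], n_qubits=1)   # one qubit holds two features
ang = angle_encode([0.0, math.pi])               # two qubits hold two features
```

In this sketch, amplitude encoding stores up to 2^n features in n qubits, whereas angle encoding uses one qubit per feature and always produces an unentangled product state before the ansatz is applied.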

Application-Oriented Evaluation of HQNNs for Classification Tasks
In the previous sections, we discussed the results of our proposed framework based on accuracy and loss convergence. However, for classification tasks (the target application in this paper), accuracy alone is not a sufficient metric to gauge model performance because it only shows the percentage of correct predictions out of the total predictions. Therefore, a more diverse set of evaluation metrics is used for classification problems.
Although we consider a multi-class classification problem in this paper, explaining the other evaluation metrics (precision, recall and F1-score) for multi-class classification is rather tricky. Therefore, for simplicity, we demonstrate the need for more diverse evaluation metrics using binary classification; the same metrics are directly applicable to multi-class classification. In binary classification, there are two classes (positive and negative), and the ML model aims to predict the correct class for each sample. Conventionally, the accuracy in this case is the proportion of correctly predicted samples, regardless of which class was predicted.
When a positive sample is classified as negative, it is called a False Negative (FN), and a negative sample predicted as positive is known as a False Positive (FP). When positive and negative samples are correctly assigned to their respective classes, they are called True Positives (TP) and True Negatives (TN), respectively. Categorizing the predictions in this way allows us to calculate other important metrics, namely precision and recall. Precision tells us what proportion of the positive predictions are truly positive, whereas recall tells us what proportion of the actual positives are correctly classified. Another important evaluation metric is the F1-score, which is the harmonic mean of precision and recall. Mathematically, these metrics can be calculated using the following equations.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{14}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$

High precision and recall scores are always desirable, but in practice classifiers are prone to errors and can yield different precision and recall scores. Therefore, a trade-off between these two scores may need to be made, and it is highly application dependent. For instance, for a video recommendation system, high precision would be more desirable to make sure that the videos recommended to the user are actually relevant. Similarly, a classifier to detect cancer in patients would need high recall so that as many cancer patients as possible are correctly diagnosed. The F1-score is typically more useful for comparing the performance of classifiers. For instance, when one classifier has better precision while another has better recall, the F1-score is an appropriate metric to pick the best classifier.
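The metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in this work: the function names are hypothetical, and the extension to multiple classes uses one-vs-rest counts with macro averaging, which is an assumption since the averaging scheme is not stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from binary confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_scores(y_true, y_pred):
    """Assumed multi-class extension: one-vs-rest counts, macro-averaged."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        per_class.append(precision_recall_f1(tp, fp, fn))
    k = len(per_class)
    return tuple(sum(scores[i] for scores in per_class) / k for i in range(3))

binary = precision_recall_f1(tp=8, fp=2, fn=4)
macro = macro_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

In the binary example, precision (0.8) exceeds recall (about 0.67), which is the kind of imbalance that accuracy alone would not reveal.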
We now briefly evaluate the HQNNs with respect to the target (classification) application(s) for both the ansatz structures and data encodings used in this paper. Based on the results discussed in sections 6.2 and 6.3, we conclude that the entangled ansatz performs better with amplitude encoding, whereas the unentangled ansatz yields better performance with angle encoding. Therefore, we present the application-oriented evaluation of HQNNs only for the best-performing ansatz structures with their corresponding encodings (entangled ansatz with amplitude encoding and unentangled ansatz with angle encoding). Classifiers with both high precision and high recall (close to the overall accuracy) are generally considered to be good classifiers. We observe that in HQNNs, the precision and recall scores follow the same trends as accuracy for all the respective experiments performed in this paper, making them more reliable for a wide range of applications.

Application-Oriented Evaluation - Amplitude Encoding
We first present the application-oriented analysis for amplitude encoding with the entangled ansatz. The precision, recall and F1-score are plotted as a function of m for all n, as shown in Figure 21. We observe that, for all n and m, the precision, recall and corresponding F1-score rise and fall at almost the same rate, so the model is equally applicable to both high-precision and high-recall applications. The precision, recall and F1-score of the entangled ansatz with amplitude encoding follow almost the same trend as the accuracy (Figure 11). This means that for smaller n, we require a relatively bigger m (and vice versa) to achieve better recall and precision scores.

Application-Oriented Evaluation - Angle Encoding
Since the unentangled ansatz performs better with angle encoding, we only consider these results for the application-oriented analysis. The precision, recall and F1-score for fixed n and variable m are presented in Figure 22. All the performance metrics rise and fall at the same rate, analogous to the case of amplitude encoding, and the model is hence suitable for both kinds of applications (high precision or high recall). However, encoding the data in qubit rotation angles yields slightly better recall than precision for all the training experiments, and is therefore more appropriate for high-recall applications.

The previous sections of this paper have presented and discussed a framework that focuses on the analysis of three major components of HQNNs: data encoding, ansatz expressibility, and entanglement inclusion/removal. Various scenarios can arise from constraints on these three components. Therefore, the constraints on these components of HQNN design, both individually and with respect to each other, need to be considered. In this section, we highlight the significance of our proposed framework for various constraint scenarios by recommending specifications for the different parameters of HQNN design based on our results in section 6.2. Summaries of the recommendations for each constraint scenario can be found in Appendix A.

No Constraint
In this scenario, we assume the designer has no constraint on any parameter and is free to choose any type of data encoding (amplitude or angle) and an arbitrary ansatz width and depth (expressibility). Furthermore, the inclusion or removal of entanglement is also not a constraint. In such a scenario, based on the results of our proposed framework, it is recommended to use angle encoding with no entanglement between qubits in the underlying quantum layers. Furthermore, based on both overall accuracy and loss convergence, an ansatz width of n = 10 and depth of m = 4 is recommended, as shown in Figures 17 and 18b. This is the best result obtained with our framework.

Constraint on a Single Parameter
In this scenario, we consider the case where there is a constraint on a single design parameter among data encoding, ansatz expressibility and entanglement in the quantum layers.

Constraint on Data Encoding.
Here, we assume that there is a constraint on the data encoding scheme to be used. If it is required to encode the data in qubit amplitudes, then it is recommended to include entanglement in the underlying quantum layers. Moreover, the optimal width and depth of the quantum layers are n = 6 and m = 10, as shown in Figures 11 and 12a, respectively. On the other hand, if angle encoding must be used, then the same HQNN specification as discussed in section 7.1 can be selected.

Constraint on Ansatz Expressibility.
Ansatz expressibility has two factors: ansatz width and ansatz depth. If the constraint is on ansatz width, then for smaller widths (n = 6, 8), amplitude encoding with moderate depth (m = 6) is more appropriate for improving accuracy and convergence, as shown in Figures 11 and 12b. In addition, it is recommended to include entanglement. On the other hand, for bigger widths (n = 10, 12, 14), angle encoding is a better choice. For the selection of the other parameters in this scenario, the discussion in section 7.1 can be followed.
If the constraint is on the depth of the quantum layers, then for smaller depths (m = 2, 4), angle encoding with the unentangled ansatz works better. Moreover, the allowable width of the quantum layers is greater, i.e., n = 10, 12, 14. Although wider quantum layers are allowed, using n = 10 would be more appropriate due to the shorter training time. Similarly, for moderate to deep quantum layers (m = 6, 8, 10), amplitude encoding (with entanglement) and angle encoding (without entanglement) have comparable performance in terms of accuracy. However, with amplitude encoding the allowable width of the quantum layers is relatively smaller (n = 6), whereas angle encoding allows wider quantum layers (n = 10, 12, 14) and is hence less susceptible to BP.

Constraint on Entanglement.
When there is a constraint requiring the qubits to be entangled, amplitude encoding with an ansatz width and depth of n = 8 and m = 6 can be used. On the other hand, when there is a constraint requiring that the qubits not be entangled, angle encoding performs better, and the same specification as discussed in section 7.1 can be picked.
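The recommendations above that reduce to a single fixed configuration can be summarized as a small lookup. The sketch below is only an illustrative restatement of the text: the names `Spec` and `recommend` are hypothetical, and it deliberately covers only the no-constraint case, the angle-encoding constraint, and the two entanglement constraints, raising an error for anything else.

```python
from typing import NamedTuple, Optional

class Spec(NamedTuple):
    """Hypothetical container for one recommended HQNN configuration."""
    encoding: str      # "angle" or "amplitude"
    entangled: bool    # whether the ansatz contains entangling gates
    width_n: int       # number of qubits n
    depth_m: int       # number of ansatz layers m

# Best overall result reported for the no-constraint scenario.
BEST_OVERALL = Spec("angle", False, 10, 4)

def recommend(encoding: Optional[str] = None,
              entangled: Optional[bool] = None) -> Spec:
    """Return the recommended specification for a subset of scenarios."""
    if encoding is None and entangled is None:
        return BEST_OVERALL                    # no constraint
    if encoding == "angle" and entangled is None:
        return BEST_OVERALL                    # angle encoding required
    if encoding is None and entangled is False:
        return BEST_OVERALL                    # entanglement not allowed
    if encoding is None and entangled is True:
        return Spec("amplitude", True, 8, 6)   # entanglement required
    raise NotImplementedError("scenario not covered by this sketch")

default_spec = recommend()
entangled_spec = recommend(entangled=True)
```

Constraints on width or depth involve ranges of n and m rather than a single configuration, so they are left to the tables in Appendix A.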

Constraint on Two Parameters
In this scenario, we consider the case where there is a constraint on any two of the parameters among data encoding, ansatz expressibility and entanglement inclusion/removal.

Constraint on Data Encoding and Ansatz Expressibility.
The ansatz expressibility has two factors: the width and depth of the quantum layers. Therefore, we separately consider constraints on each of these expressibility factors along with constraints on data encoding. We first consider constraints on the data encoding and ansatz width. For both amplitude and angle encoding, the appropriate depth is greater (m = 8, 10) for smaller widths (n = 8, 10), whereas for bigger widths (n = 12, 14) the appropriate depth reduces to m = 2, 4. Similarly, when the constraint is on ansatz depth and data encoding, there is again a trade-off between ansatz depth and width. Furthermore, in both cases (constraint on ansatz width or depth), it is recommended to entangle the qubits when encoding in qubit amplitudes, whereas for angle encoding, entangling the qubits does not improve the performance.

Constraint on Data Encoding and Entanglement.
We now consider a setting where there is a constraint on data encoding and entanglement inclusion/removal. If there is a constraint requiring the data to be encoded in qubit amplitudes, then irrespective of whether the qubits are entangled or not, the allowable width of the quantum layers is relatively small, i.e., n = 6, 8. Bigger widths lead to a performance decline, indicating the presence of BP. However, when the qubits are entangled in the underlying quantum layers, the allowable depth is bigger (m = 8, 10), whereas without qubit entanglement, the allowable depth is reduced (m = 2, 4).
Similarly, when the data must be encoded in qubit rotation angles, then irrespective of entanglement, the appropriate widths are bigger than for amplitude encoding (n = 10, 12, 14), making the model less prone to BP and eventually leading to better performance. However, in the case of entanglement inclusion, the allowable depth of the quantum layers is greater (m = 8, 10), whereas in the case of no entanglement, the allowable depth is reduced (m = 2, 4).

Constraint on Expressibility and Entanglement.
We first consider the setting where there is a constraint on the width factor of expressibility while the depth can be arbitrary. Moreover, the entanglement inclusion/removal is also constrained.
When the qubits in the underlying quantum layers are constrained to be entangled and the width of the quantum layers is required to be smaller (n = 8, 10), amplitude encoding is better, with deeper quantum layers (m = 10). However, if the required width of the quantum layers is greater (n = 12, 14) along with entanglement inclusion, then angle encoding is relatively better and the appropriate depth of the underlying quantum layers is smaller (m = 2). On the other hand, when entanglement is constrained to be removed from the quantum layers, then irrespective of the width constraint, angle encoding is better than amplitude encoding and the appropriate depth is typically smaller (m = 2, 4) for all the widths used in this paper.
We now consider the setting where there is a constraint on the depth factor of expressibility along with entanglement inclusion/removal. When the constraint is to have a smaller depth (m = 2 or 4) and entanglement is also constrained to be included, both encodings (amplitude or angle) achieve similar performance and allow wider quantum layers (n = 10). However, when the constraint is to have deeper quantum layers (m = 6, 8 or 10) along with entanglement inclusion, amplitude encoding performs better than angle encoding. Moreover, the appropriate width of the quantum layers is also greater (n = 8). When entanglement is required to be removed, then irrespective of the depth, angle encoding is better than amplitude encoding and the appropriate width of the underlying quantum layers is typically higher (n = 12 or 14).

Constraint on All Three Parameters
In this scenario, we consider the setting where there is a constraint on all the parameters (data encoding, ansatz expressibility and entanglement) of the HQNN. As discussed earlier, the ansatz expressibility has two factors, i.e., the width and depth of the quantum layers. Therefore, we separately consider constraints on each of these expressibility factors along with constraints on data encoding and entanglement.

Constraint on Data Encoding, Ansatz Width and Entanglement.
We first consider the setting where the ansatz width is constrained along with data encoding and entanglement inclusion/removal. When the data must be encoded in qubit amplitudes, entanglement is constrained to be included, and the width is required to be smaller (typically n = 6, 8, 10), then to achieve relatively better performance, the depth of the underlying quantum layers should be greater (typically m = 10). However, if the width is constrained to be bigger (n = 12, 14), then the appropriate depth reduces to m = 4. Similarly, when entanglement is required to be removed, then irrespective of the constrained width, the appropriate depth is typically smaller (m = 2).
On the other hand, when the data must be encoded in qubit rotation angles, entanglement is required to be included and the ansatz width is required to be smaller (typically n = 8, 10), the appropriate ansatz depth is relatively greater (m = 8). However, when the ansatz width is constrained to be bigger (typically n = 12, 14), then analogous to the case of amplitude encoding, the appropriate ansatz depth is reduced (typically m = 4). Similarly, when entanglement is required to be removed and the data is required to be encoded in qubit angles, then irrespective of the constrained width, a smaller ansatz depth (m = 2) is more appropriate.

Constraint on Data Encoding, Ansatz Depth and Entanglement.
We now consider the setting where the ansatz depth is constrained along with data encoding and entanglement inclusion/removal. In such a setting, the ansatz width can be arbitrary. When the data is constrained to be encoded in qubit amplitudes, entanglement is also required to be included and the ansatz depth is constrained to be smaller (typically m = 2, 4), the appropriate ansatz width is slightly bigger (n = 10, 12). However, when the ansatz depth is constrained to be moderate, i.e., m = 6, the appropriate ansatz width reduces to n = 8, which further reduces to n = 6 for a deeper ansatz (m = 8, 10). Similarly, when entanglement is constrained to be removed, then irrespective of the depth constraint, the moderate ansatz width (n = 6) is more appropriate to achieve relatively better performance.
On the other hand, when the constraint is to encode the data in qubit angles, entanglement is required to be included and the ansatz depth is constrained to be small to moderate (typically m = 2, 4, 6), the appropriate width is relatively bigger (n = 12, 14). However, if the ansatz depth is constrained to be bigger (m = 8, 10), the appropriate width reduces to n = 8. Similarly, when entanglement is constrained to be removed and the data is constrained to be encoded in qubit angles, then irrespective of the depth constraint, the appropriate ansatz width is bigger, i.e., n = 12, 14.

Conclusion
Quantum machine learning (QML) has recently emerged as one of the potential applications of quantum computing, attempting to improve classical machine learning by harnessing quantum mechanical phenomena. In QML, quantum neural networks (QNNs) are widely explored because of the unparalleled success of their classical counterparts, namely neural networks (NNs). However, the practical applicability of QNNs is challenged by the phenomenon of barren plateaus (BP), where the gradients of the parameters become exponentially small as the system size increases, potentially making QNNs untrainable. To this end, the primary components of QNNs, i.e., data encoding, ansatz expressibility, and entanglement between qubits, have been identified as potential sources of BP. All these components have been studied individually from the aspect of BP; however, they exist simultaneously in a practical setting. Therefore, investigating their joint effect with respect to each other is of significant importance for practical applications.
In this paper, we propose a framework to empirically investigate the holistic effect of all the aforementioned components of QNNs for a practical application, namely multi-class classification. In a practical setting, because of the limitations of noisy intermediate-scale quantum devices, hybrid quantum neural networks (HQNNs) are widely used to explore the potential quantum advantage of QNNs. Since HQNNs completely replicate the general QNN architecture (with some classical input pre- and post-processing), the analysis of the quantum parts of an HQNN is directly applicable to QNNs. The HQNNs used in our analysis consist of the following sequence of operations: 1) input dimensionality reduction, 2) qubit initialization, 3) data encoding (classical to quantum feature mapping), 4) quantum ansatz (parameterized quantum circuit), 5) qubit measurements and 6) a dense classical neuron layer to post-process the qubit measurement results and produce the output.
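Steps 2 to 6 of this sequence can be sketched with a small pure-Python state-vector simulation. This is a simplified illustration and not the implementation used in this work. The following are assumptions: the function names are hypothetical, the input is taken to be already reduced to one feature per qubit (step 1), angle encoding is used, the trainable single-qubit unitaries are R_y rotations, the nearest-neighbor entanglers are CNOT gates, and the measurement is the Pauli-Z expectation value of each qubit.

```python
import math

def apply_ry(state, theta, qubit, n):
    """Apply RY(theta) to one qubit of a real n-qubit state (qubit 0 = MSB)."""
    c, s = math.cos(theta / 2.0), math.sin(theta / 2.0)
    bit = 1 << (n - 1 - qubit)
    out = list(state)
    for i in range(len(state)):
        if i & bit == 0:
            a0, a1 = state[i], state[i | bit]
            out[i] = c * a0 - s * a1
            out[i | bit] = s * a0 + c * a1
    return out

def apply_cnot(state, control, target, n):
    """Apply CNOT by swapping amplitudes whose control bit is 1."""
    cbit, tbit = 1 << (n - 1 - control), 1 << (n - 1 - target)
    out = list(state)
    for i in range(len(state)):
        if i & cbit and not i & tbit:
            out[i], out[i | tbit] = state[i | tbit], state[i]
    return out

def quantum_layer(x, weights, entangle):
    """Return <Z> of every qubit; weights has shape (depth m, width n)."""
    n = len(x)
    state = [1.0] + [0.0] * (2 ** n - 1)            # step 2: |0...0>
    for q in range(n):                              # step 3: angle encoding
        state = apply_ry(state, x[q], q, n)
    for layer in weights:                           # step 4: m ansatz layers
        for q in range(n):
            state = apply_ry(state, layer[q], q, n)
        if entangle:                                # nearest-neighbor CNOTs
            for q in range(n - 1):
                state = apply_cnot(state, q, q + 1, n)
    expvals = []
    for q in range(n):                              # step 5: measure <Z_q>
        bit = 1 << (n - 1 - q)
        expvals.append(sum(-a * a if i & bit else a * a
                           for i, a in enumerate(state)))
    return expvals

def dense(expvals, W, b):
    """Step 6: classical dense layer producing one logit per class."""
    return [sum(w * e for w, e in zip(row, expvals)) + bias
            for row, bias in zip(W, b)]

unent = quantum_layer([0.3, 0.7], [[0.1, 0.2], [0.4, 0.5]], entangle=False)
ent = quantum_layer([math.pi, 0.0], [[0.0, 0.0]], entangle=True)
logits = dense(unent, W=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], b=[0.0, 0.0, 0.0])
```

With `entangle=False`, consecutive R_y rotations on a qubit simply add their angles, so each expectation value is the cosine of the encoded feature plus the sum of that qubit's trainable angles; the entangled variant couples the qubits through the CNOT chain.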
Our analysis focuses on the data encoding and the quantum layers (their expressibility and entanglement inclusion/removal), which are the main components of QNNs. For data encoding, we use two frequently used techniques, namely amplitude and angle encoding. For ansatz expressibility, we vary the width (n) and depth (m) of the quantum layers and train our HQNNs with the underlying ansatz for the considered combinations of n and m. We consider two similar ansatz structures: the entangled ansatz (which contains single-qubit parameterized unitaries and nearest-neighbor entanglement) and the unentangled ansatz (which contains single-qubit parameterized unitaries only). We first benchmark the mean accuracy of the training experiments and demonstrate the existence of BP in HQNNs. We observe that the BP in HQNNs does not follow a direct relation with the number of qubits but depends on the overall expressibility of the quantum layers. We then benchmark the overall accuracy and loss convergence of HQNNs and perform a comprehensive trainability vs. expressibility analysis. This analysis shows how the ansatz expressibility affects the overall performance of HQNNs from the aspect of BP, and how deep an ansatz can be for a given width before experiencing BP, for each encoding.
Furthermore, we observe that entanglement plays a role in the training landscapes of HQNNs and that its effect depends on the encoding type. When the data is encoded in the qubit state vector, the entangled ansatz achieves better accuracy than the unentangled ansatz, demonstrating a positive impact of entanglement on the trainability of HQNNs in the case of amplitude encoding. On the contrary, when the data is encoded into qubit rotation angles, the unentangled ansatz yields better accuracy than the entangled ansatz, demonstrating a negative impact of entanglement on the trainability of HQNNs. We also briefly evaluate the HQNNs for classification applications considering other important evaluation metrics for classification problems, namely precision, recall and F1-score. Finally, we illustrate the significance of our proposed framework by providing recommendations for different constraint scenarios (both alone and combined) on data encoding, ansatz expressibility and entanglement inclusion/removal in the underlying quantum layers.

Figure 4: Quantum layer ansatz structure with entanglement. The light pink highlighted region is the data encoding part, where either the QubitStateVector or R_y(θ) rotation gates are used depending on the encoding technique: the former in the case of amplitude encoding and the latter in the case of angle encoding. The green shaded area is the actual quantum ansatz used in training. The parameterized rotation unitaries in yellow boxes are trainable quantum parameters and the vertical blue bars represent two-qubit unitaries.

Figure 5: Quantum layer structure without entanglement. The light pink and gray highlighted regions represent the same as in Figure 4.

Figure 6: List of experiments performed in this work for different n and m.

Figure 7: Entangled Ansatz: Mean accuracy for different m and n with amplitude embedding.

Figure 8: Entangled Ansatz: Mean accuracy for different m and n with angle encoding.

Figure 11: Entangled Ansatz: Accuracy trends for all n and m for amplitude encoding.

Figure 13: Entangled Ansatz: Accuracy trends for all n and m for angle encoding.

Figure 14: Entangled Ansatz: Loss trends for all n and m for angle encoding.

Figure 16: Unentangled Ansatz: Loss trends for all n and m for amplitude encoding.

Figure 17: Unentangled Ansatz: Accuracy trends for all n and m for angle encoding.

Figure 18: Unentangled Ansatz: Loss trends for all n and m for angle encoding.

Figure 20: Performance comparison of both ansatz structures for angle encoding.

Figure 21: Precision, Recall and F1-score for fixed n and variable m for amplitude encoding with entangled ansatz.

Figure 22: Precision, Recall and F1-score for fixed n and variable m for angle encoding with unentangled ansatz.

Table A10: Recommendations when there is a constraint on entanglement and ansatz depth.