Accelerating AI performance with the incorporation of TVM and MediaTek NeuroPilot

The continuing prominence of machine learning has led to an increased focus on enhancing the inference performance of edge devices to reduce latency and improve efficiency. Two widely adopted strategies for accelerating computational performance are quantisation and the utilisation of AI hardware accelerators. Each type of accelerator or inference engine offers distinct advantages, with accelerators primarily designed to optimise neural network operations. In this paper, we present an innovative method for integrating TVM's quantisation flow with the MediaTek NeuroPilot AI accelerator. We outline the process of converting the TVM relay intermediate-representation quantised neural network dialect model to a tensor-oriented quantisation format, with the aim of harnessing the full potential of both TVM and MediaTek NeuroPilot. This integration enables more efficient neural network inference while preserving the accuracy of the results. We assessed the effectiveness of our proposed integration by conducting a series of experiments and comparing the performance of our approach with that of TVM equipped with an autotuning mechanism. The findings indicate that our approach substantially outperforms TVM in both floating-point model inference and quantised model inference, with inference speedups of up to 11× and up to 70×, respectively. These results underscore the potential of our approach in accelerating AI performance across a diverse range of applications and edge devices. Moreover, a key contribution of our work is providing a valuable practical method for other hardware companies interested in integrating TVM with their own accelerators to achieve performance gains.


Introduction
The rapid development of machine learning has led to the emergence of numerous technologies aimed at enhancing recognition accuracy to support complex applications and improve the quality of human lives. Most applications developed during the early stages of machine learning relied on powerful servers to improve performance due to the computationally intensive nature of machine learning applications. AI inference was primarily accessible through cloud servers via a network. However, recent advancements in computing power now enable computations to be performed on edge devices, significantly reducing latency and improving user experiences. These advancements have enabled the deployment of applications such as real-time cross-language translation, increasingly human-like voice assistants, and sophisticated image recognition systems directly on edge devices. Despite these improvements, the primary challenge for AI chip companies and developers is to enable low-latency machine learning on edge devices, such as smartphones, which generally lack the CPU and GPU power of servers. Enabling low-latency machine learning on edge devices requires innovative approaches that can harness the relatively low computational resources available on such devices. The efficient utilisation of AI accelerators, optimised software solutions, and seamless integration of hardware and software components are essential to achieving this goal. In addition, it is crucial to ensure that these approaches not only provide significant performance improvements but also maintain model accuracy and reliability. Furthermore, the development of a standardised and flexible framework for integrating various AI accelerators and software optimisations can significantly streamline the process of deploying machine learning models on edge devices. Such a framework would enable a wider adoption of AI technologies in diverse domains, ranging from smart home applications to autonomous vehicles and beyond.
CONTACT Chao-Lin Lee clli@pllab.cs.nthu.edu.tw
To address this challenge, two common methods for accelerating inference have emerged:
• Hardware solution: AI accelerators such as the MediaTek AI Processing Unit (APU) and Google Tensor Processing Unit, which are designed for neural network processing, offer much higher performance than general-purpose CPUs since these accelerators are specifically optimised for the computation patterns and data structures commonly found in neural networks.
• Software solution: Quantisation techniques use reduced-precision integer representations to reduce the memory footprint and computational complexity, which increases throughput and reduces memory consumption with only a slight trade-off in accuracy.
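To make the idea of a reduced-precision integer representation concrete, the following minimal sketch applies the common affine (scale and zero-point) scheme to map floating-point values to int8 codes and back. The helper names `quantize` and `dequantize` and the chosen parameters are illustrative assumptions and are not part of TVM or NeuroPilot.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an int8 code: q = clamp(round(x / scale) + zero_point)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximation of the real value: x ~ scale * (q - zero_point)."""
    return scale * (q - zero_point)


scale, zero_point = 0.05, 0        # illustrative quantisation parameters
values = [0.0, 1.0, -2.5, 10.0]    # 10.0 lies outside the representable range
codes = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in codes]
```

Values inside the representable range are recovered to within half a quantisation step, whereas the out-of-range value saturates at the largest int8 code; this is the accuracy trade-off that the choice of scale and zero point controls.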
Motivated by the need for more efficient, low-latency solutions, we have explored the potential benefits of integrating TVM's quantisation flow with MediaTek NeuroPilot, an AI accelerator optimised for edge devices. By combining the power of TVM's loop-level scheduling, compiler optimisation, and quantisation capabilities with the specialised processing capabilities of MediaTek NeuroPilot, our aim was to achieve significant performance improvements in neural network inference while maintaining model accuracy.
In this paper, we take advantage of the Bring Your Own Codegen (BYOC) mechanism provided by TVM (T. Chen et al., 2018), which makes it easier for developers to establish a bridge between the TVM runtime and the external runtimes for GPUs or AI accelerators. Previous research (Lai et al., 2020) employed the TVM BYOC mechanism to engage an AI accelerator for inference via the Android Neural Networks API (NNAPI) on mobile devices. In contrast, our approach deploys MediaTek's Neuron Compiler on the APU in the System on Chip (SoC) architecture and integrates the quantised neural network (QNN) flow. The APU is MediaTek's proprietary AI processing unit, specifically designed to enhance the performance of AI applications on MediaTek-powered devices. The primary contributions of this paper are as follows:
• Designing an external compiler capable of generating the Neuron intermediate representation (NIR) from offloaded relay submodules, thereby facilitating seamless integration of the APU with TVM's existing infrastructure.
• Proposing an innovative method for converting operator-oriented quantised representations to tensor-oriented quantised representations, which enhances the efficient utilisation of hardware accelerators.
• Enabling developers to harness the APU in the SoC architecture to achieve acceleration without additional effort, streamlining the development process and expediting time-to-market for AI applications on edge devices.
These contributions provide a comprehensive understanding of the method for integrating the TVM runtime and external runtime. We performed experiments to evaluate well-known deep neural network (DNN) models to determine the effectiveness of our proposed methods in allowing MediaTek Neuron to utilise TVM to not only support a larger number of DNN models but also improve performance. This work serves as a stepping stone toward more efficient and versatile solutions for accelerating inference performed in edge devices across a wide range of applications.
The remaining sections of this paper are organised as follows: Section 2 describes the background of our research, including TVM, MediaTek Neuron, and quantisation schemes. Section 3 presents a high-level view of our work and the detailed implementation of TVMNIRCompiler. Section 4 introduces the TVM QNN dialect and the conversion to the Neuron quantised format. Section 5 shows the experimental results of this work. Related work and a discussion are presented in Section 6, with conclusions drawn in Section 7.

Background
This paper focuses on integrating TVM, MediaTek NeuroPilot, and quantisation techniques to accelerate AI performance on edge devices. To provide a better understanding, background information is provided on each of these components, including the TVM BYOC mechanism.

TVM
Apache TVM is a comprehensive, open-source compiler framework for machine learning that supports a wide range of front ends, such as ONNX (Bai et al., 2019), TensorFlow (Abadi et al., 2016), TFLite (Tensorflow Lite: Ml for Mobile & Edge Devices, n.d.), and PyTorch (Paszke et al., 2019). The framework generates binary archives that are executable on multiple back ends, including x86 and Android platforms. Furthermore, TVM enables heterogeneous computing, which leverages various hardware components, such as CPUs, GPUs, DSPs, and DLAs, to improve performance. A deep learning accelerator (DLA) is specialised hardware for accelerating deep learning computations. MediaTek's NeuroPilot, the platform used in this paper, supports AI processing on MediaTek devices. The platform is integrated with an AI Processing Unit (APU), which is a type of DLA. To effectively optimise these diverse systems, TVM offers two levels of unified intermediate representation (IR): relay IR (graph-level IR) and tensor-level IR. TVM employs a distinct parser for each front end and converts the parsed models to relay. As a graph-level IR for machine learning systems, relay delineates the data flow among machine learning operators such as convolution, max pooling, and ReLU. Listing 1 provides an example text form of relay, and Figure 1 shows a schematic graph of the relay abstract syntax tree (AST), both of which are parts of MobileNet_v1. The relay IR in Listing 1 represents a segment of the MobileNet_v1 model. It begins with a 3-channel, 224 × 224 input tensor, which is passed through a 2D convolution layer (nn.conv2d) using 32 filters of size 3 × 3, a stride of 2 × 2, and 1-pixel padding. The feature maps are then normalised (nn.batch_norm) and passed through a ReLU6 activation function (clip), which caps values between 0 and 6. This pattern of convolution, normalisation, and activation is a key part of MobileNet_v1's architecture. The figure indicates that each relay node encompasses inputs, outputs, types (e.g. data type and shape), and operator attributes (e.g. stride, dilation, and data layout).
def main(%x: Tensor[(1, 3, 224, 224), float32], %const0: Tensor[(32, 3, 3, 3) …
Figure 1 illustrates the various types of relay nodes and their respective roles within the relay abstract syntax tree (AST). Each node in Figure 1 represents a distinct element in the data flow of a machine learning model. These nodes collectively form the relay graph, which is subsequently processed and optimised by TVM to generate efficient code for various target platforms and accelerators.

TVM BYOC mechanism
With the aim of reducing the latency of machine learning applications, each AI chip vendor designs its own AI accelerator, using ASICs to accelerate the supported AI operators. One of the key features of TVM is the BYOC mechanism, which enables the integration of the TVM runtime with external runtimes, such as AI accelerators. The BYOC mechanism allows developers to offload specific portions of a model's computation graph to an external code generator or runtime, thereby optimising the execution for a particular hardware back end.
The BYOC mechanism is based on the relay pass infrastructure and allows developers to design the following three components:
• Partition/annotation method: In this stage, developers need to annotate which operators or patterns will be offloaded to the external compiler. Generally, the attributes and arguments of the relay operators are examined to determine whether the AI accelerator supports them.
• External compiler: In consideration of their AI accelerators, developers must determine how to build the offloaded relay module into the accelerator-executable archive.
• External runtime: To support coinference, developers need to decide how to invoke the external runtime and handle memory movement between TVM and the external accelerator.
Figure 2 depicts the architecture of the BYOC mechanism. First, the TVM front end converts the models of different formats to the equivalent relay IR. Second, the relay IR goes through the partitioner, external compiler, and external runtime as mentioned above. Last, coinference is performed with the AI accelerator.
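The partition/annotation step can be pictured with a small, self-contained sketch that does not use the TVM API. It walks a linear chain of operator names and groups consecutive operators by whether a hypothetical accelerator supports them; the support list `SUPPORTED_OPS` and the function `partition` are assumptions made purely for illustration.

```python
# Hypothetical list of operators that the external accelerator can execute.
SUPPORTED_OPS = {"nn.conv2d", "nn.relu", "nn.max_pool2d"}


def partition(ops):
    """Group consecutive operators of a linear graph into (target, [ops]) regions."""
    regions = []
    for op in ops:
        target = "accelerator" if op in SUPPORTED_OPS else "tvm"
        if regions and regions[-1][0] == target:
            regions[-1][1].append(op)       # extend the current region
        else:
            regions.append((target, [op]))  # start a new region
    return regions


model = ["nn.conv2d", "nn.relu", "nn.max_pool2d", "nn.softmax"]
regions = partition(model)
```

In this toy model, the first three operators form one offloaded subgraph, while the trailing softmax falls back to TVM; every boundary between regions implies data movement between the two runtimes during coinference.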

MediaTek NeuroPilot
NeuroPilot is MediaTek's AI technology platform, which includes a dedicated APU optimised for neural network processing.

Intermediate representation
In this section, we focus on two distinct intermediate representations (IRs): MediaTek's Neuron IR and TVM's relay IR, each boasting specific characteristics advantageous to certain tasks.
Neuron IR, which was developed by MediaTek, embodies a tensor-oriented approach to quantisation. Each tensor carries unique scale and zero point parameters, introducing the flexibility to assign individual tensors different quantisation parameters. This granularity potentially enhances precision control on a per-tensor basis. Neuron IR is also designed to seamlessly interface with MediaTek's hardware accelerators, facilitating optimal model deployment on MediaTek devices.
Conversely, TVM's relay IR adopts an operator-oriented approach to quantisation. Here, quantisation parameters are associated with operators rather than with individual tensors. Each operator in the computation graph contains information on how it alters its input and output data in the quantised domain. This setup enables an in-depth understanding of quantisation impacts on each operation's results, leading to operator-level precision control.
A significant portion of this paper is devoted to the transition between these two distinct IRs: relay IR and Neuron IR. Our developed method transforms the operator-oriented quantisation scheme in relay IR into the tensor-oriented scheme in Neuron IR. This pivotal contribution ensures correct translation and application of quantisation parameters, enabling the practical integration of TVM with MediaTek's hardware accelerators.

Utilizing MediaTek Neuron with TVM BYOC
Section 2.1.1 briefly introduces the TVM BYOC mechanism. In this section, we focus on how we implemented our external compiler to compile the offloaded relay submodules into the equivalent NIR.

Overview of the external compiler's architecture
The architecture of our external compiler, as depicted in Figure 3, includes two distinct pass-like structures, QuantParamsMaker and TVMNIRCompiler. The operation of our external compiler is segmented into the following four main steps:
(1) The offloaded relay graph is processed, and the quantisation parameters for each tensor of the Neuron IR are calculated using the QuantParamsMaker module. This module generates a QuantParamsMap that stores these parameters.
(2) The TVMNIRCompiler is utilised to convert relay AST nodes to their corresponding NIR types.
(3) Each relay OP is mapped to the Neuron IR in the TVMNIRCompiler, which creates an NIR graph.
(4) A compiled deep learning archive is generated for the APU with the Neuron compiler.
In this process, the initial step specifically enables quantisation when necessary, with the QuantParamsMaker module calculating and storing the quantisation parameters for each tensor of the Neuron IR. The subsequent steps include converting relay AST nodes to NIR types via TVMNIRCompiler, mapping each relay OP to the Neuron IR using the visit_call function within TVMNIRCompiler, and generating a compiled deep learning archive for the APU. This flow highlights the flexibility of our external compiler's architecture, which can handle different quantisation configurations and requirements. The following sections provide more detailed information about each module in these flows and their specific roles in the compilation process.
The relay IR in TVM is an AST structure designed for the optimisation and compilation of deep learning models. TVM provides a structure named ExprVisitor to traverse and analyse the relay AST. As a fundamental component within TVM's infrastructure, ExprVisitor is employed in various optimisation passes, including the initial three steps of the proposed approach in Figure 3. The ExprVisitor class facilitates navigation and processing of the relay AST, enabling the implementation of custom transformations and operations for optimising the deep learning model across different hardware targets.
Listing 2 presents a simplified definition of the ExprVisitor class. The visit method, denoted as 'def visit(self, expr)', on lines 3-9 processes different types of nodes in the AST and directs node operations to the functions visit_function, visit_call, or visit_var for function, function call, or variable nodes, respectively. Line 3 indicates that the visit method examines the type of the current expression (expr) and then invokes the corresponding visit method. Subsequent calls to the visit functions traverse the entire graph based on the type of expression. For example, visit_function on line 21 initially visits each parameter before proceeding to the function body. The visit_tuple method on line 12 explores each field of a tuple node. The visit_call method on line 16 examines the operator and its arguments. In this representation, the ExprVisitor class is designed to traverse and process Expr instances in a structured manner, enabling efficient navigation and analysis of the relay AST for optimising the deep learning model.
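The dispatch pattern described above can be sketched in a few lines of standalone Python. This is not the TVM class itself and its line numbers do not correspond to Listing 2: the node types `Var`, `Call`, `TupleExpr`, and `Function` and the class `SimpleExprVisitor` are simplified stand-ins defined here for illustration, and the sketch omits the memoisation that a production visitor would typically use to avoid revisiting shared nodes.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Var:
    name: str


@dataclass
class Call:
    op: str
    args: List[Any]


@dataclass
class TupleExpr:
    fields: List[Any]


@dataclass
class Function:
    params: List[Var]
    body: Any


class SimpleExprVisitor:
    """Dispatches on the node type and recurses into child expressions."""

    def __init__(self):
        self.trace = []  # names recorded in visiting order

    def visit(self, expr):
        if isinstance(expr, Function):
            return self.visit_function(expr)
        if isinstance(expr, Call):
            return self.visit_call(expr)
        if isinstance(expr, TupleExpr):
            return self.visit_tuple(expr)
        if isinstance(expr, Var):
            return self.visit_var(expr)
        raise TypeError(f"unsupported node: {type(expr).__name__}")

    def visit_function(self, fn):
        for param in fn.params:  # parameters first, then the body
            self.visit(param)
        self.visit(fn.body)

    def visit_call(self, call):
        self.trace.append(call.op)  # examine the operator, then its arguments
        for arg in call.args:
            self.visit(arg)

    def visit_tuple(self, tup):
        for item in tup.fields:
            self.visit(item)

    def visit_var(self, var):
        self.trace.append(var.name)


x, w = Var("x"), Var("w")
body = Call("clip", [Call("nn.conv2d", [x, w])])
visitor = SimpleExprVisitor()
visitor.visit(Function([x, w], body))
```

A custom pass would subclass such a visitor and override only the methods for the node types it cares about, which is the role visit_call plays in the compiler described in the following sections.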

TVMNIRCompiler
TVMNIRCompiler, built upon ExprVisitor, is specifically designed to transform offloaded relay modules into NIR modules. TVMNIRCompiler performs two main functions: (1) converting the TVM data structure to a Neuron-compatible format and (2) mapping relay operations to Neuron operation layers. These functions correspond to the second and third steps described in Section 3.1.
Algorithm 1 lists the proposed TVMNIRCompiler, which converts a relay module to an NIR module using a postorder depth-first search (DFS) traversal of the relay AST. The TVMNIRCompiler class is inherited from the ExprVisitor class and processes each node by overriding the VisitExpr method based on the node type, such as VarNode, ConstantNode, TupleNode, or CallNode. This step involves creating a NodeEntry object, converting the node to a Neuron-compatible format, updating the NodeEntry's inputs and outputs, and inserting the NodeEntry into the nodeDict dictionary. The NodeEntry class is utilised to store metadata pertaining to the current AST node. For the TupleNode and CallNode types, the algorithm processes their child nodes and updates the inputs and outputs accordingly. For CallNode types, a Neuron operation is created using opTable, and nodeDict is updated with the resulting NodeEntry object. The algorithm's input is a relay module, and the output is an NIR module. The conversion ensures that the TVM data structure is adapted to a Neuron-compatible format and maps relay operations to Neuron operation layers. The data structures are mapped as follows:
• tvm::DataType is the structure that TVM uses to represent the element type; thus, it maps to neuron::nir::DataType.
• tvm::TensorTypeNode contains the data type and shape of the tensor, and the Neuron equivalent is neuron::nir::Shape.
• tvm::relay::VarNode is an input to the relay AST, so it is mapped to neuron::nir::Input. In Algorithm 1, the equivalent Neuron input is created based on the data type and shape of VarNode.
• tvm::relay::ConstantNode describes a constant of the relay AST, so it is mapped to neuron::nir::Constant. In Algorithm 1, the equivalent Neuron constant is constructed based on the data type, shape, and actual constant data.
• tvm::relay::CallNode indicates that the node calls machine learning operators or relay functions (composed of patterns), and the Neuron equivalent is neuron::nir::Layer. The mapping of each relay operation to Neuron is described in Section 3.2.2.
• tvm::relay::TupleNode is a tuple of tensors whose fields include VarNode, ConstantNode, CallNode, etc. However, since there is no corresponding structure in Neuron, the tuple node is flattened (as described in Algorithm 1); that is, each tuple field is visited, and then the output of each field is put into NodeEntry.
• tvm::runtime::Array usually serves as an operator attribute that represents a set of elements, so it maps to different Neuron types; for example, the stride and dilation of conv2d map to NNSize, and the padding maps to NNPadding.
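The postorder conversion and the tuple flattening described above can be illustrated with a standalone sketch. It does not use TVM or Neuron types: the node classes, `NodeEntry`, and `NirCompilerSketch` are simplified assumptions, and Neuron inputs, constants, and layers are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Var:
    name: str


@dataclass
class Const:
    value: float


@dataclass
class TupleExpr:
    fields: List[Any]


@dataclass
class Call:
    op: str
    args: List[Any]


@dataclass
class NodeEntry:
    """Metadata kept for each visited AST node."""
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


class NirCompilerSketch:
    def __init__(self):
        self.node_dict = {}  # id(expr) -> NodeEntry, so shared nodes are converted once
        self.layers = []     # (operator name, input names), emitted in postorder

    def visit(self, expr):
        key = id(expr)
        if key in self.node_dict:
            return self.node_dict[key]
        entry = NodeEntry()
        if isinstance(expr, Var):
            entry.outputs = [f"input:{expr.name}"]
        elif isinstance(expr, Const):
            entry.outputs = [f"const:{expr.value}"]
        elif isinstance(expr, TupleExpr):
            for item in expr.fields:  # flatten: collect the outputs of every field
                entry.outputs += self.visit(item).outputs
        elif isinstance(expr, Call):
            for arg in expr.args:     # children first (postorder)
                entry.inputs += self.visit(arg).outputs
            entry.outputs = [f"layer{len(self.layers)}:{expr.op}"]
            self.layers.append((expr.op, list(entry.inputs)))
        else:
            raise TypeError(f"unsupported node: {type(expr).__name__}")
        self.node_dict[key] = entry
        return entry


x = Var("x")
branches = TupleExpr([Call("nn.relu", [x]), Call("nn.max_pool2d", [x])])
root = Call("concatenate", [branches])
compiler = NirCompilerSketch()
result = compiler.visit(root)
```

Because the tuple contributes only the outputs of its fields, the concatenate layer receives the two branch outputs directly, and no tuple object appears in the emitted graph.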

Map relay operator to NIR layer
In TVMNIRCompiler, when visiting CallNode, it is necessary to map the relay operator to the corresponding Neuron operator. Listing 4 is a more complex example of converting Conv2d. The first step is to extend OpSetup and override CreateOperation. Lines 3 to 11 take the required information from the operator attributes and convert it to the types used by Neuron. The class subsequently checks the conditions on lines 13 to 17 to ensure the correct layout of the Neuron operator. In TVM, the nn.conv2d operator may map to different Neuron operators depending on specific attributes; for example, on line 20, it could map to Neuron's Conv2D, DepthwiseConv2D, or GroupConv2D. Consequently, it is essential to perform conditional checks based on these attributes to ensure that the appropriate Neuron operator is selected. The conv2d operator thus serves as a detailed example of how the mapping process accounts for varying attributes, allowing for accurate and efficient conversion between TVM operators and Neuron operators.
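As a rough illustration of such a conditional check, the sketch below chooses among the three Neuron convolution variants using the `groups` attribute and the number of input channels. The exact conditions used in Listing 4 are not reproduced here; the rule shown (one group for a regular convolution, one group per input channel for a depthwise convolution, anything else treated as a grouped convolution) is the common convention and should be read as an assumption, as should the function name `select_conv_layer`.

```python
def select_conv_layer(groups, in_channels):
    """Pick a convolution variant from the conv2d attributes (assumed rule)."""
    if groups < 1 or in_channels % groups != 0:
        raise ValueError("in_channels must be divisible by a positive groups value")
    if groups == 1:
        return "Conv2D"
    if groups == in_channels:
        return "DepthwiseConv2D"
    return "GroupConv2D"


first_layer = select_conv_layer(groups=1, in_channels=3)       # e.g. a network's first conv
depthwise_layer = select_conv_layer(groups=32, in_channels=32)
grouped_layer = select_conv_layer(groups=4, in_channels=32)
```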

Utilizing the MediaTek Neuron with TVM QNN
TVMNIRCompiler, which was introduced in Section 3, can be used to integrate any nonquantised relay modules into MediaTek Neuron. However, the quantised model is also one of the main points of this work. Section 4.1 compares the quantised representation formats of TVM and MediaTek Neuron. Section 4.2 describes how to convert the QNN dialect and implement QuantParamsMaker, which retrieves the required quantisation information from the relay and passes it to TVMNIRCompiler. Section 4.3 illustrates how pattern matching overcomes the inability to perform one-to-one mapping.

Differences in quantization representation formats
Different machine learning frameworks use different methods to represent quantisation information and can generally be categorised into tensor-oriented and operator-oriented formats. Figure 4 shows quantised operator-oriented representation formats. Quantised operators in this format have their own specific declarations, and zeropoint and scale are additional inputs for the quantised version of the operator. For example, the operator for two-dimensional convolution in TVM is nn.conv2d, while its quantised version is qnn.conv2d, as shown in Listing 5. This TVM relay IR node represents a quantised 2D convolution operation. This node uses input and weight tensors, their respective zero points and scales on lines 2 to 3, and convolution parameters such as strides, padding, channels, and kernel_size on lines 4 to 6. The out_dtype parameter on line 6 sets the output data type to int32. The resulting tensor of shape (1, 112, 112, 32) signifies the convolution output. It can be seen that qnn.conv2d carries scale and zeropoint information in its input arguments.
The tensor-oriented quantised representation formats are shown in Figure 5, which indicates that the quantised and nonquantised operators use the same declarations. Furthermore, the quantisation parameters are carried as attributes of each tensor. Frameworks such as TFLite and MediaTek Neuron use tensor-oriented formats. Listing 6 is a quantised two-dimensional convolution in NIR, and it shows that each input, constant, and operator output carries scale and zeropoint (also referred to as the offset in Neuron). The layer has three inputs: an image tensor (Input:0), a weight tensor (Constant), and a bias tensor (Constant). Each input carries its own scale and offset (or zero point), defining the quantisation parameters on lines 3 to 12. The operation results in an output tensor, again with its scale and offset on lines 13 to 16. Convolution parameters such as stride and padding are also provided (lines 17 to 18). In addition, although it is a quantised operator, it still uses the same declarations as the nonquantised operator.
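The difference between the two formats can be sketched with plain Python data structures. The dictionary layout, the key names, the numeric values, and the helper `to_tensor_oriented` are all illustrative assumptions; they mirror neither the relay text format nor the NIR format exactly.

```python
from dataclasses import dataclass


@dataclass
class QuantParams:
    scale: float
    zero_point: int


# Operator-oriented: the quantised operator has its own name and receives
# the scales and zero points as additional inputs.
op_oriented = {
    "op": "qnn.conv2d",
    "inputs": {"data": "%x", "weight": "%w"},
    "quant_inputs": {
        "input_scale": 0.02, "input_zero_point": 128,
        "kernel_scale": 0.005, "kernel_zero_point": 0,
    },
}


def to_tensor_oriented(call):
    """Attach the parameters to the tensors and keep the plain operator name."""
    q = call["quant_inputs"]
    tensors = {
        call["inputs"]["data"]: QuantParams(q["input_scale"], q["input_zero_point"]),
        call["inputs"]["weight"]: QuantParams(q["kernel_scale"], q["kernel_zero_point"]),
    }
    plain_op = call["op"].split(".", 1)[1]  # "qnn.conv2d" -> "conv2d"
    return {"op": plain_op, "tensors": tensors}


tensor_oriented = to_tensor_oriented(op_oriented)
```

Note that this covers only the tensors whose parameters appear at the operator's inputs; tensors produced or consumed by non-QNN operators need parameters from elsewhere, which is the gap that QuantParamsMaker fills.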

QuantParamsMaker
Because the QNN dialect of relay has the operator-oriented format, quantisation parameters appear only at the inputs of operators with the qnn prefix. However, Neuron has a tensor-oriented format, which means that its quantisation parameters must be carried on all tensors. This problem is addressed by using QuantParamsMaker to convert between the two formats. Similar to TVMNIRCompiler, QuantParamsMaker is also based on ExprVisitor, but it traverses the relay AST using preorder DFS.
Algorithm 2 is designed to convert between the operator-oriented quantisation parameter format and the tensor-oriented quantisation parameter format, using an IR module and an optional entryFunc string as input and returning a QuantParamsMap as output. The core functionality is encapsulated in the QuantParamsMaker class, which is designed to traverse relay functions and create a QuantParamsMap.
QuantParamsMaker::Create serves as the entry point for the QuantParamsMaker class. The algorithm employs a preorder DFS traversal strategy. When the VisitExpr function is called in the Create function with expr as the input, the traversal of the relay expressions begins, which subsequently returns a QuantParamsMap as output. The QuantParamsMaker class is inherited from the ExprVisitor class, allowing for efficient traversal of relay expressions using the VisitExpr function. When a CallNode is encountered during the traversal process, the VisitExpr function is executed. This function processes the calculation of QuantParams and ensures support for the operator before accessing the function parameters.
Algorithm 2 demonstrates the essential components and steps of the QuantParamsMaker class, highlighting its effectiveness in converting between different quantisation parameter formats. The algorithm processes the quantisation parameters, which are crucial for handling various operator types and ensuring compatibility with a wide range of AI models. The ExprVisitor class plays a significant role in verifying operator support and enabling the efficient traversal of relay expressions.
To compute QuantParams for various operators, we reuse the OpSetup architecture employed by TVMNIRCompiler and introduce a new virtual function named SetupQuantParam. The implementation of SetupQuantParam is demonstrated in Algorithm 3. Notably, even when the model is prequantised, there still exist numerous general non-QNN operators. Our approach addresses this by directly and recursively propagating the output quantisation parameters to the inputs when encountering a non-QNN operation. BinaryOpSetup::SetupQuantParam is specifically designed to forward the output QuantParams to both inputs.
Algorithm 3 checks whether the output QuantParams for the current expression exist in the quantisation map (QuantMap). If they are not present in the QuantMap, the algorithm initialises the QuantParams for the expression as empty (NirQuantParams). If the output QuantParams for the current expression are found in the QuantMap, the algorithm checks the QuantParams for both input nodes. For each input node, if its QuantParams are not in the QuantMap, the algorithm assigns the output QuantParams of the current expression to the corresponding input node. This step ensures that the output QuantParams are correctly forwarded to both input nodes, enabling the quantisation process to work seamlessly across various operator types.
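The propagation rule for a non-QNN binary operator can be sketched as follows. This is a simplified reading of the description above rather than a transcription of Algorithm 3: expressions are identified by string names, QuantParams are plain (scale, zero point) tuples, an empty entry is represented by `None`, and the function name `setup_binary_quant_params` is hypothetical.

```python
def setup_binary_quant_params(expr, lhs, rhs, quant_map):
    """Forward the output parameters of a binary op to inputs that lack their own."""
    if expr not in quant_map:
        quant_map[expr] = None  # no known output parameters: record an empty entry
        return
    output_params = quant_map[expr]
    for operand in (lhs, rhs):
        if operand not in quant_map:
            quant_map[operand] = output_params


# The output of "add" and its left operand already have parameters;
# the right operand inherits those of the output.
quant_map = {"add": (0.1, 3), "lhs": (0.2, 0)}
setup_binary_quant_params("add", "lhs", "rhs", quant_map)
```

Operands that already carry their own parameters are left untouched, so parameters taken from a neighbouring QNN operator take precedence over propagated ones.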
In the case of relay operations associated with the QNN dialect, such as qnn.requantize, the SetupQuantParam method is adapted to directly extract QuantParams from the given input, as demonstrated in Algorithm 4. This adaptation is achieved by initially assigning the indices for input scale, input zeropoint, output scale, and output zeropoint to their corresponding constexpr variables. The input and output QuantParams are subsequently computed based on these indices. The final step involves storing the input QuantParams within the QuantMap alongside the output QuantParams, effectively ensuring that the necessary information is readily available for subsequent processing.
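A minimal sketch of this index-based extraction is given below. The index constants assume the argument order (data, input scale, input zero point, output scale, output zero point); the call is modelled as a plain dictionary, and the function name and map layout are hypothetical rather than taken from Algorithm 4.

```python
# Assumed positions of the quantisation arguments; argument 0 is the data tensor.
INPUT_SCALE_IDX, INPUT_ZERO_POINT_IDX = 1, 2
OUTPUT_SCALE_IDX, OUTPUT_ZERO_POINT_IDX = 3, 4


def setup_requantize_quant_params(call, quant_map):
    """Read the input and output parameters directly from the call arguments."""
    args = call["args"]
    input_params = (args[INPUT_SCALE_IDX], args[INPUT_ZERO_POINT_IDX])
    output_params = (args[OUTPUT_SCALE_IDX], args[OUTPUT_ZERO_POINT_IDX])
    quant_map[args[0]] = input_params        # parameters of the data tensor
    quant_map[call["name"]] = output_params  # parameters of the call's result


requantize_call = {"name": "requant0", "args": ["conv_out", 0.5, 0, 0.25, 128]}
quant_map = {}
setup_requantize_quant_params(requantize_call, quant_map)
```

After this step, both the tensor feeding the operator and the tensor it produces have entries in the map, so a later pass can attach them to the corresponding tensors of the tensor-oriented graph.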

Performance evaluation and experimental results
We now describe the experiments that were performed to evaluate the performance of our method. First, the inference time was calculated using the TVM runtime module as input. Second, we cross-compiled it for the Android AArch64 platform using the Android Native Development Kit (NDK). Third, we used TVM and the NDK to compile the testing model into an Android-executable runtime module. Last, the built binary and module were pushed to the device using the Android Debug Bridge. The following sections describe the experimental configurations followed by performance evaluations of the TFLite and non-TFLite models. Notably, statistics are missing for the Neuron APU where the model contains operators that cannot run on that APU. In our experiments, we measure execution time and performance by using the RPCTimeEvaluator function, which is part of TVM's built-in utilities. To streamline this process, we abstracted this function into a standalone C++ component that executes and directly outputs the time results. These data are then transferred back to the host machine for analysis. This approach ensures an accurate and standardised measurement of the time taken for function execution across different setups and tests.

Experimental configurations
The experiments were conducted on an Android device with a MediaTek Dimensity 800 SoC; the hardware configuration is presented in Table 2. MediaTek's APU 3.0 represents an advancement in AI processing units and incorporates a six-core architecture that delivers a 10% performance increase compared to its predecessor. APU 3.0 can achieve superior performance and energy efficiency when executing multiple tasks in parallel. The AI processor unit in APU 3.0 offers 4 TOPS of AI computing power and supports high-resolution camera sensors of up to 32 MP and 30 fps (including an integrated real-time image signal processor, ISP, and MIPI-CSI interface). This design facilitates deep learning (DL), neural network (NN) acceleration, and computer vision (CV) applications. MediaTek's APU is compatible with various data formats, including INT8, INT16, and precision-focussed FP16. APU 3.0's capabilities enable it to support a range of AI camera features, such as AI bokeh and selective focus for single-camera configurations, making it an ideal solution for video selfie applications and enhancing the overall user experience. In the experiments, the terms TVM CPU and Neuron CPU refer to the same physical CPU hardware utilised by different frameworks. TVM CPU refers to running the operations using the TVM framework on the CPU, whereas Neuron CPU refers to running the operations using the MediaTek NeuroPilot framework on the same CPU. This comparison is aimed at presenting the efficiency and performance difference between these two frameworks when running on the same CPU hardware. The TVM employed in the experiments was based on version 0.9.0 Dev with NIR support, and the Neuron version was 6.0.0. TVM's AutoScheduler requires a substantial amount of time for autotuning, up to an hour. However, note that this is a one-time cost paid during compile time. Once the optimal configurations are obtained, they can be saved and reused, meaning that autotuning does not have to be performed again for subsequent deployments of the same model. Neuron does not use autotuning methods. Therefore, its compile time can be shorter than that of TVM's AutoScheduler. However, this paper mainly focuses on evaluating runtime performance, which is a recurring cost every time the model is executed, rather than the one-time cost of autotuning or compile time. In addition, our primary focus is demonstrating the integration of AI compilers and AI accelerators, as well as the fallback mechanism between them. Thus, we have chosen to present separate results for different precisions to more accurately reflect the specific performance improvements achieved in each case.

fp32 models
Figure 6 shows the speedup of fp32 models when the target was the CPU for native TVM, TVM with AutoScheduler (Zheng et al., 2020), TVM with BYOC to Neuron, and pure Neuron. For TVM with AutoScheduler, we set num_measure_trials to 1000 and performed tuning using the TVM RPC mechanism (which took 1 hour per model). The experimental results indicated a relatively large speedup by the AutoScheduler, which even outperformed Neuron for most of the models. Figure 7 shows the speedup of fp32 models when the target had an APU, for TVM with AutoScheduler, TVM with BYOC to Neuron (APU), TVM with BYOC to Neuron (APU+CPU), Neuron (APU), and Neuron (APU+CPU). From the experimental results, we reach the following conclusions:
• The proposed BYOC to Neuron can achieve speedups of 4× to 11× for FP32 models.
• In most of the experimental models, our method approaches the speed of Neuron alone.
• The performance of TVM with BYOC to Neuron (APU) in Mobilenet_v1 is lower than that of Neuron alone. This outcome is attributed to the APU's lack of support for the last softmax operator in the model. Because of this limitation, the operator must fall back to TVM, which introduces overhead associated with data movement between TVM and the APU. This data movement increases the overall execution time and accounts for the observed performance difference. Despite this drawback, our approach of integrating TVM with Neuron still offers valuable benefits in terms of flexibility and support for a broader range of models and operations.
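As a rough illustration of how a speedup relative to a baseline is computed, and of why a fallback to TVM can erode it, the sketch below uses a simple additive cost model: APU time, plus CPU time for the fallback operators, plus a fixed cost per transfer between TVM and the APU. The function names, the cost model, and all latency values are hypothetical; none of them are measurements from this paper.

```python
def speedup(baseline_ms: float, measured_ms: float) -> float:
    """Speedup of a configuration relative to a baseline (higher is faster)."""
    if baseline_ms <= 0 or measured_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / measured_ms


def byoc_latency_ms(apu_ms: float, fallback_ms: float,
                    transfers: int, transfer_ms: float) -> float:
    """Assumed additive model: accelerator time + fallback time + data movement."""
    return apu_ms + fallback_ms + transfers * transfer_ms


# Hypothetical numbers, chosen only to illustrate the effect.
baseline = 100.0                                     # baseline CPU latency
no_fallback = byoc_latency_ms(10.0, 0.0, 0, 0.0)     # whole model on the APU
with_fallback = byoc_latency_ms(10.0, 0.5, 2, 1.0)   # one trailing operator on the CPU

print(speedup(baseline, no_fallback))    # 10.0
print(speedup(baseline, with_fallback))  # 8.0
```

In this toy setting the fallback operator itself is cheap; most of the lost speedup comes from the two transfers, which mirrors the data-movement overhead described above.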

int8 models
Table 4 lists the MAC counts and the number of subgraphs generated by our work for the experimental TFLite int8 models. Figure 8 shows the performance speedups relative to TVM CPU for TVM with AutoTVM CPU, TVM with BYOC to the Neuron CPU, and the pure Neuron CPU. For these int8 model tests, we opted for AutoTVM over AutoScheduler. While both components of TVM are aimed at improving computational efficiency, they differ significantly. AutoTVM, an automated performance tuning system, seeks optimal operation implementations in a deep learning model and supports int8 optimisation. AutoScheduler, which is also known as Ansor, holistically schedules the entire computation graph, simplifying the optimisation process and boosting computation efficiency. However, AutoScheduler does not support int8 optimisation, so we excluded it from our int8 model experiments. Unfortunately, AutoTVM's XGBTuner has issues when the target is an ARM CPU (see footnotes 1 and 2). We therefore employed AutoTVM's RandomTuner, setting the number of trials to 5000 and tuning via the TVM RPC mechanism (which took approximately 5 hours per model). The experimental results indicate that native TVM and AutoTVM are significantly slower than Neuron for int8 models.
Figure 9 shows the performance speedups relative to using TVM with AutoTVM CPU for TVM with BYOC to the Neuron APU/CPU and the pure Neuron APU/CPU. Figure 10 is the same as Figure 9, except that the speedups are measured relative to BYOC to the Neuron CPU. From these experimental results, we draw the following conclusions:
• Compared with AutoTVM, APUs can achieve speedups of 40× to 80×. There were also speedups of 4× to 11× relative to BYOC to Neuron CPU.
• In ssd_mobilenet, because the model has multibox_transform_loc, non_max_suppression, get_valid_counts, and other operators that Neuron does not support, Neuron alone cannot run inference. Even so, TVM could still use the APU to achieve an 8× speedup using the BYOC mechanism.
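The ssd_mobilenet case can be pictured with a toy partitioner. The sketch below groups a linearised operator sequence into maximal runs that are either offloaded or kept on TVM, depending on a support list. It is a simplification made for illustration: real BYOC partitioning operates on the Relay dataflow graph, and the support list, the operator sequence, and the function name partition are all hypothetical.

```python
from typing import List, Set, Tuple


def partition(ops: List[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive operators into regions assigned to 'neuron' or 'tvm'."""
    regions: List[Tuple[str, List[str]]] = []
    for op in ops:
        target = "neuron" if op in supported else "tvm"
        if regions and regions[-1][0] == target:
            regions[-1][1].append(op)       # extend the current region
        else:
            regions.append((target, [op]))  # start a new region
    return regions


# Hypothetical support list and a simplified, linearised model.
NEURON_SUPPORTED = {"conv2d", "relu", "add", "concatenate"}
model = ["conv2d", "relu", "conv2d", "add",
         "multibox_transform_loc", "get_valid_counts", "non_max_suppression"]

regions = partition(model, NEURON_SUPPORTED)
for target, group in regions:
    print(target, group)
```

Here the convolutional backbone forms one offloaded region, while the detection post-processing operators stay on TVM, so the model remains runnable even though the accelerator cannot execute every operator.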

Non-TFLite models
The Neuron Compiler only supports TFLite models, so we also performed experiments using TVM as the front-end of Neuron. We verified the following models in different formats:
• MXNet (T. Chen et al., 2015): alexnet (Krizhevsky et al., 2012), densenet, inceptionV3, resnet18_v1 (He et al., 2016), resnet50_v1b (He et al., 2016), and vgg19 (Simonyan & Zisserman, 2014), taken from the GluonCV model zoo (Guo et al., 2020)
• PyTorch: DeePixBis (George & Marcel, 2019)
• Keras: emotion detection (Gulli & Pal, 2017)
Figure 11 shows the experimental speedups among TVM with AutoScheduler and TVM with BYOC to Neuron (CPU, APU, and APU+CPU). Our work has demonstrated that models that do not use the TFLite format can still benefit from APU acceleration, with speedups of up to 6×. For alexnet and emotion_detection, we found that the speedup of TVM alone was better than that of offloading to Neuron (CPU) only. Moreover, offloading to Neuron (APU) was also better than offloading to Neuron (APU+CPU), which means that our approach combines the advantages of both or even surpasses them.

Related work and discussion
We now discuss recent studies on integrating neural network APIs into AI compilers for edge devices and explore the collaboration between AI compilers and AI accelerators in optimising AI models.
Several recent studies have focussed on integrating neural network APIs into AI compilers for edge devices. For example, a proposed flow (Lai et al., 2020) integrates Android's NNAPI into TVM-generated inference models using a partitioning algorithm to determine which parts of the model should and should not be computed on the NNAPI. Their experimental results showed that properly partitioned models can achieve significant speedups using NNAPI when compared with pure TVM-generated CPU inference.
Intel's Distiller (Zmora et al., 2019) is an open-source Python library for neural network compression that provides tools to explore and experiment with various model compression techniques, such as pruning and quantisation. Distiller supports popular deep learning frameworks such as PyTorch, allowing for the creation of customised compression pipelines tailored to specific use cases. In conjunction with AI accelerators, Distiller can help optimise AI models so that they execute efficiently on edge devices. One example was presented by Liao et al. (2021), who used pruning from Distiller as an input to perform sparse compression, yielding a compressed model with a reduced memory footprint and improved inference speed.
A study (Yang et al., 2023) integrated quantisation techniques with the RISC-V Packed extension (P extension) for efficient AI computing. The method involves adding a fixed-point type, supported by a 16-bit integer type and saturation instructions, to replace the original 32-bit floating-point type in the RISC-V P extension. This approach aligns with linear-based quantisation techniques since it reduces the memory footprint and computational complexity by using lower-precision representations. The authors proposed an autotuning method that employs a uniform selector mechanism to determine the optimal binary point position for using fixed-point types. The work highlighted the potential of combining quantisation techniques with hardware optimisations to improve AI inference efficiency on edge devices.
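To make the role of the binary point position concrete, the following sketch converts floating-point values to a signed 16-bit fixed-point representation with saturation and then picks the number of fractional bits that minimises the worst-case round-trip error. The exhaustive search used here is a stand-in chosen for simplicity; it is not the uniform selector mechanism of Yang et al. (2023), and all function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_fixed(x: float, frac_bits: int) -> int:
    """Quantise x to signed 16-bit fixed point, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))
    return max(INT16_MIN, min(INT16_MAX, scaled))


def to_float(q: int, frac_bits: int) -> float:
    """Recover the real value represented by a fixed-point integer."""
    return q / (1 << frac_bits)


def max_abs_error(values, frac_bits: int) -> float:
    """Worst-case round-trip error for a given binary point position."""
    return max(abs(v - to_float(to_fixed(v, frac_bits), frac_bits)) for v in values)


def best_frac_bits(values, candidates=range(0, 16)) -> int:
    """Exhaustive search (an assumption) over candidate binary point positions."""
    return min(candidates, key=lambda f: max_abs_error(values, f))


samples = [0.5, -1.25, 3.75]
bits = best_frac_bits(samples)
print(bits, [to_fixed(v, bits) for v in samples])
```

Too few fractional bits lose precision, whereas too many cause large values to saturate, which is why the position has to be tuned to the value range of each tensor.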
The integration of AI compilers and AI accelerators plays a crucial role in optimising AI models, especially for edge devices that have relatively low computational power and memory resources (Sze et al., 2017). For example, PyTorch provides a mechanism for binding custom CUDA kernels or CUDA-accelerated C++/CUDA code to Python using the torch.utils.cpp_extension module (PyTorch, 2021), similar to Listing 3, which demonstrates the creation of a custom operation using TVM. By combining the strengths of AI compilers (e.g. TVM) and hardware accelerators, it is possible to achieve greater performance improvements and more efficient AI inference. As research in this area continues to advance, we expect to observe further progress in the development of more sophisticated techniques for integrating AI compilers and AI accelerators to optimise AI models for use on edge devices.
Transformers (Vaswani et al., 2017), GANs (Goodfellow et al., 2020), and RNNs (Rumelhart et al., 1985) encompass diverse areas within AI. Transformers are pivotal in natural language processing, leveraging attention mechanisms to comprehend different parts of an input sequence. GANs generate synthetic data that closely resemble the input, achieved through a competitive process between two neural networks. RNNs excel in detecting patterns in sequential data, rendering them indispensable for tasks such as text or speech processing. These architectures pose distinct computational challenges and require specific optimisation strategies. Our current work, while centred around quantisation and edge device optimisation, establishes a flexible framework. In future research, this framework could be extended to carry out comprehensive experiments on these varied model architectures, enhancing the potential of AI deployment on edge devices. Furthermore, we acknowledge the importance of energy efficiency for edge devices, and in our future work, we intend to incorporate energy consumption evaluations (Y.H. Chen et al., 2016; Wang et al., 2017) alongside performance metrics.

Conclusion
We have described a methodology for incorporating MediaTek Neuron into TVM, enabling a more advanced quantisation flow. Importantly, we developed a technique for transforming the quantised representation from an operator-oriented approach into a tensor-oriented approach. This design allows developers to take advantage of the APU in the SoC architecture to expedite applications that use TVM for inference without additional effort, thus streamlining the overall development and application process. Our findings indicate that combining the TVM CPU with the Neuron APU in specific models yields superior performance relative to Neuron-only and TVM-only configurations. In the case of quantised models, our approach using APUs can achieve up to a 70× speedup compared to AutoTVM CPU. For FP32 floating-point models, the proposed BYOC to Neuron yields speedups of up to 11×. The primary advantage of our approach lies in its ability to leverage the strengths of various hardware configurations while using TVM as a fallback to compensate for their limitations. For example, in cases where specific operations are not supported by the Neuron APU, we can fall back to using TVM or leverage TVM AutoScheduler to maintain compatibility while still achieving respectable performance. When operations are supported by the APU, we use the Neuron APU to maximise execution efficiency. This work underscores the potential of our design approach for accelerating AI performance and simplifying the development process, ultimately contributing to advancements in edge computing and AI optimisation.

Figure 1. Relay AST, which is obtained from parts of Mobilenet_v1.

Figure 2. Brief overview of the TVM BYOC flow.
Simplified definition of ExprVisitor.
Quantization representation format of Neuron.

Figure 6. Performance speedups relative to the native TVM CPU for TVM with AutoScheduler, TVM with BYOC to the Neuron CPU, and the pure Neuron CPU (for fp32 models).

Figure 8. Performance speedups relative to the TVM CPU for TVM with AutoTVM CPU, TVM with BYOC to Neuron CPU, and the pure Neuron CPU (for int8 models).

Figure 9. Performance speedups relative to using TVM with AutoTVM CPU for TVM with BYOC to the Neuron APU/CPU, and the pure Neuron APU/CPU (for int8 models).

Figure 10. Performance speedups relative to using TVM with BYOC to the Neuron CPU for TVM with BYOC to Neuron APU/CPU, and the pure Neuron APU/CPU (for int8 models).

Figure 11. Performance speedups relative to using TVM with AutoScheduler CPU for TVM with BYOC to the Neuron CPU/APU on Non-TFLite models.

• VarNode: x is a VarNode that represents a variable in the relay graph. Variables typically serve as inputs or outputs for different operations, storing intermediate values during computation.
• ConstantNode: const0 and const1 are examples of ConstantNodes. These nodes represent constant values or fixed tensors in the relay graph, such as weights, biases, or other fixed parameters that remain unchanged during computation.
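The node kinds described above can be mimicked with a few plain Python classes, which also makes the post-order traversal used during conversion easy to see. This is a minimal sketch that does not use TVM itself: the classes Var, Constant, and Call, the function post_order, and the operator names conv2d and bias_add are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Var:
    name: str  # a graph input or intermediate variable, e.g. x


@dataclass
class Constant:
    name: str  # a fixed tensor such as a weight or bias, e.g. const0


@dataclass
class Call:
    op: str  # operator name
    args: List["Expr"] = field(default_factory=list)


Expr = Union[Var, Constant, Call]


def post_order(expr: Expr) -> List[str]:
    """Return node labels in post-order: arguments first, then the call."""
    order: List[str] = []

    def visit(node: Expr) -> None:
        if isinstance(node, Call):
            for arg in node.args:
                visit(arg)
            order.append(node.op)
        else:
            order.append(node.name)

    visit(expr)
    return order


x = Var("x")
conv = Call("conv2d", [x, Constant("const0")])
bias = Call("bias_add", [conv, Constant("const1")])
print(post_order(bias))  # ['x', 'const0', 'conv2d', 'const1', 'bias_add']
```

Visiting arguments before the call that consumes them guarantees that every input of an operation has already been seen when the operation itself is handled. A real Relay graph is a DAG with shared nodes, so a production visitor would also memoise visited nodes; the sketch omits this for brevity.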

Algorithm 1: Enabling the flow from the relay module to the NIR module
Input: R: a Relay module
Result: NIR_module: a converted version of the input Relay module for the Neuron Platform
2 Initialize TVMNIRCompiler class, inheriting from ExprVisitor.
3 Traverse Relay AST using post-order DFS.
4 for each node in R do
5   Override VisitExpr_ based on node type
12   end
13   Update NodeEntry inputs and outputs.
14   Insert NodeEntry into nodeDict.
15 end
16 for TupleNode and CallNode do
17   Process the child nodes and update the inputs and outputs.
18 end
19 for CallNode do
20   Create the Neuron operation using opTable and update nodeDict.
21 end
22 end

3.2.1. Converting the TVM structure to a Neuron structure
Integrating TVM and Neuron requires conversion of their different data structures. Table 1 shows the mapping table for the Relay IR structure and the NIR structure. Additional details on supported NeuronOperationType, NeuronOperandType, and Quantize Parameters can be found in MediaTek's GitHub repository (tflite-neuron-delegate, n.d.) in the file neuron_types.h.

Table 1. Mapping table of the TVM Relay IR and Neuron IR structure.
To minimise coupling and enhance scalability, we employ polymorphism, C++ alias templates, and template specialisation to implement the converter for each operator. Lines 1 to 8 in Listing 3 define the base class of the operator handler, named OpSetup, where all derived classes override the CreateOperation function. For example, lines 10 to 19 present a class template named BinaryOpSetup that extends OpSetup, and its CreateOperation function generates different binary operator layers depending on the provided template argument. Lines 21 to 24 show template specialisations that use C++ alias templates to instantiate the derived class of a binary operator. Lastly, lines 26 to 37 demonstrate the process of placing each supported operator and the corresponding OpSetup into the opTable, enabling the conversion of different operators in a unified manner via polymorphism.
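The dispatch pattern described above can be sketched in Python as a rough analogue of the C++ design; it is not the code of Listing 3. A base class declares the conversion method, one parameterised subclass covers all binary operators, and a table maps operator names to handler instances. The Python method name create_operation, the function convert, the table contents, and the operation labels "ADD" and "MUL" are placeholders rather than actual Neuron identifiers.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class OpSetup(ABC):
    """Base handler: each subclass knows how to build one kind of operation."""

    @abstractmethod
    def create_operation(self, inputs: List[str], output: str) -> dict:
        ...


class BinaryOpSetup(OpSetup):
    """One handler class shared by all binary operators (plays the template's role)."""

    def __init__(self, target_op: str) -> None:
        self.target_op = target_op  # placeholder label for the accelerator operation

    def create_operation(self, inputs: List[str], output: str) -> dict:
        if len(inputs) != 2:
            raise ValueError("a binary operator needs exactly two inputs")
        return {"type": self.target_op, "inputs": list(inputs), "output": output}


# Operator table: source operator name -> handler instance (contents are hypothetical).
OP_TABLE: Dict[str, OpSetup] = {
    "add": BinaryOpSetup("ADD"),
    "multiply": BinaryOpSetup("MUL"),
}


def convert(op_name: str, inputs: List[str], output: str) -> Optional[dict]:
    """Convert one call through the table; None signals an unsupported operator."""
    handler = OP_TABLE.get(op_name)
    return None if handler is None else handler.create_operation(inputs, output)


print(convert("add", ["a", "b"], "c"))
print(convert("softmax", ["c"], "d"))  # None: this operator is not in the table
```

Because the caller only touches the table and the base-class interface, supporting a new operator means adding one table entry and, at most, one handler class, which is the decoupling the design aims for.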

Table 3. MAC counts and number of sub-graphs in TFLite fp32 models.

Table 4. MAC counts and numbers of sub-graphs in TFLite int8 models.