Elementary operations: a novel concept for source-level timing estimation

ABSTRACT Early application timing estimation is essential for decision making during design space exploration of heterogeneous embedded systems, in terms of hardware platform dimensioning and component selection. These decisions, which have a major impact on project duration and cost, must be made before a platform prototype is available and before software code is ready to be linked, so timing estimation must be done using high-level models and simulators. Because of the ever-increasing need to shorten time to market, reducing the amount of time required to obtain the results is as important as achieving high estimation accuracy. In this paper, we propose a novel approach to source-level timing estimation with the aim of closing the speed-accuracy gap by raising the level of abstraction and improving result reusability. We introduce the concept of elementary operations: distinct parts of source code which enable capturing platform behaviour without having an exact model of the processor pipeline, cache etc. We also present a timing estimation method which relies on elementary operations to craft a hardware profiling benchmark and to build application and platform profiles. Experiments show an average estimation error of 5%, with a maximum below 16%.


Introduction
Systems on Chip (SoC), which are used to run modern complex applications, must have a heterogeneous structure of processing, memory and communication elements to meet high performance, energy efficiency and low price goals. Due to the exponential growth of heterogeneous system complexity, it is estimated that designers' productivity will have to increase up to ten times to successfully meet system requirements and constraints within similar time and cost limits [1].
The key to success is making good decisions in early design stages, before assembly of the first prototype. Raising abstraction level in all design phases enables separation of computation from communication and using separated application and platform models. This leads to a more efficient approach to design space exploration (DSE) [2]. Early timing estimation is one of the most important phases in DSE. In recent years, the traditional approach using highly accurate Instruction Set Simulator (ISS) has been replaced by high-level timing estimation models which enable obtaining estimates in early design stages [3][4][5][6][7][8][9][10].
In this paper, we propose a source-level application execution time estimation method based on a concept named elementary operations, which enables capturing architectural effects and the influence of compiler optimizations. The estimation method consists of two phases: analysis and estimation. In the analysis phase, the application and the platform configurations considered for the design are profiled. The application profile is obtained by transforming application source code into a list of elementary operations structured in loops, branches and sequences. It is independent of the platform and compiler optimization level, and hence the same application profile can be used to estimate execution time on any platform. The platform profile is obtained by executing a benchmark entitled "ELOPS benchmark" 1 on every platform configuration and for each compiler optimization level separately. This benchmark was specially crafted as a part of this research to measure execution times of elementary operations on real platforms. The results of the benchmark run on each platform configuration make up the platform profile for that configuration. In the estimation phase, the proposed timing estimation algorithm combines the application and target platform profiles to provide a timing estimate.
The accuracy of our approach is evaluated using the JPEG image compression algorithm and the Advanced Encryption Standard (AES) algorithm on several hardware configurations based on two RISC processors, ARM Cortex-A9 and MicroBlaze, custom-built for the Xilinx Zynq-based ZC706 platform. The achieved accuracy (i.e. error rate) is similar to the most accurate state-of-the-art source-level timing estimation methods. The strong point is a significant reduction in the time and effort required to obtain results, due to the reusability of application and platform profiles. Also, the proposed method can be easily scaled for systems with a hundred or more elements of the same type, in a similar fashion to the method demonstrated in [11].
The rest of this paper is organized as follows. Section 2 provides an overview of the current state of the art in the area of high-level timing estimation. The proposed method for source-level timing estimation is presented in Section 3. The flow of application timing estimation is described in Section 4. Test cases used for evaluating the proposed method are given in the first part of Section 5. The results of the conducted experiments are presented and discussed in the rest of the section.

Related work
Authors in [12] propose a source-level simulation infrastructure that provides a full range of performance, energy, reliability, power and thermal estimation. For timing simulation, they build upon their previous work [3], which uses a simulation-based approach with back annotation at the intermediate representation (IR) level. Simulating pipeline effects on basic block boundaries requires additional pair-wise block simulation for every possible block pair combination on a cycle-accurate reference. They consider a high-level cache model by reconstructing target memory traces solely based on IR and debugger information. Simulation of the entire application execution is done using SystemC and transaction-level modeling (TLM) [13], with an estimation error below 10%.
Other approaches use machine learning and mathematical models for early timing estimation. Authors in [8] use artificial neural networks (ANN). The ANN gives a timing estimate based on execution time and the total number of instructions of each type. The estimation error is around 17%, but the method is much more flexible than simulation methods and provides a higher level of result reusability. After the initial training period, estimation results are obtained rapidly.
Methods presented in [9] and [10] are based on linear regression and ANNs, with higher error rates of around 20%. Authors in [14] use a model tree-based regression technique as their machine learning method of choice.
Authors in [11,15,16] propose hybrid methods: first, simulation is used to obtain the execution time of each procedure on each type of processing element; then, analytical methods are used to account for cache and communication effects.
In [17], authors use linear regression for calculating timings, but they use a set of specially crafted training programs to identify the instruction costs of an abstract machine. The authors try to capture the effects of cache, pipeline and code optimization by crafting examples with longer instruction sequences and loops. However, since they rely on IR, they face challenges when introducing code optimizations, because the virtual instructions in the translation of the training program are no longer in close correspondence with the compiled version.
The concept of elementary operations was first introduced in [18] in an attempt to characterize platform behaviour without having an exact hardware model. That preliminary method for early timing estimation lacked compiler optimization support and the ability to estimate input-dependent application tasks. In this paper, we extend our previous work and present an improved method.

Elementary operations approach
Our approach is based on decomposing a piece of source code written in the C programming language (standard C11) into elementary operations: distinct parts of source code which enable capturing platform behaviour without having an exact model of the processor pipeline, cache etc. The set of elementary operations is finite, with several subsets: integer, floating point, logic and memory operations. These subsets correspond to parts of a RISC-like processor architecture and its memory datapath.

Classification of elementary operations
We propose a multi-level elementary operations classification scheme. The top level contains four operation classes: INTEGER, FLOATING POINT, LOGIC and MEMORY. The second level of classification is based on the origin of operands (i.e. their location in the memory space): local, global or procedure parameters. This stems from the difference in locality due to the way the compiler implements operands stored in different parts of the memory space. It is expected that each group will show different timing behaviour: local variables, being heavily used, are almost always in cache, while global and parameter operands must be loaded from an arbitrary address and can cause a cache miss. The third level of classification is by operand type: (1) scalar variables and (2) arrays of one or more dimensions. Pointers are treated as scalar variables when the value of the pointer is given using a single variable, or as arrays when the value of the pointer is given using multiple variables.
Operations which belong to the INTEGER and FLOATING POINT classes are: addition (ADD), 2 multiplication (MUL) and division (DIV). The LOGIC class contains logic operations (LOG), i.e. and, or, xor and not, and shift operations (SHIFT), i.e. operations that perform bit-wise movement (e.g. rotation, shift, etc.). Operations in the MEMORY class are: single memory assign (ASSIGN), block transaction (BLOCK) and procedure call (PROC). MEMORY BLOCK represents a transaction of a block of size 1000 and can only have array operands. MEMORY PROC represents a function call with one argument and a return value. Arguments can be variables and arrays, declared locally or given as parameters of the caller function, but never global. All of these operations are listed in Table 1. Abbreviations indicated in the table are used further on when referring to a specific class. The sample source code given in Figure 1 illustrates how code statements can be correlated to the elementary operations classification scheme. Each operation is denoted using abbreviations from Table 1.
The accuracy of timing estimation using the proposed classification scheme was analysed on two RISC processors: ARM Cortex-A9 and MicroBlaze, implemented on the Xilinx Zynq-based ZC706 platform. First, the actual execution time of each elementary operation from Table 1 was measured for each processor. Each operation was repeated in a for-loop a thousand times to compensate for timer setup and to create a context that captures the effects of compiler optimizations, cache and pipeline. Then, test cases were crafted so as to contain constructs commonly found in real-world application code. For each test case, elementary operations were identified and, using the previously obtained execution times, a timing estimate was calculated. Finally, each test case was executed on both target processors in order to obtain actual execution times and compare them to the estimated ones.
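The kind of correlation shown in Figure 1 can be sketched on a small illustrative fragment (this is not the actual code from the figure; the class annotations in the comments use the abbreviations from Table 1):

```c
int g_sum;  /* global scalar variable */

/* data and n are procedure parameters; local is a local scalar */
int accumulate(int *data, int n)
{
    int local = 0;
    for (int i = 0; i < n; i++)
        local = local + data[i];  /* INT ADD with par_arr and loc_var operands */
    g_sum = local * 2;            /* INT MUL and MEM ASSIGN to a glob_var */
    return g_sum;
}
```

Each statement maps to one or more elementary operations, and the operand origins (parameter array, local variable, global variable) determine the second-level class.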

Sequence of operations
A sample source code given in Figure 2 illustrates four examples of sequences of operations: (1) five INT_loc_var ADD operations in a single statement; (2) a for-loop with five statements in a sequence, each containing one INT_loc_arr ADD operation. It must be noted that in our approach for-loops are considered to be an implicit part of operations with an array type of operands. This is because the execution time of each elementary operation is measured in a loop, so all overhead added by the loop is already included in the obtained timings.
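The first two kinds of sequences might look as follows (an illustrative sketch, not the exact code from Figure 2):

```c
/* (1) six local scalar operands, five INT_loc_var ADD operations
   in a single statement; (2) a for-loop with five statements in a
   sequence, each containing one INT_loc_arr ADD operation (the loop
   itself is treated as an implicit part of the array operations). */
int sequences(int arr[5])
{
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;

    int r = a + b + c + d + e + f;   /* (1) five ADDs, one statement */

    for (int i = 0; i < 5; i++) {    /* (2) five sequential statements */
        arr[0] += i;
        arr[1] += i;
        arr[2] += i;
        arr[3] += i;
        arr[4] += i;
    }
    return r;
}
```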
According to the initial classification scheme proposal, each elementary operation is treated separately during code analysis and timing estimation. Estimated and actual execution times for the source code in Figure 2 are presented in Table 2. Estimation error is calculated as |t_estimated − t_actual| / t_actual · 100%. It can be observed that execution time per elementary operation decreases exponentially with the increase in the total number of operations in a sequence. The same behaviour is observed for all other types of elementary operations on both processors, but is not shown here for the sake of brevity. This leads to the conclusion that, due to pipelining and decreasing loop overhead, sequence length plays an important role when estimating timings of sequences of operations which belong to the same class. Experiments also show that by measuring timings only for several lengths, such as 2, 3, 4, 5, 10, 20 and 50, an approximation with less than 10% error can be made for any other sequence length. This makes profiling a much faster and more efficient process.
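One simple way to approximate per-operation time for an arbitrary sequence length from the few measured lengths is linear interpolation between the two nearest measured points (a sketch; the per-operation times below are hypothetical values for one elementary operation class, not measured data from the paper):

```c
#include <stddef.h>

/* Measured sequence lengths and hypothetical per-operation times (ns). */
static const int    lens[]  = { 2, 3, 4, 5, 10, 20, 50 };
static const double t_per[] = { 40.0, 32.0, 27.0, 24.0, 18.0, 15.0, 13.0 };
#define N_LENS (sizeof lens / sizeof lens[0])

/* Approximate per-operation time for any sequence length by linear
   interpolation between the two nearest measured lengths; lengths
   outside the measured range are clamped to the boundary values. */
double per_op_time(int len)
{
    if (len <= lens[0])          return t_per[0];
    if (len >= lens[N_LENS - 1]) return t_per[N_LENS - 1];
    for (size_t i = 1; i < N_LENS; i++) {
        if (len <= lens[i]) {
            double f = (double)(len - lens[i - 1]) / (lens[i] - lens[i - 1]);
            return t_per[i - 1] + f * (t_per[i] - t_per[i - 1]);
        }
    }
    return t_per[N_LENS - 1];  /* unreachable */
}
```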

Operations with mixed types and origin of operands
The source code in Figure 4 represents four cases of operations with mixed type and origin of operands. The initial elementary operations classification scheme proposal does not give explicit specifications for determining the elementary operation class in such cases. Thus, we additionally introduce origin priority. Priorities are defined based on the difference in locality due to the way the compiler implements each operand type, and the idea is to select the operation class based on the operand with the highest priority (from highest to lowest): (1) parameter array, (2) global array, (3) local array, (4) parameter variable, (5) global variable, (6) local variable. All operations in the given example are classified as operations with global operands, even though OP1, OP3 and OP4 contain local variables, while OP2 and OP4 contain local array operands. The comparison of estimated and actual execution times for these test cases is presented in Table 3. It shows that in the case of OP2 and OP3, where local operands are present, the estimation error goes over 20%. The error for OP4 is slightly lower, but this is probably because the proposed method gives an underestimation in the case of local arrays (OP2) and an overestimation in the case of local variables (OP3), so the errors even out.
These results indicate that it is necessary to modify the existing solution by expanding the origin priority-based approach and giving each operation additional attributes to denote different types and origins of operands. The attribute mod will be added when an operation has operands of mixed types and origin. The attribute value will indicate the following cases: (1) presence of at least one variable operand of the same origin, (2) presence of a global variable operand, (3) presence of a global array operand, (4) presence of a local variable operand, (5) presence of a local array operand, and (6) presence of constants. The list of these values is given in Table 5 in the column Values for the operation modifier attribute mod.
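The origin-priority rule can be expressed as a small selection function (an illustrative sketch; the enum and function names are ours, not the paper's actual implementation):

```c
/* Operand origins ordered by the proposed priority, highest first,
   so the smallest enum value has the highest priority. */
typedef enum {
    PAR_ARR,   /* (1) parameter array    */
    GLOB_ARR,  /* (2) global array       */
    LOC_ARR,   /* (3) local array        */
    PAR_VAR,   /* (4) parameter variable */
    GLOB_VAR,  /* (5) global variable    */
    LOC_VAR    /* (6) local variable     */
} origin_t;

/* The operation class is selected from the operand with the
   highest priority among all operands of a statement. */
origin_t classify_origin(const origin_t *operands, int n)
{
    origin_t best = operands[0];
    for (int i = 1; i < n; i++)
        if (operands[i] < best)
            best = operands[i];
    return best;
}
```

For example, a statement mixing a local variable, a global array and a local array would be classified by its global array operand, matching the classification of OP2 in the text.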

Array operands index
The index of array operands can have more than one dimension and/or can be calculated using the values of more than one variable. The same applies to struct types in C source code. The sample source code in Figure 5 illustrates such examples. OP1 is an example of a MEM_par_arr ASSIGN operation with a 3-dimensional array, and OP2 is a similar example with a struct containing a 2-dimensional array. At this point, operations with structs are classified as operations with arrays. Operations OP3 to OP7 are examples of arrays with an index calculated based on the values of several variables. Table 4 shows the results for the code in Figure 5. In the case of OP1 and OP2 the estimation error is above 60%. Almost the same error is observed for the 3-dimensional array and for the struct containing a 2-dimensional array (making it also a 3-dimensional structure). In the case of OP3 to OP7 it can be observed that the results are severely underestimated, and that the underestimation increases as the number of operations required for index calculation increases.
These results suggest that the initial classification scheme, which does not recognize multiple dimensions and index structure, should be extended even further. Several attributes will be added to operations with arrays to indicate specifics about the index type. These attributes will indicate (1) the type of array index, which can be simple (given using a single variable) or complex (calculated based on two or more variables and a constant). The relevant attributes and their values (Table 5) are:

Attribute group        | Attribute | Values
Sequence of operations | seq       | positive integer
Operation modifier     | mod       | "var" - at least one variable operand of the same origin is present; "glob_var" - at least one global variable operand is present; "glob_arr" - at least one global array operand is present; "loc_var" - at least one local variable operand is present; "loc_arr" - at least one local array operand is present; "const" - at least one constant operand is present
Index modifier         | type      | "simple" - index is given as a single variable; "complex" - index must be calculated using more than one variable; "const" - index is a constant value

The list of these attributes is given in Table 5 under the index modifier attribute description. Finally, since arrays and structs show similar timings, they will continue to be treated equally.

Classification scheme attributes overview
Based on the previously discussed observations, the classification scheme is extended to incorporate the proposed modifications. These extensions are included as attributes of each class of elementary operations. According to the three groups of possible cases which have an effect on the execution duration of the elementary operations defined in Table 1, three groups of attributes are listed in Table 5 under the column Attribute group. A sequence of operations is denoted with the attribute seq. The operation modifier group contains the attribute mod, which is present in operations with operands of mixed types and origin. Its value is a space-separated list which can contain one or more of the elements listed in Table 5. Index modifiers is a group of four attributes, all listed in Table 5, which are added to elementary operations with arrays.

Application timing estimation
The proposed application execution time estimation method based on elementary operations consists of an analysis and an estimation phase, as indicated in Figure 6. The first step of the analysis phase is platform profiling. In this step, the specially crafted ELOPS benchmark, described later, is compiled and run on every platform configuration and for each optimization level separately. The platform profile is created based on the results of the benchmark runs and contains the timings of elementary operations. The second step is application profiling. The application profile is a transformation of the original C source code into a list of elementary operations structured in loops, branches and sequences. Application profiling is done only once, on the original C source code. For this purpose, common compiler constructs such as the abstract syntax tree (AST) and the control and data flow graph (CDFG) are used. Application and platform profiles created during the analysis phase are permanently stored in a database.
In the estimation phase, the platform and application profiles are first retrieved from the database. Then a timing estimation algorithm, described later in this section, combines the application and platform profiles to provide a timing estimate.

Platform profiling
Platform profiling starts with the execution of the ELOPS benchmark 3 on every platform configuration considered for the final design. A platform configuration is a pair of a specific processor and a memory connected to it, used to store instructions and data.
The ELOPS benchmark is designed based on the elementary operations classification scheme to measure the execution time of each operation from Table 1 and the timing effect of every possible attribute listed in Table 5. For each operation sub-class listed in Table 1 (e.g. INTEGER ADD, LOGIC SHIFT etc.), three main groups of benchmark entries are defined: local, global and parameters. Each group contains two sub-groups: variable and array. The array sub-group branches further by two criteria: array index type and dimension. This means that for each operation from Table 1 and for each operand origin group, there are five base benchmark entries. All benchmark entries are systematized in Table 6.
Each base benchmark entry has sub-variants in which different lengths of sequences of operations are measured. A distinction is made between multiple occurrences of the same operation in one statement, named sequential operations, and a sequence of statements belonging to the same elementary operation class, named sequential statements. In our current implementation, the benchmark contains entries for the following sequence lengths: 2, 3, 5 and 10.
All measurements are done by executing an operation in a loop a thousand times to compensate for timer setup effects and to create a context which better captures the effects of optimizations and of hardware features such as cache and pipeline. A special case are two elementary operation sub-classes: MEMORY BLOCK and MEMORY PROC. The MEMORY BLOCK class is measured as a single transaction of a block of size 1000 (using the memcpy function) and it can have only array operands. The MEMORY PROC class is measured as a function call with one argument and a return value.
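A benchmark entry of this kind might look as follows (a minimal sketch using the portable clock() timer; the real ELOPS benchmark presumably uses a platform-specific hardware timer, and the volatile qualifier here stands in for whatever mechanism keeps the compiler from eliding the measured loop):

```c
#include <time.h>

#define REPS 1000

/* Measure an INT local-variable ADD operation executed REPS times.
   Loop overhead is deliberately included in the result, since loops
   are treated as an implicit part of the measured context. */
double measure_int_loc_var_add(void)
{
    volatile int a = 1, b = 2, r = 0;
    clock_t start = clock();
    for (int i = 0; i < REPS; i++)
        r = a + b;                /* the measured elementary operation */
    clock_t end = clock();
    (void)r;
    return (double)(end - start) / CLOCKS_PER_SEC / REPS;
}
```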
In our implementation, the benchmark does not contain entries for arrays with an index of dimension higher than 3, because at this point there was no code in our test applications which contained structures of higher dimensions.
In order to enable accurate estimations for code compiled with optimizations, the benchmark has to be compiled and run separately for all optimization levels. This way, the same optimizations which occur in, for example, looped or sequential execution in the source code will also be present in the benchmark code. The measurements are then combined into a platform element profile in an XML document.

Application profiling
Application profiling uses the code analysis and profiling method described in [19], slightly adapted to be compatible with the elementary operations classification scheme. Application source code processing starts with the generation of call-tree statistics to produce profiling information at the procedure call-graph abstraction level. The compiler transformation flow starts with parsing the source code into the abstract syntax tree (AST). During recursive traversal of the tree, information about data structures, types of variables and procedure arguments is used to identify elementary operations according to the proposed classification scheme. The AST is further transformed into a control and data flow graph (CDFG) representation by recursive traversal of the tree, with the introduction of temporary variables that form the three-address code statement notation. During that process, the key is recognizing points where the uniform instruction flow is broken by condition testing in branch or loop jump conditions. The final application profile is obtained by unifying procedure call statistics and the profiles obtained using the AST and CDFG for each procedure separately.
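The introduction of temporary variables into three-address form can be illustrated on a simple statement (illustrative only; the temporary names t1 and t2 are arbitrary):

```c
/* Original statement:        r = a + b * c[i];
   Three-address equivalent:  t1 = b * c[i];
                              t2 = a + t1;
                              r  = t2;
   Each three-address statement maps to one elementary operation. */
int three_address(int a, int b, const int *c, int i)
{
    int t1 = b * c[i];   /* INT MUL with array operand */
    int t2 = a + t1;     /* INT ADD with local operands */
    return t2;
}
```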
The output of the entire process is the application profile: an abstract model written in an XML structure. In it, the original application source code is transformed into a multi-level structure of elementary operations organized in loops, branches and sequences. Each application is composed of one or more procedures which directly correspond to procedures (functions) in the original C source code. Procedures can contain any number of loops, branches or operations. A loop represents a for or a while loop, and a branch represents an if-else or switch-case conditional construct. Loops and branches can have any number of loops, branches and operations as sub-elements. An operation represents a single statement or a sequence of operations that has been assigned an elementary operation class. Operations have attributes which cover the extension to the classification scheme as discussed in Section 3.1.1. All possible profile elements and attributes are listed in Table 7.
For applications which have data-dependent behaviour, the precision of profiling can be highly dependent on input data at run-time. In such cases, loop iteration counts or branch condition evaluation results cannot be resolved without simulation and analysis of variable data values. Since these values define the number of possible execution paths through the application source code, both during analysis of the hierarchical task graph and during formation of the control and data flow graph, estimation must rely on one or more simulation runs to determine either the exact number, or upper and lower boundaries and statistical probabilities for these values. In our research so far, we have employed the commonly accepted approach of running instrumented code on a host PC (i.e. host-compiled) for determining data-dependent behaviour [3,4,7].

Timing estimation
After obtaining both the application and platform profiles, the final step is to combine the two to estimate the application execution time. The algorithm is described briefly using pseudo-code in Figure 7.
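The combination step can be sketched as a recursive walk over the application profile, where operation nodes contribute their timings from the platform profile and loops multiply the time of their body by the iteration count (a simplified sketch of the idea behind Figure 7; the node kinds and field names are our assumptions, not the paper's actual data structures):

```c
typedef enum { NODE_OP, NODE_SEQ, NODE_LOOP } node_kind_t;

typedef struct node {
    node_kind_t   kind;
    double        op_time;     /* NODE_OP: time from the platform profile */
    int           iterations;  /* NODE_LOOP: iteration count              */
    struct node **children;    /* NODE_SEQ / NODE_LOOP sub-elements       */
    int           n_children;
} node_t;

/* Estimate the execution time of a profile subtree by summing operation
   timings and scaling loop bodies by their iteration counts. */
double estimate(const node_t *n)
{
    double t = 0.0;
    switch (n->kind) {
    case NODE_OP:
        return n->op_time;
    case NODE_SEQ:
        for (int i = 0; i < n->n_children; i++)
            t += estimate(n->children[i]);
        return t;
    case NODE_LOOP:
        for (int i = 0; i < n->n_children; i++)
            t += estimate(n->children[i]);
        return t * n->iterations;
    }
    return 0.0;
}
```

Branches would additionally need either exact condition outcomes or branch probabilities from the instrumented host runs described in Section 4.2.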
Finally, it is important to emphasize the reusability of the proposed method. Each application needs to be profiled only once, and the obtained profile can be used in the future as is for any platform configuration at hand. In the same manner, each type of platform element needs to be profiled only once, and the obtained data can be reused for timing estimation of any other application. The reusability of profiling results also helps achieve better scalability when building platforms with multiple elements of the same type.

Experimental setup and results
Verification of the elementary operations approach has been done on the following commonly used real-world applications: (1) the Advanced Encryption Standard (AES) [20], used in two different implementation versions:
(a) AES_G - the first version of the AES, where data is accessed via global variables, (b) AES_P - the second version of the AES, where data is accessed via procedure parameters; (2) the JPEG image compression algorithm, using the implementation described in [21].
This particular set of applications encompasses all types of elementary operations and represents well the key features of applications for which heterogeneous embedded systems are used most often: multimedia, compression and encryption.
The Xilinx Zynq ZC706 reconfigurable evaluation board has been chosen as the target platform. Three configurations, each composed of one processor and one memory element, have been used: (1) MB1 - MicroBlaze, a 32-bit RISC Harvard architecture soft processor core in the following configuration: 5-stage pipeline with hardware multiplier, barrel shifter and floating-point unit, operating at 200 MHz. The processor is connected to 128 KB of FPGA-based BRAM memory, also operating at 200 MHz, via the local memory bus (LMB). This memory is used for storing both instructions and data. (2) ARM1 - a single core of the ARM Cortex-A9 processor in the following configuration: operating frequency of 667 MHz, 32 KB L1 cache and 512 KB L2 cache for both instructions and data. These configurations are a popular general-purpose choice for low-power or thermally constrained, cost-sensitive devices (e.g. smart-phones, digital TV, and both consumer and enterprise applications enabling the Internet of Things).

Test cases
AES_G and AES_P have been tested for an example input of 32 bytes of data, and JPEG has been tested on the Lenna image. Each test case and each platform configuration have been profiled, and timing estimates have been calculated based on these profiles using the method described in Section 4. Then, each test case has been executed on each platform configuration to obtain actual timings. Total execution time for AES_G, AES_P and JPEG was measured as the time taken for the entire application to run. Parts of AES_G and AES_P (AddRoundKey, ShiftRows, etc.) were measured in 1000x loops, because of the very small time scale, to negate timer setup effects. Parts of the JPEG application were not measured in a loop, because their time scale is orders of magnitude larger than the timer setup overhead. Tests were performed for optimization levels O0 to O2. Optimization level O3 has not been considered, since level O2 is still the option recommended by most embedded systems manufacturers in order to avoid potentially incorrect execution if the source code is not written exactly following the C standard [6]. Results for the second AES implementation, AES_P, for all three platform configurations are presented in Table 9. The achieved average error is around 6%, with a minimum below 1% and a maximum below 16%. Table 10 shows results for the JPEG test case, with timing estimation done for all three configurations. To summarize, for all three test applications and for all three target platform configurations, estimation accuracy remains approximately at the same level, with an average error around 5% and a maximum error below 17%. Estimation accuracy shows no significant degradation for any level of compiler optimization. Even for the ARM2 configuration there is no deviation in error rate compared with the results on the other two configurations, although this particular configuration is more sensitive to cache effects because the processor communicates with a very slow memory.
In the case of an inability to accurately capture cache hits, the method would give an overestimation; in the case of a cache miss, an underestimation. However, it must be noted that for all three test cases there was a much larger chance of a cache hit than a cache miss, because the memory footprint of each of these applications remains within the range of 50 KB to 250 KB. This means they fit well into cache sizes typical for embedded processors like ARM, and the likelihood of cache hits is much larger. On the other hand, all three test applications represent well, in both size and structure, the common tasks for which embedded systems are used: signal processing, vector and matrix operations, numeric calculations, search and sorting [22]. Compared to the results achieved by analytical methods, which have an average error in the range from 17% [8] to 20% [9,10], our results are better. They are, however, slightly worse than those obtained using simulation methods, which achieve an estimation error below 10% in the worst cases [3][4][5][6][7]. But compared to simulation methods, the strong point of our method is the reusability of profiling results, because both application and platform profiles can be reused in the future. In that way, our method enables obtaining accurate source-level estimation in a shortened amount of time and helps close the gap between accuracy and speed.

Conclusion and future work
In this paper, we have proposed a method for source-level application execution time estimation in a heterogeneous computing environment, based on a concept named elementary operations. The method features a classification scheme used for identifying elementary operations in the source code. It enables profiling applications and platforms in a way which successfully handles compiler optimizations, pipeline and cache effects. This enables providing accurate application timing estimation while keeping the required effort within reasonable limits. Based on the classification scheme, the ELOPS benchmark is designed to measure the execution time of each elementary operation type, within a context such as loops and sequences of operations, on real platforms.
Experimental results show the estimation error to be around 5%, with a maximum below 17%, which is comparable to the best state-of-the-art simulation methods. The strong point of this method is that platform profiling needs to be done only once for each hardware configuration, and these results can be reused later for any other application executed on the same hardware. The same applies to application profiling: each application has to be profiled only once, and the obtained profile can be used as is for any platform configuration at hand. Reusability of profiling results helps achieve better scalability when building platforms with multiple elements of the same type.
In the future, emphasis will be put on the full integration of the method into the design space exploration process for heterogeneous multi-processor and multi-memory environments, to eliminate the need to re-link and recompile source code using different development environments. Finally, application analysis will be improved by automating the instrumentation process in data-dependent parts of the code.