Exploration for Software Mitigation to Spectre Attacks of Poisoning Indirect Branches

ABSTRACT Speculative execution and branch prediction are widely used in modern superscalar processors to exploit instruction-level parallelism. Recently, researchers discovered a new class of attacks named Spectre, which exploits speculation mechanisms together with a side channel. Because speculation is so widely used, these vulnerabilities are found in many popular processors. By exploiting them, an attacker can, for example, leak host memory from inside a KVM guest. While hardware vendors are trying to fix the issues in the microarchitecture of their next-generation products, software mitigations remain desirable. Retpoline is a pure software fix developed by Google that is claimed to have a negligible impact on performance. In this paper, we look into the details of Retpoline and evaluate it with various workloads. We find that Retpoline does affect the performance of existing software, but the impact varies depending on how applications interact with the kernel. In our experiments, it causes more regression on network I/O than on storage I/O: the more a program relies on the kernel, the greater the regression it shows. To alleviate the impact, we propose using a userspace network stack, and we verify the proposal with the Netmap userspace packet I/O framework. In addition, we observe a large performance regression when Retpoline is applied to some userspace applications, such as the Perl interpreter, which are thought to be targets of the new type of attack. Finally, we review the experimental results and discuss potential future mitigations of Spectre.


INTRODUCTION
Modern superscalar processors are designed to exploit instruction-level parallelism (ILP). Speculative execution and branch prediction are the two most widely used techniques to maximize performance. Speculative execution allows the processor to execute instructions before it is certain that they should be executed, while branch prediction predicts the next instruction to fetch and avoids stalls caused by control dependencies.
Traditional security research tries to prevent sensitive information from leaking through covert channels by enforcing information-flow policies [1] or by encrypting data [2]. However, attackers can bypass these restrictions and leak sensitive data through side channels. As personal multimedia information is increasingly stored in cloud services, privacy protection becomes ever more important [3][4][5]. Recently, researchers discovered a new class of attacks named Spectre, which exploits the speculative behavior of modern processors together with side channels [6,7]. By training the branch predictor, the adversary can misdirect the processor into speculatively executing unauthorized code and leak information through a side channel. There are two main variants of Spectre: one leverages conditional branches, and the other leverages indirect branches. Mitigating these attacks is currently thought to be expensive because they build on optimization techniques common to modern processors. Since Spectre relies on speculative execution and caches, simply disabling either of the two would degrade performance severely.
Engineers from Google proposed a pure software fix called Retpoline [8,9], which addresses Variant 2 of Spectre and is claimed to have a generally negligible impact on performance. Retpoline replaces the vulnerable indirect branch instruction with a trampoline. It implements the indirect branch with a function return instruction, whose prediction does not depend on the branch target buffer (BTB). In this paper, we evaluate Retpoline with various benchmarks for both the kernel and userspace applications. We find that Retpoline affects the performance of existing software, but the impact varies depending on how applications interact with the kernel. In our experiments, it causes about 7% regression on network I/O, while little overhead is observed on storage I/O. Since the regression of in-kernel Retpoline is incurred when applications invoke the kernel, the SPECCPU benchmark is not affected. The more an application relies on the kernel, the greater the regression. To alleviate the impact, we propose using a userspace network stack and evaluate the proposal with the Netmap [10] userspace packet I/O framework. With the Netmap userspace network stack, the TCP transmission rate reaches almost the same throughput as on the non-Retpoline system. Finally, we review the experimental results and discuss potential future mitigations of Spectre.
The rest of the paper is organized as follows. Section 2 describes the background of the new attacks and the existing mitigations. Section 3 introduces the Retpoline technique and explains how it mitigates Variant 2 of Spectre. Section 4 presents the setup of our evaluation and the results of our tests. Section 5 analyzes and discusses the results.

BACKGROUND
This section introduces the background of the new attack methods. We first describe the microarchitectural mechanisms behind them and the concept of a side channel. We then look into the details of Spectre and briefly review existing mitigations.

Speculation with Branch Prediction
Exploiting ILP is one of the key goals in high performance processor design. Many techniques have been developed to maximize ILP. Branch prediction and speculative execution are two representative methods and have been adopted widely in modern superscalar processors.
When a branch instruction is executed, it may change the program counter (PC) instead of letting it increment as usual, and the PC is needed immediately by the fetch stage for the next instruction in the pipeline. Because the new PC value is known only after the branch has been resolved, the fetch of the next instruction would otherwise stall. Branch prediction is the processor's attempt to determine where execution will continue after a conditional jump, so that it can fetch the next instruction without stalling. Speculative execution goes one step further and executes the following instructions to compute their results ahead of time. If the branch prediction was correct, the results are used; otherwise they are discarded. A common way to improve the efficiency of speculation is to speculate only on instructions from the most likely execution path, that is, to combine it with branch prediction.
In other words, branch prediction determines which execution path to choose. Speculation executes instructions along the path while the condition that depends on the result of previous instructions is still being determined in the pipeline.
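The interplay of prediction and discarded speculation can be illustrated with a toy model. The sketch below assumes a classic 2-bit saturating counter, which is a textbook predictor and not necessarily what any particular processor implements; the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Toy 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Return how many times the predicted direction differed from the outcome."""
    mispredictions = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            mispredictions += 1  # work speculated on the wrong path is discarded
        predictor.update(taken)
    return mispredictions


# A loop branch that is taken nine times and then falls through once.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(loop_outcomes, TwoBitPredictor()))
```

Each misprediction in this model corresponds to a window in which a real processor would have executed instructions from the wrong path before rolling them back.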

Side-Channel Attacks
Side-channel information is information that can be obtained from the physical state of a system rather than from the plaintext itself. A side-channel attack exploits this information instead of a weakness in the system implementation. Depending on the physical characteristics of the system, adversaries can use various kinds of side channels to leak sensitive information. Timing information [11], power consumption [12,13] and electromagnetic leaks [14] are all such sources and can be exploited to perform side-channel attacks.
Modern processors use caches to bridge the speed gap in the memory hierarchy. At the same time, caches make memory access time vary depending on whether the data are cached. Cache timing attacks are a specific type of side-channel attack that exploits the effect of the cache on execution time. By measuring the time taken to access memory entries, the attacker can determine which addresses have been allocated into the cache and thereby leak information. Several techniques for exploiting the cache have been demonstrated. In Prime+Probe [15][16][17], the attacker fills one or more cache lines with its own contents, waits for the victim to execute, and then probes by timing accesses to the preloaded cache lines. A markedly increased access latency means that the lines have been evicted by the victim, who has touched an address that maps to the same set. Flush+Reload [18] works the other way around. The attacker first flushes the targeted cache lines, waits for the victim to execute, and then reloads the flushed lines while measuring the time taken. A fast access means that the line has been reloaded by the victim. Evict+Time [19] evicts some cache lines of interest and compares the victim's overall execution time with a baseline. The variation in execution time is then used to deduce whether the victim accessed the lines of interest.
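The three Flush+Reload steps can be sketched against a simulated cache. This is a toy model only: the latencies, the threshold and the two monitored line addresses are assumptions chosen for illustration, and a real attack would time actual memory accesses instead.

```python
class ToyCache:
    """Simulated cache: a set of cached line addresses with fixed hit/miss latencies."""

    HIT_CYCLES = 40     # assumed latency, for illustration only
    MISS_CYCLES = 200   # assumed latency, for illustration only

    def __init__(self):
        self.lines = set()

    def flush(self, line):
        self.lines.discard(line)

    def access(self, line):
        """Touch a line and return the simulated access latency."""
        latency = self.HIT_CYCLES if line in self.lines else self.MISS_CYCLES
        self.lines.add(line)
        return latency


def victim(cache, secret_bit):
    # The victim touches one of two shared lines depending on a secret bit.
    cache.access(0x300 if secret_bit else 0x200)


def flush_reload(cache, secret_bit, threshold=100):
    """Recover the victim's secret bit from simulated reload timings."""
    probes = (0x200, 0x300)
    for line in probes:                 # 1. flush the monitored lines
        cache.flush(line)
    victim(cache, secret_bit)           # 2. let the victim execute
    timings = {line: cache.access(line) for line in probes}  # 3. reload and time
    # A fast reload means the victim brought that line back into the cache.
    return 1 if timings[0x300] < threshold else 0


print(flush_reload(ToyCache(), secret_bit=1))
```

Prime+Probe would invert the final test in this model: the attacker looks for a slow access to its own lines rather than a fast access to a shared one.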

Spectre Attacks
From a security point of view, speculative execution along a mispredicted path may lead the processor to execute code that should never have run. Although the roll-back mechanism guarantees that all the state in the processor core modified by speculative execution is restored when the prediction fails, side channels such as the caches may still retain some modifications, which can be used to leak sensitive information.
In a Spectre attack, speculative memory reads from cacheable locations are performed beyond an unresolved branch, and the results of those reads are themselves used to form the addresses of further speculative reads. The adversary can then measure the effect of the second speculative allocation within the cache, which leaks sensitive information. According to the research, Spectre attacks can be mounted using both native code and JavaScript.
There are two variants of Spectre attacks. Variant 1 exploits the speculation of conditional branches and can be used to bypass software sanitization of offsets. Figure 1 shows a code example of Variant 1. During speculative execution, the processor may mispredict the direction of the conditional branch and speculatively execute the incorrect code path. After the speculative execution, the cache line holding arr2->data[index2] remains in the cache. The attacker can then determine whether the value of index2 was 0x200 or 0x300 by timing loads of arr2->data[0x200] and arr2->data[0x300], which reveals whether arr1->data[untrusted_offset]&1 is 0 or 1.
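The leak described above can be mimicked in a small simulation. The sketch below is a toy model under stated assumptions: the `mispredict` flag stands in for a trained branch predictor, the cache is modelled as a set of touched indices, and the memory layout and secret values are hypothetical. It does not perform real speculation.

```python
def gadget(memory, arr1_length, untrusted_offset, cached_lines, mispredict):
    """Toy model of the bounds-checked double read.

    'mispredict' forces the body to run even when the bounds check fails,
    standing in for speculative execution past the conditional branch.
    """
    in_bounds = untrusted_offset < arr1_length
    if in_bounds or mispredict:
        value = memory[untrusted_offset]          # may read past arr1 when speculating
        index2 = ((value & 1) * 0x100) + 0x200    # 0x200 or 0x300
        cached_lines.add(index2)                  # cache footprint survives the roll-back
    # Nothing is returned: architecturally, an out-of-bounds result is discarded.


def leak_bit(memory, arr1_length, offset):
    """Recover the least significant bit of memory[offset] via the cache footprint."""
    cached_lines = set()
    gadget(memory, arr1_length, offset, cached_lines, mispredict=True)
    # A real attacker would time loads of both lines; here we test membership.
    return 1 if 0x300 in cached_lines else 0


# Hypothetical layout: 16 bytes of arr1 data followed by two secret bytes.
memory = [0] * 16 + [0x53, 0x52]
print(leak_bit(memory, 16, 16), leak_bit(memory, 16, 17))
```

Repeating the procedure with different bit masks in the gadget would recover the remaining bits of each out-of-bounds byte.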
Variant 2 of Spectre targets indirect branches. To reduce the branch penalty, modern processors introduce the BTB, a branch-prediction cache that stores the predicted address of the next instruction after a branch. Prior research [21] shows that attacks can be performed by abusing the BTB, since it does not take the privilege level into account. Because a BTB entry is selected by only part of the branch instruction address, an unprivileged attacker can create a branch that collides with the victim's entry. The BTB can thus be trained to mispredict an indirect branch to an address that points to code chosen by the attacker. Although the speculative execution is reverted, its effects on the cache may be retained and can be used to leak sensitive information.
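The collision can be illustrated with a toy BTB. The sketch assumes a direct-mapped table indexed by the low 8 bits of the branch address with no privilege tag; real indexing functions are more complex and undocumented, and all addresses below are hypothetical.

```python
class ToyBTB:
    """Toy direct-mapped BTB indexed by the low bits of the branch address."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.entries = {}

    def train(self, branch_addr, target):
        # Executing a branch records its target under a partial-address index.
        self.entries[branch_addr & self.mask] = target

    def predict(self, branch_addr):
        # No privilege level is checked: any branch with the same index matches.
        return self.entries.get(branch_addr & self.mask)


VICTIM_BRANCH = 0xFFFF800000123440    # hypothetical kernel indirect branch
ATTACKER_BRANCH = 0x400040            # hypothetical user branch with the same low bits
GADGET = 0xFFFF800000ABCDEF           # hypothetical target chosen by the attacker

btb = ToyBTB()
btb.train(ATTACKER_BRANCH, GADGET)    # attacker trains the predictor in user mode
print(hex(btb.predict(VICTIM_BRANCH)))  # the victim's branch is predicted to the gadget
```

In this model the victim never executed a branch to `GADGET`, yet its prediction points there because both branches share a table index.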

Existing Mitigations
Since Spectre is rooted in the hardware itself, the final solution must come from updated processor designs.
Mitigating the conditional-branch variant (Variant 1) requires halting speculation on potentially sensitive execution paths, which typically exist in JIT engines. A static analysis tool can help find potentially vulnerable patterns in the code. Once a pattern is located, a barrier can be inserted to stop speculation. On x86 systems, the LFENCE instruction can serve as this barrier: it serializes execution and stops younger instructions from executing, even speculatively, before older instructions have retired [22,23]. AMD also proposed a method that creates a data dependency to avoid speculative execution [23]. Dong et al. [24] proposed an instrumentation-based bit-masking approach using the Intel MPX mechanism to defend against in-kernel speculative side channels. ARM suggests inserting a DSB SYS followed by an ISB to prevent speculation on existing systems, and proposes a new instruction, CSDB, with better performance for its next-generation architecture [25].
Indirect branch poisoning (Variant 2) is considered more challenging to mitigate. It is widely believed that new hardware mechanisms are required to fix the issue at minimum cost. Intel suggests a new interface between the processor and system software that allows system software to prevent an attacker from controlling the victim's indirect branch prediction; it requires both updated system software and microcode [22]. AMD defined additional steps that combine processor microcode updates and operating system patches to mitigate the threat [23]. In addition, it introduces the Indirect Branch Prediction Barrier (IBPB), Indirect Branch Restricted Speculation (IBRS) and Single Thread Indirect Branch Prediction mode (STIBP) for the next generation of processors to fix the issues [26]. Because there are many implementations of the ARM architecture, only some of them are affected, and no generic mitigation applies to all ARM processors. Nevertheless, many ARM processors have implementation-specific controls that can be used either to disable branch prediction or to invalidate the branch predictor [25]. Besides the hardware-side mitigations, there is also a pure software mitigation of Variant 2 called Retpoline [8,9], developed by Google, which is said to have a negligible observed impact on performance. In the following sections, we analyze it in detail.

RETPOLINE TECHNIQUES
Retpoline is a portmanteau of "return" and "trampoline". In this section, we first describe the concept of the trampoline technique. We then analyze the implementation details of Retpoline and how it protects the Linux operating system from Variant 2.

Trampoline
A trampoline, also known as a thunk [27], is a small piece of code that is called as a function, does a small amount of work, and then jumps to another location (Figure 2). Unlike an ordinary function call, it does not return to its caller after execution, as shown in Figure 3. If the jump target of the trampoline is a function, the return of that function goes directly to the trampoline's caller. Trampolines can be used for calling-convention translation, virtual function translation or dynamic closures.

Retpoline: An Indirect Branch Trampoline
With speculation, the instruction sequence executed after an indirect branch depends on two factors: the final branch target address and the result of the branch predictor. Among all indirect branch instructions, the function return instruction has a unique character. Recall that a function return is itself an indirect branch. Unlike ordinary indirect branches, the speculation path of a function return is determined not by itself but by the paired function call. It is generally assumed that a function returns to the address following the function call instruction. Thus, the predicted target of a function return is determined by the function call and is recorded separately in the return stack buffer (RSB) [28], which means that BTB collision poisoning cannot affect it. However, the actual destination of a function return is taken from the return address on the stack, which software can overwrite before the return executes.
Retpoline sequences are a software construct that allows indirect branches to be isolated from speculative execution. A Retpoline utilizes this property of the function return to provide controlled speculative execution for indirect branches. It relies on the RSB, which is filled with safe targets on entry to a privileged mode, to control speculation. The RSB is located in the dispatch stage: it records the return address when a CALL instruction is dispatched and supplies the predicted return address when a RET instruction is dispatched. As shown in Figure 3, Retpoline first invokes a CALL to set_up_target, which sets the prediction target of RET to capture_spec. It then overwrites the return address on the stack with the content of r11 and dispatches RET. Before the final branch is completed by RET, the processor speculatively executes the loop at capture_spec.
Table 1. UnixBench test suite.
Test: Description
File Copy: Measures the rate at which data can be transferred from one file to another, using various buffer sizes.
Pipe Throughput: The number of times per second a process can write data to a pipe and read them back.
Pipe-based Context Switching: Measures the number of times two processes can exchange an increasing integer through a pipe.
Process Creation: Measures the number of times a process can fork and reap a child that immediately exits.
Shell Scripts: Measures the number of times per minute a process can start and reap a set of eight concurrent copies of a shell script, where the shell script applies a series of transformations to a data file.
System Call Overhead: Estimates the cost of entering and leaving the operating system kernel.

Protecting Linux with Retpoline
Protecting against Spectre Variant 2 with Retpoline means replacing indirect branch instructions with the Retpoline construction. Converting all indirect branches in the kernel requires compiler support [29,30], where the compiler backend emits a Retpoline instead of an indirect branch instruction. In addition, some fragments of the kernel are implemented in assembly. These include system call entries, IRQ inline assembly, hypercalls and so on, which form the basic boundary for context switches. Even without compiler support, converting these fragments still provides a minimal protection against Variant 2 attacks in the kernel.
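The construction that replaces each indirect branch can be modelled in a few lines. The sketch below is a toy simulation, not the actual assembly sequence: the RSB and the in-memory stack are plain lists, the labels follow the names set_up_target and capture_spec used in the text, and the target names in the usage line are hypothetical.

```python
CAPTURE_SPEC = "capture_spec"   # label of the harmless spin loop


def retpoline_branch(r11_target, btb_prediction):
    """Toy model of a Retpoline that branches to the address held in r11.

    Returns (speculative_target, architectural_target). 'btb_prediction' is
    what a poisoned BTB would have predicted; it is deliberately never used.
    """
    rsb = []     # return stack buffer: supplies predictions for RET
    stack = []   # architectural stack in memory

    # CALL set_up_target: the return address (capture_spec) is pushed on the
    # stack and recorded in the RSB.
    stack.append(CAPTURE_SPEC)
    rsb.append(CAPTURE_SPEC)

    # set_up_target: overwrite the return address on the stack with r11.
    stack[-1] = r11_target

    # RET: the prediction comes from the RSB, the real target from the stack.
    speculative_target = rsb.pop()
    architectural_target = stack.pop()
    return speculative_target, architectural_target


print(retpoline_branch("real_handler", btb_prediction="attacker_gadget"))
```

In this model the speculative path always lands in the capture loop regardless of what the BTB holds, while the architectural path still reaches the intended target.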

Methodology
To further understand Retpoline, we ran several benchmarks to evaluate its impact on performance. We chose a mainstream Intel platform (Core i7-3770 @ 3.40 GHz) as our target hardware. We fetched the v4.15 Linux kernel, the latest at the time, from the Git repository and built it with GCC 7.3.0, with and without the Retpoline configuration. Under these two kernels, we ran various tests covering performance from I/O to userspace applications.
We first ran UnixBench to get a general view of system performance. UnixBench provides a basic indicator of the performance of a Unix-like operating system and is widely used to evaluate system performance. Table 1 shows the details of the UnixBench test suite.
For I/O performance, dbench was chosen to evaluate storage workloads, while tbench was used to measure the network stack. Dbench is an emulation of the file system load of the Netbench benchmark. It issues the same I/O calls that a Samba server would produce. Dbench produces only the file system load and makes no networking calls. To simulate networking loads, tbench was developed along with dbench. It produces only the TCP and process load, making the same socket calls that the Samba server would make under a Netbench load. This benchmark is a good simulation of a real server setup, in which a large number of files of different sizes in different directories have to be created, written, read and deleted. Both tests were run ten times, and the averages were used for comparison.
Finally, we ran SPECCPU2006 under the two kernels to survey the impact of in-kernel Retpoline on pure computation in userspace. SPECCPU2006 is an industry-standard benchmark designed to compare the performance of servers. The workloads do not require extremely large amounts of memory and perform essentially no disk or network I/O. It measures performance on two basic workloads: one based on compute-intensive integer math (Table 2) and the other based on compute-intensive floating-point math (Table 3). In general, the integer workload maps to the performance of business applications found in the data center, and the floating-point workload maps to scientific calculations more often found in high-performance computing environments.
In addition, we rebuilt SPECCPU2006 with Retpoline support in userspace to evaluate the performance impact there, since Spectre may be exploited to attack not only the kernel but also userspace applications such as JIT engines. To build userspace programs with Retpoline, the option "-mindirect-branch=thunk-inline" should be used instead of "-mindirect-branch=thunk-extern", so that GCC inserts its own Retpoline implementation rather than linking to the external implementation in the Linux kernel.

Results
For comparison, we normalized all the results. Figure 4 shows the results of UnixBench, which provides a basic indicator of the performance of a Unix-like system in various aspects. Not all the micro-benchmarks are affected by the Retpoline kernel. Tests that target pure computation, such as Dhrystone 2 and Whetstone, show almost the same results under either kernel. Tests that depend on kernel functions, such as file copy, pipe operations and system calls, suffer from the Retpoline kernel to various degrees.
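The normalization can be sketched as follows. This is one plausible reading, assuming higher-is-better scores, averaging over repeated runs and dividing by the baseline mean; the function names are hypothetical and the sample values are made up for illustration, not measurements from this paper.

```python
from statistics import mean


def normalize(baseline_runs, retpoline_runs):
    """Mean Retpoline score divided by mean baseline score (1.0 means no change)."""
    return mean(retpoline_runs) / mean(baseline_runs)


def regression_percent(normalized):
    """Regression relative to the baseline, in percent; positive means slower."""
    return (1.0 - normalized) * 100.0


# Hypothetical throughput samples in MB/s (higher is better).
baseline = [1000.0, 1010.0, 990.0]
retpoline = [930.0, 940.0, 920.0]

score = normalize(baseline, retpoline)
print(round(score, 3), round(regression_percent(score), 1))
```

For metrics where lower is better, such as latency, the ratio would have to be inverted before the results are compared on the same axis.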
As shown in Figure 5, the Retpoline kernel affects the I/O subsystem of the kernel. While there is little overhead on storage, the regression on the network is more pronounced.
We ran SPECCPU2006 three times under different setups. First, it was run on the 4.15.0 Linux kernel without any Retpoline options, and the result was used as the baseline. Second, we ran the benchmark on the 4.15.0 Linux kernel with Retpoline enabled (in-kernel Retpoline). Finally, we rebuilt the SPECCPU2006 test set with the Retpoline compiler options for userspace applications and ran it under the Retpoline-enabled kernel. As Figure 6 shows, in-kernel Retpoline has little impact on the SPECCPU results, which indicates that compute-intensive applications can run unaffected with the mitigation. However, applying Retpoline in userspace, which may be done to harden Spectre-affected applications, can reduce performance in certain areas.
While there is little regression on floating-point applications, userspace Retpoline greatly impacts certain integer tests. Figures 7 and 8 show the details of the individual SPECCPU2006 tests. For SPECint2006, 400.perlbench is affected most, retaining less than 60% of its baseline performance when Retpoline is applied in userspace. Considering that interpreters and JIT engines are among the targeted software, Retpoline should be adopted carefully as a mitigation.

Userspace Network I/O
A previous report [31] shows that networking-related CPU overheads for a kernel-based TCP stack can be up to 40% for application context switching. According to the benchmark results, there is a significant impact on network I/O but negligible regression on the filesystem and storage, because protocol processing in the kernel uses indirect branches more heavily than the storage system does. Userspace network stacks have been proposed to improve cache performance on multi-core systems [32] while avoiding extra data copies and boundary crossings. Since we can apply Retpoline to the kernel and leave the userspace network driver speculatively executing indirect branches as usual, performance can benefit from a userspace network stack. Various frameworks such as Netmap provide efficient packet reception and transmission mechanisms to and from userspace that bypass the kernel stack's packet processing. These frameworks reduce or remove various packet processing costs such as per-packet dynamic memory allocations, system call overheads and memory copies to userspace.
When a packet is processed by the kernel, it first arrives at the network card's interrupt handler. As soon as the driver acknowledges the interrupt controller, the remainder of the work may well be performed in a tasklet. Later, the core network processing happens in another software interrupt, which consists of different stages across multiple layers. In the end, the data are copied to userspace, resulting in a context switch. A whole network transaction covers a series of packet exchanges through multiple system calls, which involve not only I/O functions but also security audits. This process passes through many more indirect branches than other workloads and therefore suffers more regression from Retpoline, which prevents speculation on indirect branches. We therefore propose using a userspace network stack to alleviate this regression. Netmap maps NIC buffers directly into userspace memory and avoids the memory copy between the kernel and userspace applications. The application can thus access a packet simply by reading from the buffer in the mapped region. Once the buffers are ready for the NIC, the application initiates the NIC's DMA operations through ioctl. To verify our proposal, we measured the throughput of Netmap under the two kernels. Figure 9 shows the network throughput with the Netmap network stack. Netmap with in-kernel Retpoline achieves almost the same throughput as the non-Retpoline system.

DISCUSSIONS
As we saw in the previous section, Retpoline does not cause severe regression to the system as a whole. However, it does impact performance in certain circumstances, especially when applied to userspace programs. Since Spectre affects not only the kernel, userspace programs such as JIT engines may also require Retpoline under certain conditions.
A Spectre attack requires two conditions: speculative execution and a side channel. Retpoline belongs to the class of techniques that prevent speculative execution. As we have seen in this paper, full protection with Retpoline also causes some regression. One way to reduce the regression is to restrict the scope of the mitigation, for example by limiting Retpoline to the kernel, or even to a few key context-switching paths. In the same vein, cutting off the side channel may have a similar effect. Since Spectre relies on the cache side channel, and caches play an important role in the performance of modern CPUs, it would be unacceptable to shut down this side channel entirely to prevent attacks. However, if we can identify that only a small portion of code is vulnerable and that it is not on the critical path, shutting down the side channel for that code can also be an option.

CONCLUSION
Exploiting speculative execution to perform side-channel attacks gives rise to a new type of threat, which can be used to leak sensitive information from a privileged domain previously regarded as safe. Since speculation is widely used in modern superscalar processors, these vulnerabilities are found in many popular processors, including those from Intel and AMD and some implementations of ARM. While hardware vendors are trying to fix the issues in their microarchitecture designs, software mitigations remain desirable since they are much cheaper and can be applied to legacy systems. In this paper, we evaluated one of the software mitigations, Retpoline, which protects against attacks that exploit indirect branch speculation. We found that in-kernel Retpoline affects the performance of existing software, but the impact varies depending on how applications interact with the kernel. To alleviate the impact, we proposed using a userspace network stack and verified the proposal with the Netmap userspace packet I/O framework. In addition, we observed a large performance regression when applying Retpoline to some userspace applications, such as the Perl interpreter, which are thought to be targets of the new type of attack.