<p style="text-align: center;"><sub><i>Original Date Published: September 8, 2022</i></sub></p>

![[firecracker.png]]

This blog post covers attacking a vulnerability in [Firecracker](https://firecracker-microvm.github.io/), an open source micro-virtual machine ([microVM](https://web.archive.org/web/20221001182026/https://qemu.readthedocs.io/en/latest/system/i386/microvm.html)) monitor written in the [Rust programming language](https://www.rust-lang.org/). It was developed for use in [AWS Lambda](https://aws.amazon.com/lambda/), a serverless software-as-a-service (SaaS) application hosting service. Firecracker is also used for AWS’ similar [Fargate](https://aws.amazon.com/fargate/) service, which provides a way to run containers without having to manage servers for container orchestration. Due to the risks that are introduced via [multi-tenancy](https://www.techtarget.com/whatis/definition/multi-tenancy), Firecracker was intentionally designed with security in mind.

In this post, we’ll cover the following topics:

- What is Firecracker?
- Why attack it?
- How does it work?
- Root cause analysis of a memory corruption vulnerability, [CVE-2019-18960](https://nvd.nist.gov/vuln/detail/CVE-2019-18960)
- Exploit primitives and analysis of exploitability
- Reflections and takeaways as they relate to security

I had no knowledge of Firecracker (or Rust) prior to conducting this research. My hope is that this post will be useful for those wanting to learn about virtualization, Firecracker, and KVM, and that it provides some clarity on the various layers of virtualization and VM escape exploitation.

# Firecracker: What is it?

Firecracker is an open source virtual machine monitor (VMM) created and maintained by Amazon Web Services (AWS).
Per Amazon’s website, Firecracker is a “new virtualization and open source technology that enables service owners to operate secure multi-tenant container-based services by combining the speed, resource efficiency, and performance enabled by containers with the security and isolation offered by traditional VMs.” [ [1]](https://aws.amazon.com/about-aws/whats-new/2018/11/firecracker-lightweight-virtualization-for-serverless-computing/).

Firecracker is comparable to [QEMU-KVM](https://www.qemu.org/); they are both VMMs that utilize [KVM](https://web.archive.org/web/20221001182026/https://ubuntu.com/blog/kvm-hyphervisor), a hypervisor built into the Linux kernel. Firecracker was designed to prioritize security and efficiency for serverless workloads. This led to some key design differences from QEMU. Firecracker is much less flexible than QEMU. In order to minimize complexity and attack surface, Firecracker forgoes non-essential functionality. QEMU, on the other hand, has had [many vulnerabilities](https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=qemu) arise from complex device implementations.

## Why Attack Firecracker?

Technology like Firecracker is of particular interest to companies that run a multi-tenant system with customer-provided code execution. In such systems, it is of utmost importance that multi-tenant boundaries cannot be violated. Firecracker is used by AWS to isolate runtimes from each other.

Before deciding to use Firecracker in production, it is a good idea to conduct a security review of the product to evaluate whether it is appropriate for the use case. Offensive-driven research helps evaluate which hardening measures are effective and worthwhile to implement in an environment. Enforcing more constraints on the application (such as execution time, resource usage, credential limitations, the files available to it, etc.) can reduce the attack surface. This research came as a result of such a security review.

## How Does it Work?
First, I’ll briefly explain how a virtual machine monitor (VMM) uses KVM in general, and then get into the specifics of Firecracker.

### KVM

KVM (Kernel-based Virtual Machine) is a [type-1 hypervisor](https://medium.com/teamresellerclub/type-1-and-type-2-hypervisors-what-makes-them-different-6a1755d6ae2c) built into the Linux kernel (for `x86`) that allows a host to run multiple isolated virtual machines. It consists of two loadable kernel modules. The first, `kvm.ko`, provides the virtualization infrastructure. The second is a processor-specific module (for either Intel or AMD), which takes a slice of the host’s physical CPU and maps it directly to the guest’s virtual CPU. Each guest VM runs as a regular Linux process on the host.

KVM in the kernel exposes a low-level [API](https://www.kernel.org/doc/html/latest/virt/kvm/api.html) to user space processes via [`ioctl`s](https://man7.org/linux/man-pages/man2/ioctl.2.html) to the `/dev/kvm` device. Through this API, the VMM user space process can create new VMs, assign vCPUs and physical memory, and intercept I/O or memory accesses to provide the guest access to emulated or virtualization-aware hardware devices [ [2]](https://googleprojectzero.blogspot.com/2021/06/an-epyc-escape-case-study-of-kvm.html).

![[63190c1ca347abfefb2a7770_firecracker_1 (Copy) (1).svg]]
<p style="text-align: center;"><sub><i>Illustration of KVM virtualization</i></sub></p>

### Firecracker Design

Firecracker is a VMM that uses the Linux kernel’s KVM virtualization infrastructure to provide Linux and OSv microVMs on Linux hosts. On the host, there is one Firecracker process per microVM.

There were some important design decisions with respect to security. The goal of Firecracker is to be a minimal VMM, so it only provides a _limited_ number of emulated devices.
These devices are: block storage (`virtio-blk`), network (`virtio-net`), vsock (`virtio-vsock`), balloon driver (`virtio-balloon`), a serial console, and a partial `I8042` keyboard controller used only to stop the VM [ [4]](https://www.talhoffman.com/2021/07/18/firecracker-internals/). For comparison, QEMU has support for over 40 emulated devices, from which [vulnerabilities](https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=qemu) are reported often.

Storage is done via a block device rather than file system passthrough, to avoid giving the guest access to the host’s Linux kernel filesystem code, which is complex (and often has exploitable bugs). Firecracker also exposes a `REST`-based configuration API over a UNIX domain socket [ [3]](https://assets.amazon.science/96/c6/302e527240a3b1f86c86c3e8fc3d/firecracker-lightweight-virtualization-for-serverless-applications.pdf).

The Firecracker `virtio-vsock` design, which supports host-guest communication via socket, is also security conscious. The standard approach is to use `vhost` (as QEMU does), which requires a guest to pass data directly to a `vhost` kernel module on the host. Instead, Firecracker has its own `vsock` device as a backend to avoid exposing this additional attack surface. I will describe this design in more detail in the next section.

Firecracker can be further constrained using the [jailer](https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md) program, which applies a set of sandboxing restrictions (such as [`seccomp`](https://en.wikipedia.org/wiki/Seccomp)) to the process.

![[631916976b4fff56992d7feb_Firecracker_2 (2).svg]]
<p style="text-align: center;"><sub><i>Illustration of Firecracker's Design</i></sub></p>

#### virtio-vsock

The vulnerability we’ll discuss is found in the `vsock` implementation of Firecracker, so this section explains that design in more depth.
`virtio-vsock` is a guest/host communication device that allows applications on the guest and host to communicate via socket [ [5]](https://wiki.qemu.org/Features/VirtioVsock#:~:text=virtio-vsock%20is%20a%20host,-agent%20or%20SPICE%20vdagent). The standard way of implementing `vsock`, as done by QEMU, is to use the `vhost-vsock` kernel module. The [`vhost-vsock`](https://chromium.googlesource.com/chromiumos/platform2/+/9e91613d2da1b3d6cfb1c77681444e688ce99cf4/vm_tools/docs/vsock.md) kernel module provides `virtio` device emulation in the kernel, handling the communication with the guest [ [6]](https://stefano-garzarella.github.io/posts/2019-11-08-kvmforum-2019-vsock/). This allows the guest to pass untrusted data directly to a module running in the host’s kernel.

Firecracker, on the other hand, emulates the `virtio-vsock` device itself in user space, implementing the device model over [MMIO](https://en.wikipedia.org/wiki/Memory-mapped_I/O). The `vsock` device is exposed to the host via a `UNIX` socket. Firecracker mediates communication between an `AF_VSOCK` socket (on the guest end) and an `AF_UNIX` socket (on the host end) [ [7]](https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md). This solution has the advantage of avoiding a new kernel attack surface, and it also depends less on host kernel features like `vhost`.

![[63191e3fda2ebff6cb726dca_Firecracker_3` (2).svg]]
<p style="text-align: center;"><sub><i>Illustration of Firecracker's virtio-vsock Design</i></sub></p>

# The Vulnerability

There have only been three CVEs registered for Firecracker since its creation, and only one that can potentially lead to RCE on the host. In addition to being an RCE vulnerability, I chose to look at [CVE-2019-18960](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-18960) because it is a memory corruption vulnerability.
Being completely new to Rust, I thought it would be worthwhile to examine how memory corruption vulnerabilities can still occur in a memory safe language.

The vulnerability is found in the `vsock` device implementation of Firecracker. As explained in a previous section, Firecracker implements the `virtio-vsock` device model over MMIO. That means that Firecracker reads directly from the guest’s memory, which also resides in Firecracker’s own process memory.

When a VM is created, Firecracker allocates the memory used for the guest’s RAM [using mmap](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/memory_model/src/mmap.rs#L59). This is represented by a vector of [MemoryRegion](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/memory_model/src/guest_memory.rs#L40) structures.

```
pub struct MemoryRegion {
    mapping: MemoryMapping,
    guest_base: GuestAddress,
}
```

Here `guest_base`, a `GuestAddress` structure, stores a 64-bit base physical address on the guest. The `MemoryMapping` structure, `mapping`, stores a pointer to the associated memory in the Firecracker process along with its size.

Firecracker performs I/O on the vsock device using the standard `virtio` [interface](https://developer.ibm.com/articles/l-virtio/). The drivers running in the guest’s kernel communicate with Firecracker through shared buffers. The guest allocates one or more buffers representing the request, registers these buffers with a descriptor table (an array), and signals that the buffers are ready to be consumed via a ring data structure (called a `virtqueue`). Each entry of the descriptor table contains a descriptor, which holds information about the guest-provided buffer [ [9]](https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html#fn:footnote-virtio).
```
struct Descriptor {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}
```

If specified in `flags`, descriptors can be chained together, with `next` containing the descriptor table index of the chained descriptor. In `virtio-vsock`, the buffers in a descriptor chain are used to construct a `vsock` packet.

Something to note at this point: the buffer information in the descriptor comes from the guest, and it should be treated as untrusted. When creating a new [`DescriptorChain`](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/devices/src/virtio/queue.rs#L38), the function [`is_valid`](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/devices/src/virtio/queue.rs#L108) is called. This is where `addr` and `len` are checked to make sure the buffer received from the guest is valid.

```
fn is_valid(&self) -> bool {
    !(self
        .mem
        .checked_offset(self.addr, self.len as usize)
        .is_none()
        || (self.has_next() && self.next >= self.queue_size))
}
```

Let’s take a look at the [`checked_offset`](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/memory_model/src/guest_memory.rs#L126) function.

```
/// Returns the address plus the offset if it is in range.
pub fn checked_offset(&self, base: GuestAddress, offset: usize) -> Option<GuestAddress> {
    if let Some(addr) = base.checked_add(offset) {
        for region in self.regions.iter() {
            if addr >= region.guest_base && addr < region_end(region) {
                return Some(addr);
            }
        }
    }
    None
}
```

On line 3 of the code snippet above, `base` is added to `offset` (the size of the I/O buffer in this case) with a checked addition, which fails if the sum overflows. If that check passes, the guest’s `MemoryRegion`s are iterated through to see if the resulting address falls within a valid region. However, this check is not sufficient.
There are two problems that could occur: the base and resulting address may belong to two different regions, and the base address may not even lie in a valid region.

![[63192c1437e439d55179d75f_Firecracker_4 (3).svg]]
<p style="text-align: center;"><sub><i>Illustration of the first bug corresponding to CVE-2019-18960</i></sub></p>

Now, for this bug to be exploitable, we need a way for the out-of-bounds buffer to be used. That is where `vsock` comes in. Recall that `vsock` packets are constructed from descriptor chains. Let’s look at the [`VsockPacket`](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/devices/src/virtio/vsock/packet.rs#L93) structure and how it is [created](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/devices/src/virtio/vsock/packet.rs#L106). The first descriptor buffer in a descriptor chain contains the packet header, and the one that follows contains the packet data. Pointers to both the header and the data are stored as raw pointers, along with the data buffer size, inside the `VsockPacket` structure.

```
pub struct VsockPacket {
    hdr: *mut u8,
    buf: Option<*mut u8>,
    buf_size: usize,
}
```

Both pointers are copied into the structure after being returned from [`get_host_address`](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/memory_model/src/guest_memory.rs#L386).

```
let mut pkt = Self {
    hdr: head
        .mem
        .get_host_address(head.addr)
        .map_err(VsockError::GuestMemory)? as *mut u8,
    buf: None,
    buf_size: 0,
};

pkt.buf_size = buf_desc.len as usize;
pkt.buf = Some(
    buf_desc
        .mem
        .get_host_address(buf_desc.addr)
        .map_err(VsockError::GuestMemory)? as *mut u8,
);
```

The `get_host_address` function takes a physical address from the guest and returns the corresponding address in the Firecracker process’ memory.
```
pub fn get_host_address(&self, guest_addr: GuestAddress) -> Result<*const u8> {
    self.do_in_region(guest_addr, 1, |mapping, offset| {
        // This is safe; `do_in_region` already checks that offset is in
        // bounds.
        Ok(unsafe { mapping.as_ptr().add(offset) } as *const u8)
    })
}
```

The base address of the memory region and the offset of the guest address from that base are calculated in [`do_in_region`](https://github.com/firecracker-microvm/firecracker/blob/effaab05e4b02b25c578273d966e04c98d2cf2e1/memory_model/src/guest_memory.rs#L434), and their sum is returned as the resulting pointer.

On line 5 of the code snippet above, there is an `unsafe` block. In Rust, a block of code can be prefixed with the `unsafe` keyword to permit operations such as dereferencing a raw pointer, reading or writing to a mutable static variable, accessing a field of a union (other than to assign it), or calling an `unsafe` function [ [10]](https://doc.rust-lang.org/reference/unsafety.html). The comment in the snippet states that the operation in the `unsafe` block is safe to allow because `do_in_region` checks that the offset is in bounds. Let’s take a look:

```
fn do_in_region<F, T>(&self, guest_addr: GuestAddress, size: usize, cb: F) -> Result<T>
where
    F: FnOnce(&MemoryMapping, usize) -> Result<T>,
{
    for region in self.regions.iter() {
        if guest_addr >= region.guest_base && guest_addr < region_end(region) {
            let offset = guest_addr.offset_from(region.guest_base);
            if size <= region.mapping.size() - offset {
                return cb(&region.mapping, offset);
            }
            break;
        }
    }
    Err(Error::InvalidGuestAddressRange(guest_addr, size))
}
```

As seen above, a bounds check is performed. The function takes a parameter, `size`, and checks whether a buffer of that size fits within the region. This ensures that the pointer being returned has space inside the `MemoryRegion` for the expected amount of memory that will be accessed.
Now, referring back to the calling function, `get_host_address`, note that `1` is always passed in as the size, instead of the actual size of the corresponding buffer. This means that as long as the buffer address starts in a valid region, the buffer can overrun the region if its size is large enough. Due to the first check in `checked_offset`, though, the overrun has to end in a valid memory region to get this far.

![[63193c5a745f32dd9a020ae8_Firecracker_5 (2).svg]]
<p style="text-align: center;"><sub><i>Illustration of the second bug corresponding to CVE-2019-18960</i></sub></p>

This is interesting, because without this second bug, the previously discussed bug would not be exploitable. After a `VsockPacket` is constructed, the raw pointer stored in `buf` is used to read and write packet data in order to manage communications with the UNIX socket on the host. This can be used to obtain a read/write primitive outside of the guest’s memory space within the Firecracker process.

# Exploit Primitives

To exploit this vulnerability, an attacker has to have kernel execution in a guest VM, because the attack has to execute at the level of the guest’s `virtio-vsock` driver. The first step of writing an exploit for this vulnerability is to write a kernel module to trigger it. The module has to register an invalid buffer with the `vsock` device. This is done by writing an invalid address and length combination in a descriptor table entry.

Before beginning to write code, I wanted to first look at what exploit primitives can theoretically be constructed with the vulnerability. I had some concerns:

- The out-of-bounds memory that can be read or written is limited to a specific area.
- Runtime mitigations in Rust are restrictive.

The first step is to investigate the area of memory that can be controlled. To trigger the vulnerability, there must be more than one `MemoryRegion` associated with a guest’s memory space.
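
To make the interaction of the two checks concrete, here is a toy model of the validation logic described in the previous section. It is a minimal sketch and not Firecracker's code: `Region`, `buggy_accepts`, `strict_accepts`, and `two_region_layout` are hypothetical names, the size of the second region is arbitrary, and the model ignores host mappings and `vsock` packet size limits entirely. It only shows that a descriptor starting near the end of the first region and ending inside the second passes both flawed checks, while a check against the full buffer length rejects it.

```rust
/// Hypothetical, simplified stand-in for a guest memory region.
struct Region {
    guest_base: u64,
    size: u64,
}

impl Region {
    fn end(&self) -> u64 {
        self.guest_base + self.size
    }

    fn contains(&self, addr: u64) -> bool {
        addr >= self.guest_base && addr < self.end()
    }
}

/// Models the flawed `checked_offset`: only `base + len` must land in
/// some region; `base` itself and the span in between are not examined.
fn end_lands_in_some_region(regions: &[Region], base: u64, len: u64) -> bool {
    match base.checked_add(len) {
        Some(end) => regions.iter().any(|r| r.contains(end)),
        None => false,
    }
}

/// Models `do_in_region`: `addr` must start in a region and `size`
/// bytes must fit between `addr` and the end of that same region.
fn fits_in_one_region(regions: &[Region], addr: u64, size: u64) -> bool {
    regions
        .iter()
        .find(|r| r.contains(addr))
        .map_or(false, |r| size <= r.size - (addr - r.guest_base))
}

/// The vulnerable combination: the end address check plus a start
/// address check that is always performed with a size of 1.
fn buggy_accepts(regions: &[Region], addr: u64, len: u64) -> bool {
    end_lands_in_some_region(regions, addr, len) && fits_in_one_region(regions, addr, 1)
}

/// A stricter check (an assumption, not the actual patch): the whole
/// buffer must fit inside the region that contains its start address.
fn strict_accepts(regions: &[Region], addr: u64, len: u64) -> bool {
    fits_in_one_region(regions, addr, len)
}

/// Two regions separated by a gap below 4 GiB. The second region's
/// size is an arbitrary value chosen for illustration.
fn two_region_layout() -> Vec<Region> {
    vec![
        Region { guest_base: 0, size: 0xD000_0000 },
        Region { guest_base: 0x1_0000_0000, size: 0x1000_0000 },
    ]
}

fn main() {
    let regions = two_region_layout();
    // Starts 0x1000 bytes before the end of the first region and ends
    // 0x1000 bytes into the second one.
    let addr: u64 = 0xD000_0000 - 0x1000;
    let len: u64 = 0x3000_0000 + 0x2000;
    println!("buggy checks accept:  {}", buggy_accepts(&regions, addr, len));
    println!("strict check accepts: {}", strict_accepts(&regions, addr, len));
}
```

Running this prints `true` for the buggy checks and `false` for the strict one. Note that the descriptor is only accepted by the buggy path because a second region exists for the end address to land in.
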
Let’s look at how the regions are created for `x86_64` VMs:

```
const MEM_32BIT_GAP_SIZE: usize = (768 << 20);

/// Returns a Vec of the valid memory addresses.
/// These should be used to configure the GuestMemory structure for the platform.
/// For x86_64 all addresses are valid from the start of the kernel except a
/// carve out at the end of 32bit address space.
pub fn arch_memory_regions(size: usize) -> Vec<(GuestAddress, usize)> {
    let memory_gap_start = GuestAddress(FIRST_ADDR_PAST_32BITS - MEM_32BIT_GAP_SIZE);
    let memory_gap_end = GuestAddress(FIRST_ADDR_PAST_32BITS);
    let requested_memory_size = GuestAddress(size);
    let mut regions = Vec::new();

    // case1: guest memory fits before the gap
    if requested_memory_size <= memory_gap_start {
        regions.push((GuestAddress(0), size));
    // case2: guest memory extends beyond the gap
    } else {
        // push memory before the gap
        regions.push((GuestAddress(0), memory_gap_start.offset()));
        regions.push((
            memory_gap_end,
            requested_memory_size.offset_from(memory_gap_start),
        ));
    }

    regions
}
```

Here we can see that if the guest requires more than `0xD0000000` bytes of memory, a second `MemoryRegion` is created for the remaining memory. I also looked at the `aarch64` implementation, but it’s not possible to trigger the creation of more than one `MemoryRegion` for a VM on that architecture.

With this information, we know what to do: create a buffer descriptor with a physical address lower than the boundary of the first `MemoryRegion` (`0xD0000000`) and provide a length that overruns this boundary. The diagram below shows the basic exploit primitive we can theoretically achieve:

![[631951e88f55ac3b1c57a2fd_Firecracker_6 (3).svg]]
<p style="text-align: center;"><sub><i>Illustration of the exploit primitive that can be gained with CVE-2019-18960</i></sub></p>

## Exploitability

In order to evaluate the exploitability of this vulnerability, we need to investigate what memory can be accessed with the exploit primitive.
To answer this question, I did some debugging from within Firecracker. First, I configured a Firecracker microVM to require enough memory to create two `MemoryRegion`s and printed their addresses during runtime. Below is a screenshot of Firecracker’s memory map after the `MemoryRegion`s have been created for the guest.

![[Pasted image 20230315131349.png]]
<p style="text-align: center;"><sub><i>Memory map indicating the portions of memory an attacker can overwrite</i></sub></p>

Note that the mappings for the two `MemoryRegion`s are contiguous. However, the mapping for the first `MemoryRegion` occurs at a higher address than the mapping for the second. Since our exploit primitive gives us the ability to overflow the mapping for the first `MemoryRegion`, we can overwrite memory at addresses higher than `0x7f1b3f118000` in the Firecracker process\*.

There are some interesting areas of memory, such as the stack, that reside at higher addresses in the process. However, the pages mapped at address `0x7f1b3f11a000` are marked with `PROT_NONE` permissions and act as a guard page. This means that we cannot overwrite onto the stack: if we have to do a contiguous write beginning from within the first `MemoryRegion` mapping, we will [segfault](https://en.wikipedia.org/wiki/Segmentation_fault). This gives us Denial-of-Service (DoS) of the Firecracker process, which isn’t very powerful if the attacker already has guest kernel execution.

I looked further into how `MemoryRegion`s are mapped, and found nothing that would help gain a more favorable allocation. I dumped the limited accessible area of memory at `0x7f1b3f118000-0x7f1b3f11a000` and found it was entirely `NULL` bytes. My inclination is that it is unlikely there is anything of interest there.

Since this vulnerability was patched, the guest memory mapping code that Firecracker uses has been overhauled. Now, “guard” pages are created to surround every guest memory region.
[The guard region is mapped](https://github.com/firecracker-microvm/firecracker/blob/2a5a6bc7155959d73f76f2af15125b7fd2798013/src/vm-memory/src/lib.rs#L27) with `PROT_NONE`, so that any access to this region will cause a `SIGSEGV` segfault. This mitigation protects against the exploitation of the exact type of vulnerability we are trying to exploit here.

While the aforementioned protection hadn’t been implemented at the time this vulnerability was patched, it’s an interesting coincidence that a guard page is inhibiting exploitation. The guard page in this case is being mapped somewhere else, possibly at `ELF` load time. I looked at the memory maps of the other processes on the Firecracker host machine, and they did not consistently have guard (`PROT_NONE`) mappings. To experiment further, I wrote a small Rust program and saw that it did have a guard mapping in its memory map, albeit of a different size. I speculate it comes as a result of some sort of Rust mitigation. This creates a big roadblock for exploitation, as the memory we can overflow into doesn’t contain anything interesting.

At this point I decided to move on, but I have some ideas if I were to continue. Out of curiosity, I would do more analysis to figure out what is creating the mystery guard page. I would also try to see if triggering an offset copy is possible: that is, a way for the `VsockPacket`'s data buffer to be accessed at an offset, so that the access misses the guard page completely. `VsockPacket`s are exchanged to and from Firecracker’s vsock backend, which manages the UNIX socket on the host. I would analyze this part of the code to find other possible primitives. I encourage anyone interested to pick up where I left off on this exploit and share their ideas.

_\*The size of the overflow is restricted to `vsock` packet size limits, among other things_.
# Hardening

While Firecracker’s design is security focused, there are some hardening measures that can be used to further lock down the attack surface. First, limit untrusted code to running with the lowest privileges possible. Additionally, hardening the guest operating system and running a fully patched kernel is crucial. Without guest kernel execution, an attacker has no way to exploit the vulnerability covered in this post.

The primary recommendation from the authors of Firecracker is to use [jailer](https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md), a program designed to isolate the Firecracker process in order to enhance security. In the case of exploiting the discussed vulnerability, a takeover of the Firecracker process yields a restrictive execution environment. An attacker would need to bypass all the restrictions imposed by jailer to escalate privileges and execute outside of the Firecracker process. Read a step-by-step account of what the jailer program does on startup [here](https://github.com/firecracker-microvm/firecracker/blob/main/docs/jailer.md).

Among the things jailer does is load a [`seccomp`](https://en.wikipedia.org/wiki/Seccomp) filter for Firecracker, with a per-thread profile. This means the different threads in the Firecracker process have a different set of system calls that can be called from within their context, depending on the thread’s job. This is nice, but an attacker already in the Firecracker process can trivially hijack another thread that has access to different system calls. Therefore, jailer’s `seccomp` policy should be treated as a union of all of the threads’ allowable system calls.

Currently, `io_uring` system calls are included in Firecracker’s `seccomp` filter. Because it redefines how system calls are executed, `io_uring` offers a seccomp bypass for the supported system calls.
This is because `seccomp` filtering occurs on system call entry after a thread [context switch](https://web.archive.org/web/20221001182026/https://www.geeksforgeeks.org/user-mode-and-kernel-mode-switching/), but system calls executed via `io_uring` do not go through the normal system call entry. Therefore, Firecracker’s `seccomp` policy should be treated as its union with all system calls supported by `io_uring`.

# Security Reflections and Takeaways

These are some of the major security takeaways gleaned from this short research project on exploiting Firecracker:

## On the Kernel

Kernel hardening and attack surface reduction are critical, despite their potential to impose restrictions on use or negatively impact performance. Given a Firecracker vulnerability like the one covered in this post, protecting the guest kernel keeps an attacker from reaching that attack surface. If an attacker did successfully exploit this vulnerability, they would have access to the host and any other VMs executing on that host.

Because of the nature of system call filtering via `seccomp`, `io_uring` still presents a major security disruption in sandboxing. While it seems most appropriate to use an LSM to restrict `io_uring`, that introduces requirements on the host that may be suboptimal. You can read more about `io_uring` in my blog post [here](https://chompie.rip/Blog+Posts/Put+an+io_uring+on+it+-+Exploiting+the+Linux+Kernel).

## On Firecracker Design

The Firecracker team’s decision to forgo `vhost` and implement the back end themselves resulted in a critical vulnerability being introduced. However, the same vulnerability would be much more critical if it were found in the `vhost` kernel code. Due to the relatively small size of the code base, the memory safety of Rust, the limited attack surface, and the newly introduced mitigations, it’s unlikely these types of vulnerabilities will be common or practically exploitable.
## On Rust

Though Rust is a memory safe language, memory corruption vulnerabilities are still possible. Rust uses `&` references, which are like pointers in C ("raw pointers"), but with many restrictions that allow Rust to achieve memory safety [ [11]](https://doc.rust-lang.org/reference/types/pointer.html). However, Rust provides an escape hatch, the `unsafe` keyword, for bypassing these `&` restrictions. This is how Rust programs are able to call into native libraries while still being able to validate the safety of `&` references in other parts of the Rust code. Safe Rust does not permit converting a raw pointer returned from a C library into an `&` reference, because the compiler is unable to validate the safety of the other library. These raw pointers are stored as `*const T` and `*mut T` in Rust, which we see in the vulnerable code snippets in this post. Given that the developer must explicitly tell Rust to skip safety checks within `unsafe` blocks, it is the developer's responsibility to ensure the operations are safe in all possible cases.

Although not all exploitable bugs are memory safety bugs, an interesting project for a vulnerability researcher is to search for `unsafe` blocks in Rust codebases and look for cases where they can be abused. Code comments asserting the safety of these blocks are clues to the assumptions the developer has made, indicating exactly what should be checked. To this end, a researcher might be interested in [cargo-geiger](https://crates.io/crates/cargo-geiger), which can help identify `unsafe` blocks in a codebase as well as in its dependencies.

In a [recent blog post](https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html), the Kani Rust Verifier was used to formally verify the correctness of Firecracker’s `virtio` device code with respect to a simple `virtio` requirement.
The proof is for a property described in the `virtio` [device specification](https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html), and is a requirement for the behavior of the guest’s `virtio` driver. Here, they prove the property is always upheld regardless of malicious device requests from the guest. An interesting experiment would be to repeat the process with all the requirements found in the `virtio` spec, in particular those that apply to the guest’s driver.

## Takeaways for Multi-tenant Architecture

This research was critical to understanding what strategies work best for hardening multi-tenant architecture. Based on this work, the conclusion is that there should be a focus on hardening the guest operating system. This limits an attacker’s ability to exploit the guest kernel, thus cutting off a considerable attack surface. As such, the recommendation remains to restrict the process within the VM to make escalation to the kernel more difficult. This involves leveraging multiple Linux sandboxing primitives, primarily through [systemd’s native sandboxing features](https://www.redhat.com/sysadmin/mastering-systemd), including a restrictive `seccomp` filter.

Given the difficulty in exploiting Firecracker _even with_ control over the kernel, the solutions suggested provide a solid foundation for security when used in multi-tenant architectures.

# Acknowledgements

[Ian Nickles](https://twitter.com/_inickles), for his help with instrumentation, Rust, and general research.

[Andréa](https://twitter.com/and_zza), for her incredible work on the diagrams.

[Max Wittek](https://twitter.com/wwiimmaaxx), for his help with Firecracker.

# References

1. https://aws.amazon.com/about-aws/whats-new/2018/11/firecracker-lightweight-virtualization-for-serverless-computing/
2. https://googleprojectzero.blogspot.com/2021/06/an-epyc-escape-case-study-of-kvm.html
3. https://assets.amazon.science/96/c6/302e527240a3b1f86c86c3e8fc3d/firecracker-lightweight-virtualization-for-serverless-applications.pdf
4. https://www.talhoffman.com/2021/07/18/firecracker-internals/
5. https://wiki.qemu.org/Features/VirtioVsock#:~:text=virtio-vsock%20is%20a%20host,-agent%20or%20SPICE%20vdagent
6. https://stefano-garzarella.github.io/posts/2019-11-08-kvmforum-2019-vsock/
7. https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md
8. https://developer.ibm.com/articles/l-virtio/
9. https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html
10. https://doc.rust-lang.org/reference/unsafety.html
11. https://doc.rust-lang.org/reference/types/pointer.html