CVE-2021-41073: Exploiting an io_uring Type Confusion on an Android 5.10 Kernel

This was my first kernel exploit on a real Android device, and exploitation on Android really does feel more constrained.
The analysis is kept brief; the focus here is on exploitation.

Analysis

Because io_uring is asynchronous, an application has to track which buffers are currently in flight, which can get complicated. IORING_OP_PROVIDE_BUFFERS adds automatic buffer management: the application first registers a group of buffers (a buf_group) with the kernel via IORING_OP_PROVIDE_BUFFERS, with each buffer in the group identified by a bid. After registration, a read/write request can carry the IOSQE_BUFFER_SELECT flag and name a buf_group; the kernel then picks a free buffer from that group for the transfer and reports the chosen bid in the CQE.

The io_kiocb structure is the in-kernel representation of one io_uring request. It begins with a union holding the per-opcode data needed by the different request types; for read/write requests that member is io_rw.

When the kernel uses the io_rw structure, its addr field means different things in different situations: it may point to a user buffer, or to an io_buffer structure (with IOSQE_BUFFER_SELECT, i.e. a kernel-managed buffer). That is a recipe for confusion, and the kernel uses the REQ_F_BUFFER_SELECTED flag in req->flags to tell the two cases apart.

If the kernel handles rw.addr without checking this flag, a type confusion arises.
loop_rw_iter is the function that actually performs the read/write. At the start of each iteration it checks REQ_F_BUFFER_SELECTED via iov_iter_is_bvec and branches accordingly — that is the correct pattern. But the advance step at the end of the loop is missing the check, so req->rw.addr gets offset even when it is an io_buffer pointer.

As a result, the subsequent io_put_rw_kbuf frees the offset pointer — and the offset is the user-controlled read/write length — which yields an arbitrary-offset free primitive.

Distilling the bug pattern: a kmalloc'd pointer should never be freed after being offset. I modeled that behavior with CodeQL and did find a few similar one-field-many-meanings usages, but unfortunately all of them performed the necessary checks.

Exploitation

Basic primitives

One thing worth noting about this arbitrary-offset free primitive: unlike a traditional UAF or double free, it lends itself very easily to cross-cache attacks. This exploit, however, does not take the cross-cache route.

With the UAF in hand, the next step is finding structures to spray and to read/write data through. The usual spray-and-read structures are msg_msg and user_key_payload, and the usual write structure (really a syscall) is setxattr (paired with userfaultfd or FUSE, since the user data lives in a temporary object that is freed when the call returns) — but none of these are usable on Android.

There is, however, a hole-punching technique that approximates userfaultfd/FUSE on Android.

punching hole

The fallocate syscall manipulates disk space directly. Its default behavior is to allocate disk space to a file, guaranteeing that later writes within the given range will not fail for lack of space.

Allocating disk space
    The default operation (i.e., mode is zero) of fallocate()
    allocates the disk space within the range specified by offset and
    size. The file size (as reported by stat(2)) will be changed if
    offset+size is greater than the file size. Any subregion within
    the range specified by offset and size that did not contain data
    before the call will be initialized to zero. This default
    behavior closely resembles the behavior of the posix_fallocate(3)
    library function, and is intended as a method of optimally
    implementing that function.

    After a successful call, subsequent writes into the range
    specified by offset and size are guaranteed not to fail because of
    lack of disk space.

And where there is allocation, there is deallocation.

Deallocating file space
    Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
    2.6.38) in mode deallocates space (i.e., creates a hole) in the
    byte range starting at offset and continuing for size bytes.
    Within the specified range, partial filesystem blocks are zeroed,
    and whole filesystem blocks are removed from the file. After a
    successful call, subsequent reads from this range will return
    zeros.

When shmem_fallocate performs FALLOC_FL_PUNCH_HOLE, it sets inode->i_private = &shmem_falloc, and until the punch finishes, any thread touching that range of the mapping is put on a wait queue.

static long shmem_fallocate(struct file *file, int mode, loff_t offset,
                            loff_t len)
{
    struct inode *inode = file_inode(file);
    struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
    struct shmem_inode_info *info = SHMEM_I(inode);
    struct shmem_falloc shmem_falloc;
    pgoff_t start, index, end, undo_fallocend;
    int error;

    if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
        return -EOPNOTSUPP;

    inode_lock(inode);

    if (mode & FALLOC_FL_PUNCH_HOLE) {
        struct address_space *mapping = file->f_mapping;
        loff_t unmap_start = round_up(offset, PAGE_SIZE);
        loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
        DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);

        /* protected by i_rwsem */
        if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
            error = -EPERM;
            goto out;
        }

        shmem_falloc.waitq = &shmem_falloc_waitq;
        shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;
        shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
        spin_lock(&inode->i_lock);
        inode->i_private = &shmem_falloc;
        spin_unlock(&inode->i_lock);

        if ((u64)unmap_end > (u64)unmap_start)
            unmap_mapping_range(mapping, unmap_start,
                                1 + unmap_end - unmap_start, 0);
        shmem_truncate_range(inode, offset, offset + len - 1);
        /* No need to unmap again: hole-punching leaves COWed pages */

        spin_lock(&inode->i_lock);
        inode->i_private = NULL;
        wake_up_all(&shmem_falloc_waitq);
        WARN_ON_ONCE(!list_empty(&shmem_falloc_waitq.head));
        spin_unlock(&inode->i_lock);
        error = 0;
        goto out;
    }

static vm_fault_t shmem_fault(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct inode *inode = file_inode(vma->vm_file);
    gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
    int err;
    vm_fault_t ret = VM_FAULT_LOCKED;

    /*
     * Trinity finds that probing a hole which tmpfs is punching can
     * prevent the hole-punch from ever completing: which in turn
     * locks writers out with its hold on i_rwsem. So refrain from
     * faulting pages into the hole while it's being punched. Although
     * shmem_undo_range() does remove the additions, it may be unable to
     * keep up, as each new page needs its own unmap_mapping_range() call,
     * and the i_mmap tree grows ever slower to scan if new vmas are added.
     *
     * It does not matter if we sometimes reach this check just before the
     * hole-punch begins, so that one fault then races with the punch:
     * we just need to make racing faults a rare case.
     *
     * The implementation below would be much simpler if we just used a
     * standard mutex or completion: but we cannot take i_rwsem in fault,
     * and bloating every shmem inode for this unlikely case would be sad.
     */
    if (unlikely(inode->i_private)) {
        struct shmem_falloc *shmem_falloc;

        spin_lock(&inode->i_lock);
        shmem_falloc = inode->i_private;
        if (shmem_falloc &&
            shmem_falloc->waitq &&
            vmf->pgoff >= shmem_falloc->start &&
            vmf->pgoff < shmem_falloc->next) {
            struct file *fpin;
            wait_queue_head_t *shmem_falloc_waitq;
            DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);

            ret = VM_FAULT_NOPAGE;
            fpin = maybe_unlock_mmap_for_io(vmf, NULL);
            if (fpin)
                ret = VM_FAULT_RETRY;

            shmem_falloc_waitq = shmem_falloc->waitq;
            prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
                            TASK_UNINTERRUPTIBLE);
            spin_unlock(&inode->i_lock);
            schedule();

            /*
             * shmem_falloc_waitq points into the shmem_fallocate()
             * stack of the hole-punching task: shmem_falloc_waitq
             * is usually invalid by the time we reach here, but
             * finish_wait() does not dereference it in that case;
             * though i_lock needed lest racing with wake_up_all().
             */
            spin_lock(&inode->i_lock);
            finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
            spin_unlock(&inode->i_lock);

            if (fpin)
                fput(fpin);
            return ret;
        }
        spin_unlock(&inode->i_lock);
    }

This gives us a primitive where a kernel thread blocks while reading or writing user-space data — much like FUSE/userfaultfd, but triggerable on Android.
How long it blocks depends on the size passed to fallocate (presumably tied to the number of reverse mappings).

In earlier attempts I kept assuming shmem_sz was capped at 0x80000000, because anything beyond it returned EINVAL.

#define PAGE_SIZE 0x1000
size_t shmem_sz = (0x1000 * 0x7f) * PAGE_SIZE;

But when I later tried to improve reliability, I could not find any such limit in the fallocate source. Debugging showed the computed shmem_sz was 0xffffffff80000000 — clearly wrong. After making PAGE_SIZE an unsigned long, fallocate can cover a much larger range.

#define PAGE_SIZE 0x1000UL
size_t shmem_sz = (0x1000 * 0x7f) * PAGE_SIZE;

punching hole + setxattr: generic Android heap-spray read/write

In kernel exploitation setxattr is mostly used to write a payload, but it can serve as a read as well.

leak with pipe_buffer

Modern kernel exploits reach for the pipe_buffer structure constantly, and it really is good to work with — though it only covers sizes of 0x28 bytes and up.

  1. The allocation size is elastic, fitting most situations.
  2. It starts with a page pointer; overwrite that and you can read/write arbitrary data.
  3. The flags field can be used to stage dirtypipe.
    However, on aarch64 the DMA cache alignment is 0x80 bytes, so the smallest slab is kmalloc-128, and the 0x20-byte io_buffer can come from the same slab as pipe_buffer.

Spray blocking setxattr calls and slip the vulnerable object's allocation in among them; even with CONFIG_SLAB_FREELIST_RANDOM, this makes it very likely that the vulnerable object's neighbor is one of our target objects. Then trigger the invalid free and spray pipe_buffer to reclaim the slot.

sem_post(&punching_sem);
for (int i = 0; i < NSETXATTR; ++i)
{
    sem_post(&setxattr_sem); // let one thread start its setxattr
    if (i == NSETXATTR / 2)
        pthread_mutex_unlock(&lock1); // allocate the vuln object mid-spray
}

pthread_mutex_unlock(&lock2); // trigger the invalid free

That lets us write the pipe_buffer's contents into an xattr via setxattr and read them back out...
In practice, it is not that simple.

io_uring is asynchronous: the invalid free happens in the context of an io-wrk thread, and the ability to pin io_uring worker threads to specific CPUs only landed in kernel 5.14 — Android's 5.10 series does not have it. So the freed setxattr object can end up on any kmem_cache_cpu.

if (0 != io_uring_register_iowq_aff(&ring, sizeof(cpu_set_t), &mask))
{
    fprintf(stderr, "++ register failed: %m\n");
    goto done;
}

The only option is to spray on every core to win back the freed setxattr object, which hurts the success rate on an 8-core device.

#define PIPE_RESIZE 0x2000
for (int i = 0; i < NCORES; ++i)
{
    bind_core(i);
    for (int j = 0; j < NPIPEFD_PER_CORE; ++j)
    {
        ret = fcntl(pipe_fds[i*NPIPEFD_PER_CORE + j][1], F_SETPIPE_SZ, PIPE_RESIZE); // kmalloc-128
        if (ret < 0)
            err_exit("Failed set pipe_sz");
    }
}

construct page-level uaf && aar / aaw with pipe_buffer

Once setxattr returns, the pipe_buffer structure is freed again — but this time on CPU 0 (the exploit thread's context), which is stable. So the punching hole + setxattr technique can be reused to overwrite pipe_buffer's page field and gain arbitrary address read/write (if writing read-only files is all you need, dirtypipe can be staged right here).

But the object gets freed after every setxattr write, so each write risks the slot being snatched by noise. Hence the decision to push on and build a page-level UAF.

Add 0x40 to the pipe_buffer.page field — struct page is 0x40 bytes here, so it now points at the struct page of the next physically contiguous page (order-0 pages are sprayed beforehand to arrange physical contiguity). Then close one of the pipes, completing the page-level UAF.

Spray pipe_buffer again to reclaim the just-freed page, and one pipe can now rewrite another pipe's pipe_buffer; thanks to the pipe's tmp_page behavior, this becomes an unlimited arbitrary read/write primitive. Finally, use it to walk the task tree, overwrite the current cred, shut off SELinux, and complete the privilege escalation.

Other generic Android heap-spray primitives

PR_SET_VMA_ANON_NAME

Linux 5.17 introduced the ability to name anonymous VMAs, gated behind CONFIG_ANON_VMA_NAME (enabled by default on Android).

PR_SET_VMA_ANON_NAME
    Set a name for anonymous virtual memory areas. val should
    be a pointer to a null-terminated string containing the
    name. The name length including null byte cannot exceed 80
    bytes. If val is NULL, the name of the appropriate
    anonymous virtual memory areas will be reset. The name can
    contain only printable ascii characters (isprint(3)),
    except '[', ']', '\', '$', and '`'.

At most 80 bytes, and only a well-formed printable string. A read primitive can be built by reading back /proc/self/maps.

struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
    struct anon_vma_name *anon_name;
    size_t count;

    /* Add 1 for NUL terminator at the end of the anon_name->name */
    count = strlen(name) + 1;
    anon_name = kmalloc(struct_size(anon_name, name, count), GFP_KERNEL);
    if (anon_name) {
        kref_init(&anon_name->kref);
        memcpy(anon_name->name, name, count);
    }

    return anon_name;
}

static inline bool is_valid_name_char(char ch)
{
    /* printable ascii characters, excluding ANON_VMA_NAME_INVALID_CHARS */
    return ch > 0x1f && ch < 0x7f &&
           !strchr(ANON_VMA_NAME_INVALID_CHARS, ch);
}

KBASE_JIT_FREE_PREPARE

Some in-the-wild exploits use the Android GPU for heap-spray reads and writes; the Mali GPU's kbase_jit_free_prepare function, for instance, offers a powerful spray primitive with fully controllable size and contents.

pipe_buffer again

And of course, cross-cache heap-spray read/write through pipe_buffer.page is standard practice by now.

DirtyCred

I also tried building this exploit with DirtyCred.
The core idea is to have a high-privilege cred reclaim a freed low-privilege cred's slot,
or, conversely, a low-privilege file reclaim a high-privilege file's slot.

Problems on Android:

  • No way to spray high-privilege creds, since Android has no suid binaries. They can still be sprayed by creating kernel threads — for instance, on kernels after 5.14, an unprivileged user can spawn kernel threads through io_uring's IORING_SETUP_IOPOLL.
  • Spraying files works, but going from an arbitrary-file-write primitive to root, past SELinux, is still a long way.





  • Copyright notice: unless otherwise stated, all articles on this blog are the author's own work. Please credit the source when reposting!
  • Copyrights © 2022-2025 翰青HanQi
