Kernel triple fault when retyping untyped memory on x86 qemu

FennelFoxxo · September 2, 2024, 8:09am

Hi all!

I’ve just started messing with seL4, and I’m currently trying to allocate objects from untyped memory passed to the rootserver. However, I’ve noticed that when I call seL4_Untyped_Retype on a specific untyped memory capability, a triple fault occurs in the kernel. There is no error message - the machine simply resets. The qemu log also reports a triple fault occurred.

I suspect this may have to do with the way I’m booting the kernel, because it works fine when I run the simulate script in the build folder. I’m booting it by creating a GRUB boot image that multiboots the kernel and rootserver, and I haven’t had any other issues with this so far. If anyone has any insight into why there might be a difference between the two boot methods, it would be greatly appreciated. Some more details are below:

The fault seems to happen no matter how much memory I assign qemu, the type of object I retype from the memory, the size of the object, or which empty slot I use as the destination.
The physical address of the untyped memory is 0xA00000 and it is 2MB in size. OSDev’s typical x86 memory map doesn’t list anything special about this region.
From what I can tell, the fault seems to be occurring in the memzero function, called from clearMemory, from resetUntypedCap, from invokeUntyped_Retype. Specifically, it occurs when a 0 is written to address 0xffffff8000a0df00.
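
For reference, the arithmetic behind that address can be sketched as below. This assumes the x86_64 kernel maps physical memory through a window starting at 0xffffff8000000000; the constant and helper names are made up for illustration, not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: base of the kernel's physical-memory window on x86_64. */
#define PPTR_BASE_ASSUMED UINT64_C(0xffffff8000000000)

/* Hypothetical helper: kernel-window virtual address -> physical address. */
static uint64_t window_vaddr_to_paddr(uint64_t vaddr)
{
    assert(vaddr >= PPTR_BASE_ASSUMED);
    return vaddr - PPTR_BASE_ASSUMED;
}

/* Hypothetical helper: does paddr fall inside an untyped of 2^size_bits bytes? */
static bool in_untyped(uint64_t paddr, uint64_t ut_paddr, unsigned size_bits)
{
    assert(size_bits < 64);
    return paddr >= ut_paddr && paddr - ut_paddr < (UINT64_C(1) << size_bits);
}
```

Under that assumption, window_vaddr_to_paddr(0xffffff8000a0df00) gives 0xa0df00, and in_untyped(0xa0df00, 0xA00000, 21) is true, so the faulting write lands inside the 2MB untyped being cleared.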

If any other details are required or if this post belongs somewhere else, let me know! Any guidance is appreciated.

Indan · September 3, 2024, 8:58am

Are you using the current version from git? If not, try that first.

I have no experience running seL4 on x86, so no quick insights that may help you here.

Does the same happen with other untyped memories? Or does any retype call cause a triple fault?

Double check that the machine qemu is emulating matches the seL4 kernel configuration, e.g. hypervisor support.

I think the kernel might assume all user accessible memory is above the kernel itself, and 0xA00000 may break that assumption. But I’m not sure about this and it may only apply to platforms other than x86, or more likely, not at all.

Or something went quite wrong and the kernel itself is located at 0xA00000.

Indan · September 3, 2024, 9:05am

The OSDev page does say something about 0xA00000, which lies in the extended memory range 0x00100000-0x00EFFFFF:

"Free for use except that your bootloader (ie. GRUB) may have loaded your “modules” here, and you don’t want to overwrite those."

FennelFoxxo · September 3, 2024, 12:22pm

Oops, I think you might be right! I completely missed that footnote on the OSDev page. After double checking the log, I do see the kernel is being loaded from 0x100000 - 0xC12000. Still though, if the kernel knows that address range is being used by itself, it shouldn’t provide it as retype-able memory - after all, userland code has no idea that memory should be off-limits.
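
The overlap can be checked with a few lines of C. This is only a sketch: it treats the range printed in the log as half-open, and the struct and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open physical address range [start, end). */
struct phys_range {
    uint64_t start;
    uint64_t end;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool ranges_overlap(struct phys_range a, struct phys_range b)
{
    return a.start < b.end && b.start < a.end;
}

/* Hypothetical helper: range covered by an untyped of 2^size_bits bytes. */
static struct phys_range untyped_range(uint64_t paddr, unsigned size_bits)
{
    assert(size_bits < 64);
    struct phys_range r = { paddr, paddr + (UINT64_C(1) << size_bits) };
    return r;
}

/* Kernel load range as reported in the log: 0x100000 - 0xC12000. */
static const struct phys_range kernel_image = { 0x100000, 0xC12000 };
```

With these definitions, ranges_overlap(untyped_range(0xA00000, 21), kernel_image) is true: the 2MB untyped covers 0xA00000 up to 0xC00000, entirely inside the range the kernel was loaded into.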

With that said, there’s another problematic untyped (4KB @ 0x80A000) that really does seem to be outside any used address ranges. This time I actually get a proper kernel exception message when attempting to retype it, which I’m including in case it proves useful. Through decompiling, it looks like the IP is at the start of the restore_user_context function this time - weird.

========== KERNEL EXCEPTION ==========
Vector:  0xe
ErrCode: 0x0
IP:      0xffffffff8080a5d0
SP:      0xffffffff80a073e0
FLAGS:   0x92
CR0:     0x8001003b
CR2:     0x0 (page-fault address)
CR3:     0xc0f000 (page-directory physical address)
CR4:     0x668

Stack Dump:
*0xffffffff80a073e0 == 0xffffffff8081de41
*0xffffffff80a073e8 == 0x0
*0xffffffff80a073f0 == 0xffffffff8081e19d
*0xffffffff80a073f8 == 0xffffffffffffffff
*0xffffffff80a07400 == 0xffffffff80a07458
*0xffffffff80a07408 == 0x18d
*0xffffffff80a07410 == 0x0
*0xffffffff80a07418 == 0x0
*0xffffffff80a07420 == 0xffffff80bff48400
*0xffffffff80a07428 == 0xffffffffffffffff
*0xffffffff80a07430 == 0x1086
*0xffffffff80a07438 == 0x18d
*0xffffffff80a07440 == 0x0
*0xffffffff80a07448 == 0x0
*0xffffffff80a07450 == 0xffffffff8081e753
*0xffffffff80a07458 == 0x0
*0xffffffff80a07460 == 0x1000ff800080a000
*0xffffffff80a07468 == 0x100000c
*0xffffffff80a07470 == 0xffffff80bff231a0
*0xffffffff80a07478 == 0xffffffff8081dc0f

Halting...
halting...
Kernel entry via Syscall, number: 1, Call
Cap type: 2, Invocation tag: 1
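
For anyone reading the dump: vector 0xe is the x86 page-fault exception, and the error code is a small bit field. A minimal decoding sketch follows, using the standard x86 bit layout; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 0xeu /* x86 page-fault exception vector */

struct pf_info {
    bool present;  /* false: page not present, true: protection violation */
    bool write;    /* false: read access, true: write access */
    bool user;     /* false: fault raised in supervisor mode, true: user mode */
    bool reserved; /* reserved bit set in a paging-structure entry */
    bool fetch;    /* fault caused by an instruction fetch */
};

/* Decode the error code pushed for a page fault (bits 0 to 4 only). */
static struct pf_info decode_pf(uint32_t vector, uint32_t err)
{
    assert(vector == PF_VECTOR); /* the bits only mean this for page faults */
    struct pf_info info = {
        .present  = (err >> 0) & 1u,
        .write    = (err >> 1) & 1u,
        .user     = (err >> 2) & 1u,
        .reserved = (err >> 3) & 1u,
        .fetch    = (err >> 4) & 1u,
    };
    return info;
}
```

Decoded this way, ErrCode 0x0 has every flag clear: a supervisor-mode access to a non-present page that was not a write. Together with CR2 being 0x0, that suggests the kernel itself touched address 0.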

To answer your other questions though:

Yep, I’m using the current version from git - I just redownloaded all components from git (kernel, tools, musllibc, seL4_libs, util_libs, and runtime).
Most of the time retyping works ok, but sometimes a triple fault occurs, sometimes it just freezes without restarting, and sometimes it prints out a kernel exception message.
I’m running the boot image using the exact same qemu binary and command that the simulate script uses. The only difference is I’ve removed the -kernel and -initrd arguments, and added the -drive argument to use the boot image.

As a demonstration of the issue, I’ve created a GitHub repo, FennelFoxxo/retype_issue, that demonstrates the triple fault when retyping memory, for anyone to check out.

Thank you!

Indan · September 5, 2024, 10:14am

That’s still in the 0x00100000-0x00EFFFFF range…

FennelFoxxo · September 5, 2024, 11:17am

Sure, it is in that range. I just provided it as an example in case the exception log and stack trace would be useful to someone in the future, and as a further example of weird behavior (even if it might be expected given what we’ve learned).

Still though, why does the kernel provide a capability to that memory if I’m not supposed to use it? The kernel knows where it is loaded - it even prints out the address during boot. I admit I’m not familiar with the codebase, but I imagine it should be possible to exclude these regions from being provided to the root task, no?

Until then, I’ll just use untypeds above 0x00EFFFFF. Giving up a handful of MB on x86 won’t be the end of the world!
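
A workaround along those lines might look roughly like this. The struct only mimics the physical address, size bits and device flag that the bootinfo untyped descriptors carry, so the sketch compiles on its own; in a real root task the list would come from the bootinfo. The cutoff constant and the function name are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a bootinfo untyped descriptor (assumed field layout). */
struct untyped_desc {
    uint64_t paddr;     /* physical address of the untyped region */
    uint8_t  size_bits; /* region size is 2^size_bits bytes */
    uint8_t  is_device; /* non-zero for device memory */
};

/* Assumption: skip everything up to and including 0x00EFFFFF. */
#define SAFE_PADDR_MIN UINT64_C(0x00F00000)

/*
 * Return the index of the first non-device untyped that starts at or above
 * SAFE_PADDR_MIN and holds at least 2^want_bits bytes, or -1 if none does.
 */
static int pick_safe_untyped(const struct untyped_desc *list, size_t count,
                             unsigned want_bits)
{
    assert(list != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (list[i].is_device) {
            continue; /* device memory cannot back ordinary kernel objects */
        }
        if (list[i].paddr < SAFE_PADDR_MIN) {
            continue; /* possibly overlaps the region GRUB loaded into */
        }
        if (list[i].size_bits < want_bits) {
            continue; /* too small for the requested object */
        }
        return (int)i;
    }
    return -1;
}
```

The cutoff only sidesteps the problem; it does not explain why the kernel hands out untypeds that overlap its own image.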

Thanks for the help!

Indan · September 5, 2024, 2:38pm

As far as I know, the kernel isn’t supposed to do this. So either the way you boot the system breaks an assumption made by seL4, or it’s a bug in seL4, or a bug in GRUB.

If you want to get to the bottom of this you could add debug prints to the seL4 bootup code.
