Kernel crash on x86 (but not on x86_64)

I seem to be triggering an unhandled kernel exception when doing seL4_Recv on a capability while the other thread has not yet performed a seL4_Send. This only happens on x86 and not on x86_64. I’m looking for some pointers in terms of how I could debug this. The backtrace seems to suggest that the kernel is trying to return to the idle thread but has incorrectly set tcbContext.registers[Error] prior to this (it seems to have put the return address for the idle_thread into tcbContext.registers[Error]). Here is the kernel exception log

========== KERNEL EXCEPTION ==========
Vector:  0xd
ErrCode: 0x200
IP:      0xe0116405
SP:      0xe0142628
FLAGS:   0x86
CR0:     0x8001003b
CR2:     0x0 (page-fault address)
CR3:     0x13a000 (page-directory physical address)
CR4:     0x290

Stack Dump:
*0xe0142628 == 0x8
*0xe014262c == 0x202
*0xe0142630 == 0x0
*0xe0142634 == 0x0
*0xe0142638 == 0x10
*0xe014263c == 0x0
*0xe0142640 == 0x0
*0xe0142644 == 0x7
*0xe0142648 == 0x0
*0xe014264c == 0x0
*0xe0142650 == 0x0
*0xe0142654 == 0x0
*0xe0142658 == 0x0
*0xe014265c == 0x0
*0xe0142660 == 0x0
*0xe0142664 == 0x0
*0xe0142668 == 0x0
*0xe014266c == 0x0
*0xe0142670 == 0x0
*0xe0142674 == 0x0

Halting...
halting...
Kernel entry via Interrupt, irq 157

Here 0xe0116405 points to the iret instruction in sel4/src/arch/x86/32/c_traps.c:restore_user_context(). At this point the kernel has switched to the idle thread (as can be seen from examining ksCurThread) and is trying to restore the user context of the idle thread, but for some reason the saved context is messed up.

I’m using clang to compile the kernel and the root task by the way.

Hi @cl91, this is an interesting one. The ia32 version isn’t verified, but it is mature and I’d expect the context switching code to be pretty robust by now. The basic scenario you describe (a blocking seL4_Recv) is working fine in the regression tests, so there is likely something more complex going on before it.

Can you say which revision of seL4 you’re using? (which release version or git sha)

To debug this, the first step would be to try to reduce the scenario to a minimal repeatable example so that it becomes easier to figure out where things go wrong. If you can post that example, then others can run it and potentially help.

Thank you @gerwin.klein for the reply! I’m building a small system on seL4 which loads PE (not ELF) images and serves “system services” to these client threads via seL4 IPC calls. The client threads will have their own separate CSpace and VSpace that are isolated from other threads. The seL4 version I’m using is the latest release 12.1.0 (sha 21c1a2ca7). I’m trying to reduce it to a minimal example but the system has around 10k lines of code already, so that might take a while. I can post the full git repo if anyone thinks it’s useful (this is a personal project, not a school assignment or company project lol).

Meanwhile I’ve added some debugging printf’s to the seL4 source — it seems that the user context is off by one when the interrupt handler pushes the user context into the tcbContext struct. Here are the idle thread tcb user context dump immediately after entering c_handle_interrupt

entered c_handle_interrupt from symbol address 0xe0116364 (which is user_interrupt)
c_handle_interrupt's frame pointer 0xe0139010
stack frame (at 0xe0139010) immediately upon entering c_handle_interrupt:
 0xe0138ff8  0xe0116364  0x9d  0  0xe0142400  0  0  0 
user context of the idle thread (which is the current thread) immediately upon entering c_handle_interrupt
EAX = 0
EBX = 0
ECX = 0
EDX = 0
ESI = 0
EDI = 0xe0142630
EBP = 0xe01166e1
FaultIP = 0
Error = 0xe01166e1
NextIP = 0x8
CS = 0x202
FLAGS = 0
ESP = 0
SS = 0x10
FS_BASE = 0
GS_BASE = 0
handleInterruptEntry

========== KERNEL EXCEPTION ==========
Vector:  0xd
ErrCode: 0x200
IP:      0xe0116415
SP:      0xe0142628
FLAGS:   0x86
CR0:     0x8001003b
CR2:     0x0 (page-fault address)
CR3:     0x13a000 (page-directory physical address)
CR4:     0x290

Stack Dump:
*0xe0142628 == 0x8
*0xe014262c == 0x202
*0xe0142630 == 0x0
*0xe0142634 == 0x0
*0xe0142638 == 0x10
*0xe014263c == 0x0
*0xe0142640 == 0x0
*0xe0142644 == 0x7
*0xe0142648 == 0x0
*0xe014264c == 0x0
*0xe0142650 == 0x0
*0xe0142654 == 0x0
*0xe0142658 == 0x0
*0xe014265c == 0x0
*0xe0142660 == 0x0
*0xe0142664 == 0x0
*0xe0142668 == 0x0
*0xe014266c == 0x0
*0xe0142670 == 0x0
*0xe0142674 == 0x0

Halting...
halting...
Kernel entry via Interrupt, irq 157

It appears to me that edi should be zero and ebp should be what edi is, and FaultIP shouldn’t be zero and should be what ebp is. Error should clearly be zero and its value should really be what NextIP should have been. Clearly the CS selector should have been 0x8, and 0x202 should have been the eflags, etc.

Note that this wasn’t the case for the system calls/interrupts that don’t trigger this kernel crash. For instance, the user context dump for the second-to-last system call/interrupt appears quite normal

entered c_handle_interrupt from symbol address 0xe0113e1e (which is user_interrupt)
c_handle_interrupt's frame pointer 0xe013900c
stack frame (at 0xe013900c) immediately upon entering c_handle_interrupt:
 0  0xe0113e1e  0x9d  0  0  0xe0142400  0  0 
user context of the idle thread (which is current thread) immediately upon entering c_handle_interrupt
EAX = 0
EBX = 0
ECX = 0
EDX = 0
ESI = 0
EDI = 0
EBP = 0
FaultIP = 0xe01166d0
Error = 0
NextIP = 0xe01166d0
CS = 0x8
FLAGS = 0x202
ESP = 0
SS = 0x10
FS_BASE = 0
GS_BASE = 0
handleInterruptEntry


entered c_handle_interrupt from symbol address 0xe0113e1e (which is user_interrupt)
c_handle_interrupt's frame pointer 0xe013900c
stack frame (at 0xe013900c) immediately upon entering c_handle_interrupt:
 0xe0142630  0xe0113e1e  0x9d  0  0  0xe0142400  0  0 
user context of the idle thread (which is current thread) immediately upon entering c_handle_interrupt
EAX = 0
EBX = 0
ECX = 0
EDX = 0
ESI = 0
EDI = 0xe0142630
EBP = 0xe01166e1
FaultIP = 0
Error = 0xe01166e1
NextIP = 0x8
CS = 0x202
FLAGS = 0
ESP = 0
SS = 0x10
FS_BASE = 0
GS_BASE = 0
handleInterruptEntry

========== KERNEL EXCEPTION ==========
Vector:  0xd
ErrCode: 0x200
IP:      0xe0116415
SP:      0xe0142628
FLAGS:   0x86
CR0:     0x8001003b
CR2:     0x0 (page-fault address)
CR3:     0x13a000 (page-directory physical address)
CR4:     0x290

Stack Dump:
*0xe0142628 == 0x8
*0xe014262c == 0x202
*0xe0142630 == 0x0
*0xe0142634 == 0x0
*0xe0142638 == 0x10
*0xe014263c == 0x0
*0xe0142640 == 0x0
*0xe0142644 == 0x7
*0xe0142648 == 0x0
*0xe014264c == 0x0
*0xe0142650 == 0x0
*0xe0142654 == 0x0
*0xe0142658 == 0x0
*0xe014265c == 0x0
*0xe0142660 == 0x0
*0xe0142664 == 0x0
*0xe0142668 == 0x0
*0xe014266c == 0x0
*0xe0142670 == 0x0
*0xe0142674 == 0x0

Halting...
halting...
Kernel entry via Interrupt, irq 157

As one can see that in the first user context dump, things looked good, and the second user context dump seems be off-by-one — which I think is what triggered the kernel exception.

I believe that the first (ie. second to last) of the interrupt in the previous user context dump was the timer interrupt, which was handled correctly. The last interrupt which did trigger the kernel exception seems to come from the idle_thread function (more specifically the hlt instruction, which 0xe01166e1 points to). It is also interesting to note that if I add a small delay (by looping a dummy variable) at the very beginning of c_handle_interrupt, it seems to prevent the crash from showing (I suspect it’s because the timer interrupt kicks in at the right time, correcting the user context).

Ok I figured it out! The problem lies in the idle_thread function, which has the usual function prologue push %ebp etc. generated by the C compiler. However the idle thread does not have a stack — the “stack” it uses it the idle thread tcb, so when it pushes %ebp on to the “stack” it offsets the idle thread “user” context by one, which causes a kernel exception later on when the context switching code tried to restore user context. This should have been optimized out (as is pointed out in src/arch/arm/32/idle.c), but for some reason even with -O2 optimization clang does not optimize the prologue away. I think the safest way to write the function is as following

__attribute__((naked)) void idle_thread(void)
{
    asm volatile(
        "1: hlt\n"
        "jmp 1b"
	);
}

Nice find! I like that version better as well. @kent-mcleod2 would you have any concerns with using this for the idle thread on x86?

It’s curious that clang doesn’t do the expected optimisation in this case, I’d like to figure out where that is coming from (it does seem to do it correctly in the regression test builds, and if we can’t rely on it, we should try to figure out when/when not).

Which clang version are you using? Are you using the cmake build system from seL4 or something different to invoke the compiler?

@gerwin.klein I am indeed using the cmake build system from seL4, but the cmake files are somewhat heavily customized (based on the sel4-tutorials and other sel4 projects), mostly because the build system needs to generate both PE and ELF targets (this is precisely why I chose clang, because it is a native cross compiler). I’m not modifying the cmake files under the sel4 kernel source tree though. After checking the build output, it does seem that -O2 is passed to clang when building the kernel, so I think it comes down to the compiler not always optimizing the prologue away even with -O2 (perhaps when -g is passed) — I’m not familiar with clang enough to speculate how it behaves though. I’m using clang version 12.0.1 from the official archlinux repository.

Anyway, I think it’s probably a good idea to force the compiler to eliminate the function prologue for idle_thread, without relying on the optimization flags to do so, since the system wouldn’t behave correctly otherwise. I have made a patch (it’s a trivial patch really) and if you accept outside contributions I can submit it on github. If you prefer in-house fixes that’s also fine.

Out of curiosity, how does this affect the verification story? I know ia32 is not verified, but x86_64 is. Presumably this can also happen in x86_64 targets? If the compiler doesn’t optimize the idle_thread prologue away, doesn’t this make the x86_64 binary functionally incorrect?

Ok, I think the test is on clang 8, so there might be quite a bit of difference, and -g is possibly the bit that switches off that particular optimisation. I agree that your version is safer.

Perfectly happy to take a pull request on GithHub, see doc site for conventions (this one is straightforward).

For the verification side, the implementation of the idle thread is outside the verified code base, so this would hit the verified x64 version as well. There are tests for the code outside of the verified code base, of course, that’s why I was keen to figure out where they might be going wrong.

Ok I submitted a pull request x86: Always eliminate function prologue for idle_thread() by cl91 · Pull Request #510 · seL4/seL4 · GitHub

The only problem I can see is that some older compilers may not have implemented the __attribute(naked). clang I believe has it around 2017. GCC has this attribute I believe dating back in 2018, so it should be reasonably safe to assume that most recent compilers have it (MSVC has __declspec(naked) since forever).

My practical worry about the “naked” attribute is not that it is supposed, but also that it is implemented as expected. Even if the compilers supports it, there are some issues around to be aware of, see Bug List. It be better to write plain asm code in dedicated asm files then.