Selecting right strategy for realizing processor diagnostic-self-testing at runtime for safety

As I’m new to seL4, locking for the right strategy to realize processor diagnostic-self-testing at runtime for a safety application. As I understand “normal” seL4 task operate on EL0 (ARM) and therefore have only limited access to assembler commands and no access to special registers etc. (e.g. verifying cache functionality, mmu functionality, …)
To my current understanding I cannot realize this with a normal seL4 task, but an extension of the kernel is required to establish a, let’s call it system-call - callback, to attach such function to a seL4 system, operating at kernel EL level or below.
What do you think?

Can you be more specific about the functions you need? There are some tracing features available in the kernel that can be enabled. There are also some branches that try to improve things here to support more trace points. There is also a syscall that can be enabled to execute a function in kernel context.

Obviously I’m to brief. We are evaluating seL4 for use in highly dependable systems, according to IEC61508 and EN50128 with safety integrity level SIL3/4. From these regulations not only approved software is required, but also detailed runtime diagnostics of all hardware components involved in safety functions.
For the processor unit this means e.g.:

  • verifying each processor command
  • verifying each processor register
  • verifying effectiveness of memory protection/separation (MMU)
  • verifying caches (L1, L2, L3…) to be free of stuck-at errors (data/addresses)

    and other components like
  • verifying system timer (clock) to stay in its allowed drift window of some ppm
  • verifying memory (DDR ram) for stuck-at/cross-over/SEU errors (data/addresses)

Today with our own kernel, we establish the software modules related to these runtime diagnostics (verification) as modules called at kernel level.

If there is a syscall to let such modules executing at kernel level, as you say, this sounds reasonable way.
(Of course I know that such modules need be trusted and separation demands are somewhat eroded. But as told in the formal verification papers, you have to rely on proper hardware function…)

Interesting requirements. Currently there are no such system calls in seL4, but it sounds like most of these tests (if done in a traditional way) could be implemented in a high-privilege user-level thread without kernel involvement. In fact, some of these are done in the sel4test app for testing the kernel (e.g. MMU/kernel providing separation, some cache tests). All of these sel4test runs are implemented on user level. Same should be possible for RAM, clock, user-level registers and user-level processor commands.

For testing privileged instructions and registers, we would likely need to add a system call that does these. Do you have any pointers to example implementations of such tests on the architecture and platform you are interested in?

(I do feel the need to point out that the traditional way of establishing these things does not actually mean that much. E.g. you are not guaranteed to get separation from the MMU at runtime just because it provided separation in a few test cases. But I can see how it is useful to at least have a diagnostic test to make sure you are not in a state where it just always fails.)

Yes you are right, I think that as much as possible should be done on a high priority user-level task for each core. Even for user-land memory having a separate non-cacheable view on the physical memory will be a way to go.
For privileged commands and resources, exception handling, kernel memory etc. some kernel level solution might be necessary.
(If you have a bunch of “very traditional” notified-bodies and national regulative authorities, not only reasonable technical solutions but also effort to justify these to all of them are considered.
Nevertheless I still open for new approaches.)

I might be missing something here but have you considered running your testing procedure as a sort of additional boot loader stage (i.e. run the diagnostics that aren’t feasible to run from inside seL4 as bare metal code and only then proceed to boot into seL4)? If your regulations permit that it might be easier than having to modify the kernel.