Pre-RFC: FPU as an object

Summary

This RFC proposes introducing a new kernel object into seL4 that represents the FPU.

Motivation

Currently, seL4 treats the FPU as part of the thread context, saved in the TCB on preemption. Because the FPU state is large, the kernel switches it lazily: on a context switch, FPU access is disabled, so the first FPU instruction a thread then executes raises an exception, at which point the kernel saves and restores the FPU state.

As a result, TCB objects are large (2 KiB on aarch64), and this space is wasted on threads that never touch the FPU. The scheme also introduces a timing channel (e.g. CVE-2018-3665). By making FPU access explicit, so that threads have to present an FPU capability in order to use the FPU, we can halve the TCB size. And by switching to eager restoring, we can also mitigate the timing channel.

This will be implemented for all architectures.

Guide-level explanation

Threads that need to use the FPU will now need an FPU capability bound to their TCB. The FPU object must first be created via an seL4_Untyped_Retype() call. Once the FPU capability is created, it can be bound to a TCB (one TCB at a time) with seL4_TCB_BindFPU(). If the TCB no longer needs the FPU, it can be unbound via seL4_TCB_UnbindFPU(), which frees the FPU object for use by another thread. Deleting the TCB object also unbinds the FPU.
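
To make this concrete, the user-level flow could look like the sketch below. The seL4_FPUObject type constant and the slot parameters are illustrative assumptions rather than part of this proposal; seL4_TCB_BindFPU() and seL4_TCB_UnbindFPU() are the invocations described above.

#include <sel4/sel4.h>

/* Sketch: create an FPU object from an untyped and bind it to a thread.
 * seL4_FPUObject is an assumed name for the new object type. */
void give_thread_fpu(seL4_CNode root, seL4_Untyped untyped,
                     seL4_Word fpu_slot, seL4_TCB tcb)
{
    /* Retype part of the untyped into a single FPU object. */
    seL4_Error err = seL4_Untyped_Retype(untyped, seL4_FPUObject, 0,
                                         root, 0, 0, fpu_slot, 1);
    if (err != seL4_NoError) {
        /* handle allocation failure */
    }

    /* Bind the FPU object to the thread (one TCB at a time). */
    err = seL4_TCB_BindFPU(tcb, fpu_slot);
    if (err != seL4_NoError) {
        /* handle bind failure */
    }
}

/* Sketch: unbind, freeing the FPU object for another thread. */
void take_thread_fpu(seL4_TCB tcb)
{
    seL4_TCB_UnbindFPU(tcb);
}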

Use of the FPU by a TCB that does not have the appropriate FPU capability bound to it will raise a fault, which is delivered to the thread’s fault handler.

Reference-level explanation (aarch64)

fpu_t

This struct represents the FPU object created from untypeds; its layout is arch-specific. As an example, the aarch64 FPU registers (32 128-bit SIMD registers) span 512 B, plus a little more for the status/control registers. To keep the object at a compact power-of-two size, we move one register out into the TCB.

typedef struct fpu {
    /* Last quad-word located in TCB */
    uint64_t vregs[62];

    uint32_t fpsr;
    uint32_t fpcr;

    /* Backlink from fpu to TCB */
    struct tcb *fpuBoundTCB;
} fpu_t;
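
As a sanity check on the layout above, the aarch64 numbers add up to exactly 512 B: 62 double-words (496 B) plus fpsr/fpcr (8 B) plus the backlink pointer (8 B). A compile-time assertion along the following lines could pin the object to that power-of-two size (seL4_FPUBits = 9 is an assumed constant name):

/* Sketch: the object must retype cleanly at a power-of-two size. */
compile_assert(fpu_object_size_correct, sizeof(fpu_t) == BIT(seL4_FPUBits))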

tcb_fpu_t

This struct resides in the TCB and stores the necessary bookkeeping for the FPU, as well as the last quad-word register that was taken out of the FPU object.

typedef struct tcb_fpu {
    /* Object created from retyping an untyped */
    fpu_t *tcbBoundFpu;

    /* Last quad-word register in the fpu */
    uint64_t q31[2];
} tcb_fpu_t;

The existing FPU save/restore routines will be modified to account for the quad-word register that now lives in the TCB rather than in the FPU object.
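
As a rough illustration (not the actual kernel code), the modified save path on aarch64 might split as follows, where save_fpu_regs_0_30() stands in for the existing asm save sequence refactored to stop at q30, and tcbFpu is the assumed name of the tcb_fpu_t field in the TCB:

static void saveFpuState(tcb_t *tcb)
{
    fpu_t *fpu = tcb->tcbFpu.tcbBoundFpu;

    /* q0-q30 plus fpsr/fpcr are stored in the FPU object. */
    save_fpu_regs_0_30(fpu->vregs, &fpu->fpsr, &fpu->fpcr);

    /* q31 lands in the quad-word slot kept in the TCB. */
    asm volatile("str q31, [%0]" :: "r"(tcb->tcbFpu.q31) : "memory");
}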

TODO: Drawbacks

A side effect of this change is that most threads will end up having an FPU object bound to them, and since FPU access is currently implicit, the developer will have to make it explicit (more work). Even when no FPU usage is intended, the compiler may generate instructions that “borrow” FPU registers. A workaround is to force the compiler to use general-purpose registers only (e.g. -mgeneral-regs-only on GCC and Clang for aarch64). This would, however, prevent FPU usage where it is actually needed.

TODO: Rationale and alternatives

  • Why is this design the best in the space of possible designs?
  • What other designs have been considered and what is the rationale for not choosing them?
  • What is the impact of not doing this?

TODO: Prior art

Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are:

  • For ecosystem proposals: Does this feature exist in similar systems, and what experience have their communities had with it?
  • For community proposals: Is this done by some other community and what were their experiences with it?
  • What lessons can we learn from what other communities have done here?
  • Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.

This section is intended to encourage you as an author to think about the lessons from other systems, and to provide readers of your RFC with a fuller picture. If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or an adaptation from other systems.

Note that while precedent set by other systems is some motivation, it does not on its own motivate an RFC.

TODO: Unresolved questions

  • What needs to be resolved in further discussion before the RFC is approved?
  • What needs to be resolved during the implementation of this RFC?
  • What related questions are beyond the scope of this RFC but should be addressed after its implementation?

My opinion is that avoiding the FPU context can be an interesting option when targeting low-end systems. However, I have some doubts that saving this space really matters for the majority of users, as they don’t care about a few KiByte of RAM. We have to support all the MMU pages anyway, which easily cost a lot of KiBytes. So, how much space would an “average” system really gain? How many threads does a low-end system that has to run in 4 MiByte of RAM really have? Are seL4 systems with less than 1 MiByte of RAM really something to consider? Aren’t there other places where some memory could be found more easily?

There might be simpler options if it’s just about avoiding the side channel, so maybe we should start by addressing that issue without FPU objects.

As for the drawbacks, I see this mainly as an issue on ARMv8 (I’m not too deep into x86/ia64, unfortunately), where it was defined that the FPU registers are always there and there is not much CPU pipeline penalty when using them. So the additional registers are thankfully put to use nowadays; disabling them is just too painful when using 3rd-party code for this to really be an option.

If better context save/restore time is also a goal, I’d be interested in benchmark numbers for different architectures, and even for different CPUs within one architecture, comparing

  • disabling the FPU completely and switching only the core registers
  • always saving the FPU context on every switch.

Always saving/restoring the FPU register context might not be so expensive in practice if the CPU can run it in parallel with other activity thanks to good pipeline and cache design. There would have to be worst-case and best-case benchmarks then, and there is the question of what the average case actually looks like. Comparing the average case(s) for “always switching” and “lazy switching” might also be interesting, to answer if/where lazy switching really has an advantage. This would also tell us the price of the different side-channel protection options, and whether there is any gain in ISR latency when interrupt handlers have the option to be marked as running with the FPU disabled.

Axel

Some parts of the RFC template are still TODO, but those would be the most interesting part of this discussion 🙂. In particular, I’d be interested in which other points of the design space have been considered and why this one is the best solution in that space.

There seem to be multiple separate issues here:

  • the timing channel introduced by lazy switching
  • the size of the TCB
  • paying the context-switch and memory cost when you are not using the FPU

The timing channel could easily be closed with a much smaller change (forcing a save/restore at domain switch) and would then only impact systems that care about information-flow control (i.e. where there is more than one infoflow domain). This is independent of having or not having an FPU object.

For the size of the TCB, there could be other changes that lead to a larger size reduction. I have not looked into it more concretely yet, but @kent-mcleod2 mentioned in passing that rearranging where the cap storage sits in the TCB may well save more space than a separate object would. There were problems with rearranging that layout from the verification perspective, but it would definitely be worth exploring, because when that layout was designed, the FPU was not supported on any architecture and would not have entered the discussion.

For not paying for the FPU when you don’t use it, what @axel posted is relevant. If an always-on FPU turns out to be cheap and the architecture spec recommends it, the need for saving that cost may be less pressing than it appears now (and the memory cost might be mitigated by a different design for the TCB). Of course that argument is architecture-dependent, but AArch64 and RISC-V would probably be the main architectures that should drive this (happy to be contradicted here by systems people; this is just what I have observed as the plan for the future from other discussions).

I think the always-on comparison benchmark that @axel suggested would definitely help to make a more informed decision on this one.