Pre-RFC: ProtectN

Summary

This RFC discusses the implementation of a new mapping primitive, protectN (and by extension unmapN), as well as changes to page_map which enables more seamless use of stale capabilities, which are utilized by the new unmapN primitive. Another related proposal is to move mapping invocations to the vspace instead of as they currently are i.e. current page_map is an invocation on a frame cap which takes in a vspace cap whereas the new page_map is an invocation on a vspace cap which takes in a frame cap.

The images I have provided are a little difficult to see in discourse but if you open them in a new tab, they should be okay.

Motivation

Protection is a commonly used memory management technique to reduce the availability of a page in an address space. The way a user would currently protect a range of pages in seL4 is by remapping each page individually using page_map. This can be very inefficient, as each time a frame is protected, a context switch to the kernel space is required to make the appropriate changes to the page table and flush the necessary TLB entries. These overheads can quickly add up, which is why a more efficient protectN primitive which batches protection to amortize the related costs is an important feature.

Adding a new, more efficient protectN primitive will improve the performance of several different applications, as it is a commonly used operation, with notable examples being in the implementation of copy-on-write (COW) and garbage collectors. More info can be found https://dl.acm.org/doi/10.1145/106972.106984. The diagram below shows copy on write and how the entire parent address space is protected to read after a fork and then unprotected as required when making making copies.

before fork() after fork() after write()
COW_prefork COW before COW after

Map is changed to make unmapN easier to use and we move this to be an invocation on the vspace as it seems more logical at a high level for a mapping to be invoked on a vspace, taking in the capability of the frame to use as opposed to the opposite, where mapping is invoked on the frame passing in the vspace. If this change was accepted, the same would be done for all different levels i.e. pud, pd, pt would all be changed so that to map them, you would invoke the vspace and pass in the necessary cap to each of these objects instead of the current way.

Guide-level explanation

The implementation will be a new mapping primitive which is invoked on the top-level paging structure i.e. the vspace, which is different for all architectures i.e. pgd aarch64, pml4 x86_64. The invocation will be called RangeProtect and takes in three parameters. These are the start and end addresses of the range which is to be protected (both should be page aligned), and the minimum rights to set the pages to

To further explain the rights system, say that the provided argument is seL4_CanRead. At this, any pages which are currently mapped as seL4_ReadWrite in the range will be protected to seL4_CanRead. Nothing will happen to pages which are already seL4_CanRead or seL4_NoRights. Invoking this method with seL4_AllRights doesn’t make sense, as setting this to the minimum protection level will have no effect on any mapping. Instead, we redefine this so that if the provided rights are seL4_AllRights, RangeProtect will behave as unmapN instead and erase/invalidate the affected page table entries.

There is no guarantee that this new invocation will protect the whole range at once. For large ranges (depending on the structure of the page table), it may reach a certain point before returning to the user with the number of pages it was able to protect as well as the address after the last page that was looked at, at which point the user can perform the system call again with the new address as the start of the range. If there is ever a situation that the address of a small page is given but this page does not exist as there is a large page in its location, the limit of the range will be exceeded, and this page will be protected in its entirety.

An important note about unmapN is that because it works on the page table only, it results in stale capabilities. A stale capability is the result of a mapping which no longer exists but is still held in a frame capability. In the diagram below, frame_cap1 becomes stale due to its mapping being overwritten by another.

As unmapN leaves these stale capabilities, with the current implementation of map, it would be necessary to unmap the capability explicitly before the user is able to re-use the frame and map it to another page. Therefore, a new version of map has been implemented which allows stale frames to be identified and reused within the map itself instead of them having to be explicitly unmapped after they become stale.

Reference-level explanation

protN is essentially a remap which iterates through a fixed number of pages (of various sizes) and changes their relevant page table entry to protect/unmap them before returning to the user. This is relatively straightforward, as it checks for a particular address (starting at the beginning of the range), whether a page is available and what size it is. If a page exists at the address, it determines its size, modifies the page table to protect the page and increments the current address by the appropriate amount so that it is pointing to the next address after the page which was protected. If no page is mapped for this virtual address, it finds the lowest level page table structure and considers the next spot where a page could be in the range. Instead of flushing the TLB for each entry, we only cleaning up the TLB per page, we do a flush at the end of the iteration before returning the number of pages which were protected and the virtual address after the last page which was iterated through to the user.

When we unmapN, we just set the page table entries to an empty entry like nothing was mapped there. As mentioned before, this results in stale capabilities as the frame capabilities which were used to map these pages are unaffected by unmapN and still hold a reference to the mapping. With the current version of map, this is problematic as mapping with the frame capability will fail unless the address being mapped to is the same as before. The new version of map is the same but attempts to solve this by doing a lookup of the address which the frame capability is mapped to. It looks for the relevant page table entry and if it exists, it checks whether this entry holds a reference to the unique physical address of the frame capability being used in the mapping. If it does, then the capability is not stale and cannot be remapped without an unmap, but if the page table is not consistent with the mapped address stored in the capability, that means the capability has become stale and can safely be overwritten and reused for another mapping

Drawbacks

  • Stale caps are used as a feature, which is not something which has been done much in seL4.
  • Moving invocations to vspace instead of themselves overloads the functionality of the vspace so that it has several invocations while the lower-level paging structures like page directories, page tables and pages are made more bare as they will only be used for unmapping.
  • Moving the new map to the vspace can make it so that the vspace has too many responsibilities.
  • The solution of protecting the entire first/last page even if it falls out of the range due to page size discrepancy between provided address and actual mapped page at location is sketchy.

Rationale and alternatives

  • Initially there was no cap on the number of pages which were iterated through by the kernel in a single invocation, meaning the whole range could be protected in one go. This was problematic because there is no standard method of kernel pre-emption in seL4 and when a large/dense enough range was used, this would mean that the invocation would stay in the kernel for hundreds of thousands of cycles, which is very problematic. With the limitation on the number of pages, the per-page performance is worse than before but the length of time the invocation spends in the kernel is much more reasonable.
  • Another design which was considered is using a start address, number of pages approach instead of a virtual address range approach. This was discarded because it was painful to use - it required a deep understanding of the page table structure and could easily unmap regions which the user did not meant to be unmapped if it was not used carefully, as the region it would unmap depended on which page table structures were present (for efficiency, large address regions can be skipped if certain levels of the page table are not defined).
  • It may be better to use kernel pre-emption as opposed to returning to the user as this is easier to use from user-side, but I was unsure how to do this as the state needs to be saved somewhere and I haven’t seen any examples of kernel pre-emption.
  • An alternative is that if the page does begins/ends outside the range but is partially in the range, it is not counted as in the range and will not be protected. The problem with this is that it might be unclear which pages are affected and which are not, but this is also a problem to a lesser extent with the current design.

Prior art

Unresolved questions

  • I think it might be better to explicitly check the first and last pages of the range prior to iteration to ensure that they match exactly to the page size there to ensure that it is impossible to leave the bounds of the range.
  • Should there be any provision to deal with VM attributes as well i.e., exec rights?

Thanks for the detailed write-up @alwin-joshy!

So as this is a draft RFC at this stage, here are some comments I have:

  • If you have an implementation of this already, then it would be good to link as it would provide more concrete information about all of the implementation details.
  • It would be helpful to signpost how the 3 different changes interact with each other. It looks like changing page_map to use the vspace cap instead of the frame cap as the primary cap isn’t required to enable the implementation of protectN and unmapN and it’s more of an aesthetic change in order to group all mapping and unmapping operations under the single vspace cap, but stating what dependencies there are between the different changes would help understand them better.
  • Considering it in isolation, changing Page_Map to swap the two caps around probably wouldn’t be encouraged as it would require updating a lot of existing dependent code and it’s not clear what the benefit would be unless you start considering cap Virtualization designs.
  • Expanding the use of stale caps means expanding the scope of this caveat: seL4/CAVEATS-generic.md at master · seL4/seL4 · GitHub. The more instances of stale caps there are, the more difference there is between the system policy structure defined by the capability derivation tree and what the hardware is actually enforcing.
  • Are there already existing GC implementations or applications heavily using COW sharing that are too slow without these changes?
  • I don’t quite understand how the TLB flushes are being batched.
  • Is unmapN as important as protectN? Issues with caveats and flow-on changes to Page_Map are due to unmapN creating stale caps whereas only adding protectN and leaving the mappings in place when downgrading from seL4_CanRead to seL4_NoRights would be the same as if the downgraded rights had been given to Page_Map in the first place and not create any new stale caps.

Without good examples of what applications this is improving, it’s hard to judge how much benefit the performance optimization gives in exchange for the cost of merging it into the existing verified configurations, but protectN by itself looks like it could be a good addition.

Hi Kent, thanks for your feedback.

If you have an implementation of this already, then it would be good to link as it would provide more concrete information about all of the implementation details.

I have an implementation, but it is still rough around the edges and messy. You can find it here: New map pr by alwin-joshy · Pull Request #4 · alwin-joshy/seL4 · GitHub

It would be helpful to signpost how the 3 different changes interact with each other. It looks like changing page_map to use the vspace cap instead of the frame cap as the primary cap isn’t required to enable the implementation of protectN and unmapN and it’s more of an aesthetic change in order to group all mapping and unmapping operations under the single vspace cap, but stating what dependencies there are between the different changes would help understand them better.

This is correct. Changing map to be a vspace invocation is a purely aesthetic change. However, the changes to the logic of map which enable it to more easily account for stale caps relate with the usage of unmapN. The new version of map can be useful for other situations where stale caps can be produced in seL4, with the two I can think of being page table deletion and overmapping. Both of these leave stale caps which can be more seamlessly re-used by the new version of map. unmapN doesn’t need the new map to function, but without it - it becomes pointless. This is because an explicit unmapping of the frame capability will be required after an unmapN to allow for the associated frame cap to be reused. protectN does not have any dependencies and will work without any other changes.

Considering it in isolation, changing Page_Map to swap the two caps around probably wouldn’t be encouraged as it would require updating a lot of existing dependent code and it’s not clear what the benefit would be unless you start considering cap Virtualization designs.

I agree, I found this to be annoying when refactoring code to swap the invocation around, especially on x86. There is little practical benefit to this change besides what is, in my opinion, a more logical presentation. I agree that it would definitely be a pain to update and completely understand why this may not be viable.

Expanding the use of stale caps means expanding the scope of this caveat: seL4/CAVEATS-generic.md at master · seL4/seL4 · GitHub. The more instances of stale caps there are, the more difference there is between the system policy structure defined by the capability derivation tree and what the hardware is actually enforcing.

This is true, but is it really problematic as long as it is clearly defined? When the user uses unmapN on a range of pages, they expect the frame caps associated with the range to become unmapped and these changes just allow them to use the caps as if they were actually unmapped. I’m not sure if I understood exactly what you meant by this so maybe I’m talking about something else entirely.

Are there already existing GC implementations or applications heavily using COW sharing that are too slow without these changes?

I have not tried any specifically, but there is a very significant performance improvement for a range of pages. For example, the cost of prot1 on AARCH64 is ~600 cycles whereas the cost per-page of protN approaches 90 cycles for large ranges.

I don’t quite understand how the TLB flushes are being batched.

When doing single protection, the usual method (on ARM at least) is to flush the specific entry referring to the related page from the TLB. When doing a protectN, it may be more reasonable to enforce that using protN opts the user into flushing all the entries from the ASID associated with the vspace one time instead of flushing each one line by line. This definitely makes protN faster, but the negative consequences can only really be understood by macro-benchmarks to see how this affects the performance of the system overall, which I haven’t done many of.

Is unmapN as important as protectN? Issues with caveats and flow-on changes to Page_Map are due to unmapN creating stale caps whereas only adding protectN and leaving the mappings in place when downgrading from seL4_CanRead to seL4_NoRights would be the same as if the downgraded rights had been given to Page_Map in the first place and not create any new stale caps.

This is completely true, protectN has nothing to do with map as protection has no effect on the capability used. I would agree with you that protectN has more uses than unmapN, but I think that unmapN can still be a useful function. For example, in a scenario where a server has a pool of frames which it allocates to clients, it can be very inefficient to have to iterate through the client’s page table and unmap every single frame capability one by one in the case that the client no longer needs them. There are a few things that can be done to make this more performant, and both involve stale capabilities and the new version of map. The first is if the rest of the page table structrues i.e. pgd, pud, pd, pt are not required. These can all be cleaned up and destroyed, in which case the frame capabilities will become stale and can be reused with the new version of map. The other, if the page table structures are still required, would be to use unmapN to more efficiently unmap the pages before reusing the frames with the new version of map.

One additional concern of mine about the protectN proposal is that it introduces a confused deputy problem for some designs:

Currently, if there’s a process launcher A and a process B, if B given a cap to its own VSpace it can use it to create new mappings. If B’s vspace already has any frames mapped, unless B also has caps to those frames it can’t change the mappings. This allows A to set up sections of memory such as .text + .data + .bss and “lock” them in by not giving the frame caps holding the mapping info to B. This is how all CAmkES components are loaded at the moment.

With unmapN B gets a new way to delete these mappings when it couldn’t before. With protectN B can now downgrade .data and .bss mappings that were originally locked as ReadWrite and make them ReadOnly when it couldn’t before.

Maybe there’s a fault handler thread in B that expects to be able to run some code with it’s own stack. It assumes that when it’s running, the other thread isn’t running because it’s faulted. It relies on the stack, .text + rodata sections to be able to reset corrupted state of the component to be able to recover from crashes. It currently can rely on those mappings and content of those mappings being protected by the kernel and by A not giving the mapping Frame caps to B.

With protectN, the other thread in B can request the kernel to downgrade any ReadWrite mapping to a ReadOnly mapping, including the fault handler thread’s stack region. This access control violation isn’t currently possible but would become possible because the kernel would no longer require proof of control of the mapping cap to be able to downgrade it’s access permissions.

That’s an interesting point. Could B not already break these mapping if it had access to the vspace cap? AFAIK, all architectures except RISCV already support overmapping, meaning that an existing mapping can always be overwritten by a new one. So in this way, if B has the vspace cap as well as the ability to independently create new mappings, would it not be able to overwrite the .data and .bss mappings with new ones that have whatever permissions it wants?

That is a very good point, I initially hadn’t realised that you need only the VSpace cap for protectN and unmapN. That is a fairly major departure from the current API and access control regime. You can currently unmap with a stale cap, but only if you have actually access to that cap. With the new calls, strictly less authority is needed and the set of policies you can express is strictly smaller.

This might be fixable with a requirement of presenting caps to all pages that are touched in those calls. That wouldn’t have to be in syscall arguments, you could require them to be there in some consecutive from in the CSpace.

I’m not convinced at all that a difference between 600 cycles and 90 cycles per page is even a measurable difference for a GC end user. I think we do need macro benchmarks for this feature before we can consider it.

Apart from the decrease in policy expressiveness (i.e. you do need the frame caps), both features can currently be implemented fully at user level (even the swap of the caps could be simulated in a library). Preemption would be trivially handled by the kernel, complexity could be fully hidden from the user. The only reason to add this to the kernel is performance, so the performance argument for something like this should be rock solid.

The difference is that to remap you need to present the VSpace cap and the (maybe new) frame cap.

So if I’m not misunderstanding, the problematic situation would be where B has access to the vspace cap but not to any frame caps which it could use to overwrite the mappings? And in this case, this proposed design of protN would let them overwrite the access rights of the mappings anyway.

This is what I interpreted unmapN/protectN to be at first, but after discussing the design further with Kevin and Gernot, I think we came to the conclusion that it was not practical to expect the frame caps to be consecutive in the cspace. However, this design would likely not be especially difficult to implement if this assumption was made.

That is correct. I was initially thinking the current design is more strict, but since the merge of Map and Remap, we don’t demand any more that the new cap presented for Map points to the same frame (which Remap did require). So now you would require no frame caps at all be present in the client that you want to restrict.

I think this may have actually been an oversight in the Map/Remap merge. Those didn’t go through more rigorous RFC discussions yet, and I’m not sure we thought this aspect through in depth.

Yeah, this is why I was confused I think. Map currently does not require you to validate that you have access to the cap which was originally used for the mapping when you do a remap.

:+1: Only RISCV still does.

So I think there are several separate decisiosn to be made:

  • was the Map/Remap merge accidentally too permissive? (If yes, the rest should check as RISC-V, if no, RISC-V should be updated to not check, as the rest does)
  • do we want/need a protectN call?
  • do we want/need a unmapN call?
  • do we want/need to swap the invocation for Map?

I think all of these are largely independent and could be separate RFCs. I.e. you can still do protectN if you don’t have any of the other 3, etc.

1 Like

Oh, also, before we let that become ingrained too much: I don’t think protectN is a good name in the seL4 context (protect what from what? are other things not protected?). It might make sense in the Linxu context, but for seL4 it describes a possible user intent, but not the mechanism like the other API names.

Maybe something like ReduceRigthtsN? Not sure. Probably better to figure out the rest first.

1 Like

Removing Remap only removed the check for Aarch64. Aarch32 and x86’s implementation appear to have allowed Page_Map to overmap existing mappings (leaving stale mapping caps) since at least 3.0.0 (which contradicts what I claimed before) while RISC-V still has a check and returns an error if the mapping entry already exists. All architectures prevent a new mapping of a different size from overwriting an existing mapping. Stale frame caps are part of the caveat Gerwin refers to, but the way that the caveat is worded implies that it’s not supposed to be possible to create them via overmapping, only by deleting a VSpace.

Yes, that is what we were thinking when we wrote it. We can dig deeper if there was a spec change there at some even earlier time, but it is entirely possible that we’ve missed it. It wouldn’t show up in access control proofs, because they have to deal with the unmap case already so would probably allow it.

The manual also describes that an error is returned if a mapping already exists at the supplied vaddr: seL4/sel4arch.xml at master · seL4/seL4 · GitHub