Summary
This RFC discusses the implementation of a new mapping primitive, protectN (and by extension unmapN), as well as changes to page_map that enable more seamless use of the stale capabilities which the new unmapN primitive produces. A related proposal is to move mapping invocations to the vspace: the current page_map is an invocation on a frame cap which takes a vspace cap, whereas the new page_map is an invocation on a vspace cap which takes a frame cap.
The images I have provided are a little difficult to see in discourse but if you open them in a new tab, they should be okay.
Motivation
Protection is a commonly used memory management technique to reduce the availability of a page in an address space. The way a user currently protects a range of pages in seL4 is by remapping each page individually using page_map. This can be very inefficient: each time a frame is protected, a switch into the kernel is required to make the appropriate changes to the page table and flush the necessary TLB entries. These overheads quickly add up, which is why a more efficient protectN primitive that batches protection to amortize these costs is an important feature.
Adding a new, more efficient protectN primitive will improve the performance of several different applications, as protection is a commonly used operation, with notable examples in the implementation of copy-on-write (COW) and garbage collectors. More info can be found at https://dl.acm.org/doi/10.1145/106972.106984. The diagram below shows copy-on-write: the entire parent address space is protected to read-only after a fork and then unprotected as required when making copies.
*(Figure: the address space before fork(), after fork(), and after write().)*
Map is changed to make unmapN easier to use, and we move it to be an invocation on the vspace: at a high level, it seems more logical for a mapping to be invoked on a vspace, passing in the capability of the frame to use, rather than the opposite, where the mapping is invoked on the frame, passing in the vspace. If this change were accepted, the same would be done for all levels, i.e. pud, pd and pt would all be changed so that to map them, you would invoke the vspace and pass in the capability to each of these objects instead of the current way.
Guide-level explanation
The implementation will be a new mapping primitive which is invoked on the top-level paging structure, i.e. the vspace, which differs between architectures (the pgd on aarch64, the pml4 on x86_64). The invocation will be called RangeProtect and takes three parameters: the start and end addresses of the range to be protected (both must be page aligned), and the minimum rights to set the pages to.
To further explain the rights system, say that the provided argument is seL4_CanRead. Any pages in the range which are currently mapped as seL4_ReadWrite will be protected down to seL4_CanRead. Nothing happens to pages which are already seL4_CanRead or seL4_NoRights. Invoking this method with seL4_AllRights doesn't make sense, as setting the minimum protection level to all rights would have no effect on any mapping. Instead, we redefine this so that if the provided rights are seL4_AllRights, RangeProtect will behave as unmapN instead and erase/invalidate the affected page table entries.
There is no guarantee that this new invocation will protect the whole range at once. For large ranges (depending on the structure of the page table), it may reach a certain point before returning to the user with the number of pages it was able to protect, as well as the address after the last page that was looked at; the user can then perform the system call again with that address as the start of the range. If the address of a small page is given but no such page exists because a large page occupies its location, the limit of the range will be exceeded and the large page will be protected in its entirety.
An important note about unmapN is that because it works on the page table only, it results in stale capabilities. A stale capability is the result of a mapping which no longer exists but is still held in a frame capability. In the diagram below, frame_cap1 becomes stale due to its mapping being overwritten by another.
As unmapN leaves these stale capabilities, with the current implementation of map it would be necessary to unmap the capability explicitly before the user could reuse the frame and map it at another address. Therefore, a new version of map has been implemented which allows stale frames to be identified and reused within map itself instead of having to be explicitly unmapped after they become stale.
Reference-level explanation
protectN is essentially a remap which iterates through a bounded number of pages (of various sizes) and changes their relevant page table entries to protect/unmap them before returning to the user. This is relatively straightforward: starting at the beginning of the range, it checks for each address whether a page is mapped there and what size it is. If a page exists at the address, it determines its size, modifies the page table to protect the page, and increments the current address by the appropriate amount so that it points to the next address after the page which was protected. If no page is mapped at the virtual address, it finds the lowest-level page table structure and considers the next spot in the range where a page could be. Instead of flushing the TLB per page, we do a single flush at the end of the iteration before returning to the user the number of pages which were protected and the virtual address after the last page which was iterated through.
When we unmapN, we simply set the page table entries to empty entries, as if nothing was mapped there. As mentioned before, this results in stale capabilities, as the frame capabilities which were used to map these pages are unaffected by unmapN and still hold a reference to the mapping. With the current version of map this is problematic, as mapping with the frame capability will fail unless the address being mapped to is the same as before. The new version of map attempts to solve this by doing a lookup of the address which the frame capability is mapped to. It looks for the relevant page table entry and, if it exists, checks whether this entry holds a reference to the unique physical address of the frame capability being used in the mapping. If it does, then the capability is not stale and cannot be remapped without an unmap; but if the page table is not consistent with the mapped address stored in the capability, the capability has become stale and can safely be overwritten and reused for another mapping.
Drawbacks
- Stale caps are used as a feature, which is not something which has been done much in seL4.
- Moving invocations to the vspace overloads its functionality, so that the vspace has several invocations while the lower-level paging structures like page directories, page tables and pages become more bare, as they will only be used for unmapping.
- Moving the new map to the vspace can make it so that the vspace has too many responsibilities.
- The solution of protecting the entire first/last page even if it falls partially outside the range, due to a page size discrepancy between the provided address and the page actually mapped at that location, is sketchy.
Rationale and alternatives
- Initially there was no cap on the number of pages which were iterated through by the kernel in a single invocation, meaning the whole range could be protected in one go. This was problematic because there is no standard method of kernel pre-emption in seL4, so with a large/dense enough range the invocation would stay in the kernel for hundreds of thousands of cycles. With the limit on the number of pages, the per-page performance is worse than before, but the length of time the invocation spends in the kernel is much more reasonable.
- Another design which was considered uses a start address and a number of pages instead of a virtual address range. This was discarded because it was painful to use: it required a deep understanding of the page table structure and, if not used carefully, could easily unmap regions which the user did not intend to be unmapped, as the region it would unmap depended on which page table structures were present (for efficiency, large address regions can be skipped if certain levels of the page table are not defined).
- It may be better to use kernel pre-emption as opposed to returning to the user, as this is easier to use from the user's side, but I was unsure how to do this as the state needs to be saved somewhere and I haven't seen any examples of kernel pre-emption.
- An alternative is that if a page begins or ends outside the range but is partially in the range, it is not counted as in the range and will not be protected. The problem with this is that it might be unclear which pages are affected and which are not, but this is also a problem, to a lesser extent, with the current design.
Prior art
- Linux provides mprotect(), which is essentially protectN.
- Linux also provides munmap(), which is essentially unmapN.
- https://dl.acm.org/doi/10.1145/106972.106984
Unresolved questions
- I think it might be better to explicitly check the first and last pages of the range prior to iteration, to ensure that they exactly match the page size mapped there and that it is therefore impossible to leave the bounds of the range.
- Should there be any provision to deal with VM attributes as well, e.g. execute rights?