Testing infrastructure

We’ve been discussing testing infrastructure a lot in the past few months and this topic aims to collect that discussion.

Currently we have the following setup:

  • D61/TS runs internal Bitbucket and Bamboo servers that run:
    • pull request checks (style, compile, licenses, etc)
    • continuous integration tests (either on the master branch of a specific repo or, more commonly, on repo collections/manifests, e.g. sel4bench, sel4test, camkes, verification). If any repo in a collection changes, the entire collection is tested. Sometimes the repos are out of sync (simultaneous PRs to multiple repos) and tests break because of that, and sometimes something actually breaks that was not caught in a pull request check
    • continuous release to GitHub: if a subset of the CI tests passes, the master branches of the corresponding set of repos are automatically published to GitHub.
    • point releases (“actual” seL4 releases) automatically tag the corresponding repos and create manifests, all after the corresponding checks are passing
    • verification has a “testboard” check that can run any verification manifest, as long as Bamboo can pull from the corresponding repos (on GitHub or Bitbucket)
    • the systems team has a slightly more ad-hoc but faster setup where you can run a specific manifest manually (is this a correct description?)

This kind of works, but contributing via GitHub is painful:

  • if you make a PR on GitHub, you do not see the actual current state of the master branch
  • you have no visibility of the current CI status (are hardware tests passing/failing on master etc)
  • this creates a stronger perception of slowness than there actually is, and it can be quite frustrating to think everything is ready to merge while nothing happens for days or weeks, because some other repo somewhere is breaking things, or some hardware is flaky and the GitHub deployment didn’t work for some reason, but nobody noticed.

So, the proposal is to get rid of the bitbucket step and make GitHub the source of truth for all seL4 repos. This is in line with the handover to the foundation, who should be running the main copy of the repos in any case.

What do we need to do to make this happen and to make it a nice and efficient experience for contributors and developers?

I should add that the verification repo has already made this step: GitHub is the main repo, and pull requests get merged there directly.

The GitHub CI checks there are a nicer development experience than Bitbucket/Bamboo, with the caveat that the GitHub test runners are not large enough for large proofs. So we still depend on Bamboo for that and for full continuous integration checking.

What we have not solved there yet are simultaneous PRs to, say, seL4 and verification (e.g. verifying an API change). The tests for that currently just fail. You can still request a testboard run from Bamboo with the configuration you want in the end, but the results are only D61 internal. It would be nicer to do large proofs directly on AWS and develop a more general mechanism for simultaneous PRs. I’ll split off the AWS discussion, because it is verification specific, but we should think about simultaneous PRs here.

And one final reply to myself :slight_smile:

  • AFAICS all software tests on Bamboo can be ported to GitHub without blockers (it’s just work)

  • hardware tests are harder, proposals welcome

  • proposal for simultaneous PRs: detect them by checking all the repos in the manifest that is being tested for a branch with the same name as the PR branch. I.e. if we have a PR in seL4 on branch lsf37/feature-x, also check the l4v repo for a branch named lsf37/feature-x and use it in the test manifest when it exists (a rough sketch of this follows below)
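
To make the branch-matching proposal concrete, here is a minimal Python sketch of the idea. The repo list, URLs, and the fall-back to master are assumptions for illustration, not existing tooling:

```python
# Sketch: pick per-repo revisions for a test manifest by matching the PR
# branch name across all repos in that manifest (assumption: a matching
# branch name is enough to identify "simultaneous" PRs).
import subprocess

def branch_exists(repo_url: str, branch: str) -> bool:
    """True if `branch` exists on the remote at `repo_url`."""
    out = subprocess.run(
        ["git", "ls-remote", "--heads", repo_url, branch],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())

def pick_revisions(pr_branch: str, manifest_repos: dict) -> dict:
    """For each repo in the manifest, use the PR branch if it exists there,
    otherwise fall back to master."""
    return {
        name: (pr_branch if branch_exists(url, pr_branch) else "master")
        for name, url in manifest_repos.items()
    }

if __name__ == "__main__":
    repos = {  # hypothetical subset of a verification manifest
        "seL4": "https://github.com/seL4/seL4.git",
        "l4v": "https://github.com/seL4/l4v.git",
    }
    print(pick_revisions("lsf37/feature-x", repos))
```

The output could then be substituted into the repo manifest that the CI run checks out.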

I suggest looking into tooling like https://buildkite.com, which is free for open source and can delegate to AWS or to behind-the-firewall infrastructure. And it integrates with GitHub. I haven’t used it in anger though, so YMMV.

A machine_queue (or something similar) opened to the public would be very useful for testing. A user could test their image by submitting it to a specific hardware platform through machine_queue and getting the test results back.

The machine_queue may even support hardware platforms contributed by users so that we could have a shared testing facility.
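
As a purely hypothetical illustration of what such a public machine_queue interface could look like (the endpoint, parameters and authentication below are all made up, not an existing API):

```python
# Hypothetical machine_queue client: submit an image to a hardware platform
# and fetch the results. None of these endpoints exist today.
import requests

MQ_URL = "https://machine-queue.example.org/api"  # placeholder URL

def submit_image(image_path: str, platform: str, token: str) -> str:
    """Upload an image for one hardware platform; return a job id."""
    with open(image_path, "rb") as image:
        resp = requests.post(
            f"{MQ_URL}/jobs",
            headers={"Authorization": f"Bearer {token}"},
            data={"platform": platform},
            files={"image": image},
        )
    resp.raise_for_status()
    return resp.json()["job_id"]

def fetch_results(job_id: str, token: str) -> dict:
    """Retrieve the test results (or current status) of a submitted job."""
    resp = requests.get(
        f"{MQ_URL}/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()
```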

As far as bootloaders go, @apavey wrote RFC-4 mainly because he and I were working on Polarfire support, and the bootloader on that board requires the image to go through a tool called “bin2chunk” and supports either booting to U-Boot or booting the bin2chunked image directly. The process was easier when we had access to variables in the CMake environment. We got to hacking on the seL4 helper tools and thought that if we laid the groundwork for more general bootloader support, it could be beneficial to more than just our platform. It required taking the bbl step out from where it is now and moving it down to the end of the rootserver CMake script, after the image is created. The RFC was originally created because this required some big changes and added the idea of supporting bootloaders. We (Alex and I) hadn’t realized that the RFC process was mainly just for kernel topics. In hindsight, maybe we should have brought the issue up here, on Discourse.

I got thinking, though, that it might be better to take the bbl and all the bootloader stuff out and put it in its own CMake project. It could be something that is run with a function like “BuildBootLoader()” that just grabs the final image and other global config options. Then you can keep the actual elf-loader/kernel/userspace image separate from the bootloader or bootloader+image binaries. It would also give a good place to add support for FIT images and things like that.

This might be useful for running tests on hardware, where you might want to be certain that the image you are running is being run the way you expect. It might open up the space for future bootloaders to be supported in this infrastructure.

I suggest looking into tooling like https://buildkite.com, which is free for open source and can delegate to AWS or to behind-the-firewall infrastructure. And it integrates with GitHub. I haven’t used it in anger though, so YMMV.

This looks very interesting; it might be exactly what we need.

Not sure I’ll get to it before the meeting on Fri, but I might try setting up a test account and play with it a bit.

Looking at Buildkite, it seems like the agent is free and open source, but it still relies on buildkite.com to do scheduling and coordination.

At some point in the not-so-distant past I had been working on local git infrastructure for mirroring seL4 repo manifests locally, for easy building and pushing to GitHub, and had considered also building up testing/CI infrastructure that runs locally.

I think it would be nice in such cases if this kind of setup could be done entirely locally, without the external coordinator. The tools I wrote for local mirroring of repositories are at https://pullreqr.github.io/, and I had been looking at https://drone.io as one potential option for local CI, but I haven’t really investigated anything thoroughly.

Anyhow I think it might be worthwhile to ask:

  • whether others also want/require purely on-premises testing
  • if so whether that can share infrastructure tooling with the sel4-projects testing infrastructure

I only looked a little, but so far haven’t seen anything besides the Buildkite agents being open source, and didn’t find anything about installing the coordinator.

anyhow, that is just my 0x2 cents really.

Using cloud services should just be an option and exists in parallel to a well scripted pipeline that can run locally. I think that the possibility of on-premises testing is important, because:

  • it avoids a cloud service lock-in
  • no surprises with (hidden) costs, licensing terms or potential export restrictions
  • the availability is guaranteed, dependencies are clear and under full control
  • it allows using everything in air-gapped environments

We should first clarify technically what the cloud services or various build robots provide that we don’t want to re-write or can solve with an offline tool. Next we should identify where we have to write connectors/adapters for a specific build robot that might result in a potential lock-in. Ideally, we always stick to the minimum requirement that things can run well scripted (bash, python) in a Docker container on a Linux VM. The steps

  • environment step
  • checkout
  • build
  • test run preparation
  • test run
  • postprocessing
    are clearly separated. Each one can be tailored separately and has minimal hidden dependencies on the previous or next step. In particular, repeating (or even starting) the pipeline from any stage should be doable, given that a snapshot of the state is available. A rough sketch of this stage separation follows below.
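
A minimal sketch of that separation, assuming each stage is a stand-alone script under a hypothetical ci/ directory (names and layout are illustrative only):

```python
# Sketch: each pipeline stage is its own script; the pipeline can be
# (re)started from any stage, given a snapshot of the earlier state.
import subprocess
import sys

STAGES = [
    "environment",   # e.g. build/pull the docker image
    "checkout",      # fetch sources according to the manifest
    "build",         # compile kernel + userspace images
    "test-prep",     # copy images, configure boards or simulators
    "test-run",      # run the actual tests
    "postprocess",   # collect logs, produce reports
]

def run_stage(name: str) -> None:
    """Run one stage as a stand-alone script, e.g. ci/build.sh."""
    subprocess.run([f"ci/{name}.sh"], check=True)

def run_pipeline(start: str = "environment") -> None:
    """Run all stages from `start` onwards."""
    for stage in STAGES[STAGES.index(start):]:
        run_stage(stage)

if __name__ == "__main__":
    run_pipeline(sys.argv[1] if len(sys.argv) > 1 else "environment")
```

Because the stages only hand over state through files on disk, the same scripts could be driven by Bamboo, Buildkite, GitHub Actions, or a plain shell session.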

I want to write up a set of more coherent thoughts later, but right now I just have this to add:

PRs on GitHub should be allowed to be merged based on reviewer approval only.

  • Any errors in a particular PR may not be found until after it is merged and so merging a PR isn’t a guarantee of it being absolutely correct.
  • Any testing infrastructure that gets added and is required to pass before a PR can be merged introduces a failure mode that can effectively DOS a PR and prevent it from being merged.
  • The ultimate responsibility for minimizing errors is shared between tooling and the individuals reviewing and allowing the merge. Neither individuals nor tooling are infallible; both can make mistakes.
  • Reviewers are able to make much better judgements about whether a tool is working correctly or not.
  • Errors can always be removed by reverting or patching them.
  • We always have a trusted log of all changes to a repo via Git history so nothing is destroyed when a new change is merged.
  • In practice things end up being merged manually already: Remove HAVE_AUTOCONF guard in sel4/config.h by kent-mcleod · Pull Request #300 · seL4/seL4 · GitHub

I agree with some points for on-premises testing, but it is clearly infeasible for the foundation to do on-premises testing (having no premises or bigger infrastructure). In fact, it will likely become infeasible for anyone to do a full on-premises test of the entire ecosystem if nobody has the full set of all supported boards. D61 used to be able to do that, but there are now supported platforms that D61 doesn’t have hardware for.

That said, it should of course remain possible to run any individual test on premises. In fact, I’d expect that to be the usual development cycle – you change one part of one repository, you test that change with what is relevant for the change, and then you submit a PR.

To enable that with different local test infrastructures, there should be a well-defined test environment, e.g. docker, and scripts that run everything that is needed. I think we are not very far away from that, and I wouldn’t want to move further away from it.
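
For illustration, a one-command local run could be as small as the following sketch; the container image name and the mounted test script are assumptions, not the actual seL4 CI setup:

```python
# Sketch: run the test pipeline inside a well-defined container, so the
# local environment matches what CI uses.
import subprocess
from pathlib import Path

def run_in_container(workspace: Path,
                     image: str = "trustworthysystems/sel4") -> None:
    """Mount the workspace into the container and run the test script there."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{workspace}:/host",
            "-w", "/host",
            image,
            "bash", "-c", "./ci/test-run.sh",
        ],
        check=True,
    )

if __name__ == "__main__":
    run_in_container(Path.cwd())
```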

Things like Buildkite or Bamboo are meant to coordinate and ensure global consistency, as well as release and deployment pipelines. The latter two are only necessary for the foundation. They should build on top of infrastructure that can run in isolation and on premises. I’m not too worried about lock-in for these higher-level coordination tasks, more about integration into the development cycle. For instance, moving from Bamboo to something else is some work, but not a real problem. The problem is running the hardware tests, but that is because of the network security setup (in D61 we’re not allowed to have these boards accessible from the internet), not because of anything to do with Bamboo as a product. I don’t see that this would be different with Buildkite, or Jenkins, or anything really.

Ideally, whatever test provider we use (homegrown, cloud etc) I’d like to get to the stage where we can be sure of, or at least get a good indication of global consistency of all repo collections (manifests) directly from the tests that run on a GitHub PR. We currently don’t have that and never did. I don’t think it’s extremely hard to achieve.
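
One way to get at least an indication of that consistency would be to compare the revisions pinned in a repo manifest against the current master heads on GitHub. A rough sketch, assuming the usual repo manifest XML format and the seL4 GitHub organisation (details are illustrative):

```python
# Sketch: report repos whose pinned manifest revision differs from the
# current master head on GitHub.
import subprocess
import xml.etree.ElementTree as ET

def manifest_revisions(manifest_path: str) -> dict:
    """Read <project name=... revision=...> entries from a repo manifest."""
    root = ET.parse(manifest_path).getroot()
    return {
        p.get("name"): p.get("revision")
        for p in root.findall("project")
        if p.get("revision")
    }

def master_head(repo_name: str) -> str:
    """Current master head of github.com/seL4/<repo_name> (assumed layout)."""
    out = subprocess.run(
        ["git", "ls-remote", f"https://github.com/seL4/{repo_name}", "master"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0] if out else ""

def out_of_sync(manifest_path: str) -> dict:
    """Map repo name -> (pinned revision, current master) where they differ."""
    return {
        name: (rev, head)
        for name, rev in manifest_revisions(manifest_path).items()
        if (head := master_head(name)) and head != rev
    }
```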

The current process at D61 was/is for eventual consistency on the master branch, in the sense that a subset of the tests runs on a PR, and it is possible, for instance, for a PR merge in the seL4 repo to break sel4bench. This was necessary, because some changes require multiple repos to change and we didn’t have a mechanism for making sure they change simultaneously (or did not even necessarily want them to change simultaneously). This is fine, because the repo manifests that include both seL4 and sel4bench record which combination of versions was last known to work together, so you can always get a working combination. Additionally, the CI pipeline will only publish the seL4 and sel4bench repo master branches to GitHub when a new working combination has been found, e.g. when a second PR in sel4bench brings it back in sync.

The idea is that this happens quickly, i.e. within a few hours. This last bit is what hasn’t been working so well, sometimes taking much longer, because some hardware test on some board might be flaky or the network had an issue, or some other concurrent PR interfered with a separate issue. With hundreds of tests in total, some long-running, it’s not that unlikely that something glitch-able will glitch.

So far we have had eventual consistency on the master branches of the individual repos, strong consistency on the manifests, and (modulo manual pushes) only working versions visible on GitHub. The idea behind that was that we didn’t want people who don’t use repo to combine master branches and get confused about what works and what doesn’t.

If I understand Kent correctly, he is arguing that if we want to merge on Github, we should also just have eventual consistency on GitHub. I’m beginning to come around to that view, but I think it will be better if that consistency is achieved quickly, i.e. if infrastructure is in place that tells you what is working and what is not, and ideally already at the point where you make a PR (or simultaneous PRs). We could decide to not wait for that infrastructure, though, and build it up incrementally.

We can then separately talk about whether we should merge a PR that breaks something. It is entirely possible to do that, and it may be useful depending on the situation.

We’ll also need something that brings us back towards a working state instead of diverging further.

So, yes-ish, but I think it needs to be clear that this would be a rare exception. For instance, there is usually no reason to ignore gitlint, license checks, DCO, compile tests or things like that. For each individual one there are circumstances where it’s fine to ignore them, but they really should be exceptions. E.g. if code doesn’t compile locally, I’m hard pressed to find a reason it should get in. But if that compile error is because some dependency isn’t in sync or something like that, then yes.

Yes, although that again argues for exception, not rule.

Conversely, a PR that breaks something can DOS everything, and can stop everyone from making progress.

Basically, if it is clearly the test that is wrong, not the code, then I think it is absolutely fine to merge.

It is easy to be wrong about that, though. I did merge one of Benno’s PRs manually a few months ago because it looked absolutely fine and Bamboo was stuck on something else. Turns out I needed to spend about a day of work on verification for it afterwards, because it did break a different architecture and the fix for that was verification visible. It’d have been much better to see that first and do the verification first.

Completely agree, and I think humans should have the last word. It should not be impossible to merge when things fail, just unusual.

Yes, I agree. The main scenario I worry about is for things to drift too far from each other over a longer time.

Yes, we never had a rule against that, and I don’t think we should introduce one. I do think we need the tooling to know what is happening, and to make the right decisions.

Of these steps Axel mentions:

I think environment, build, test run prep, and test run need to be able to run locally.

Checkout is dependent on local infrastructure, and may or may not include pull requests, branches, etc. Post-processing depends on the information you want out of the test, which may be just yes/no or it may be much more. Maybe there should be a defined interface.

Note that tests do not necessarily make sense for a single repo in the seL4 repo collection. Some tests can run on just seL4, e.g. a compile test, but a more meaningful test would be running sel4test in the repo collection defined by sel4test-manifest. The verification repo is bound to a specific version of seL4, as is most other stuff. We currently use repo to define what belongs together in a reasonably flexible and version-controlled way. The tests themselves will have to rely on some directory layout structure, but they should not rely on repo itself.

I think we have most of that in place, actually. There is more duplication that can be avoided, but compared to the actual test code + run, it is reasonably small.

I threw together some diagrams showing:

  • how I currently do on-premises SCM, through a pair of write-write channels which then gets synchronized to github… not sure how similar this is to what d61 is currently doing with jenkins/etc
    I imagine it mainly differs in that I consider github upstream & you consider github downstream (page 1),
  • buildkite-agent is kind of awkward in the topology when using on-premises SCM
    because it requires the on-premises SCM to communicate with buildkite servers and the on-premises agents to poll buildkite servers. (page 2).
  • The above precludes page 3, where you schedule jobs between the SCM, a local job controller, and local agents (while the foundation uses a hosted controller).

In the network topology of pages 3 and 2, the differences really don’t impact the foundation or its consistency guarantees much. It only becomes an issue once you have an isolated on-premises SCM and you want to mirror the Foundation testing environment in isolation. In such cases I don’t expect to be able to run/leverage the buildkite-agents, leading to some duplication of effort and different tooling between downstream/upstream testing environments.

If there were something like page 3, which offered both a hosted and an open-source controller, it would seem to me the most flexible combination, since downstream users could mirror the testing environment the foundation runs on AWS, substituting their own SCM and job controller.

Anyhow, I feel like I’m probably belaboring my point, but I would prefer my local testing environment to mirror as closely as possible the one used by the foundation; the need for the local SCM to contact an external site via webhook in order to coordinate with local agents is perhaps a deal breaker.
Not really the end of the world though I guess…

Thanks for clarifying, I think I understand what you’re trying to achieve now. It would indeed be nice if there was a controller that could just be replicated on premises.

We currently don’t have that with Bamboo, so I don’t think Buildkite would make anything worse – it’d be a step up, but not the ideal solution yet. I should look into drone.io more, though. If it achieves the same and can be deployed on premises, maybe that is a better solution.

If we find such a solution, though, is it likely that everyone will want to use it? I would imagine that if there is a larger organisation involved, that organisation will already have some existing infrastructure, and it’d be more important to be able to port tests to that platform than to replicate the foundation setup.

@axel you seem to be in a situation like that.

If you look at https://github.com/seL4/ci-actions, which is attempting to collect all checks that need to run in more than one repo – is that portable enough? I’ve been trying in there to a) isolate GitHub Action syntax from the actual steps that the test needs to take, and b) (less successfully) abstract GitHub-specific steps into scripts that can be swapped out (i.e. fetch-base.sh for getting the base ref of a pull request could be replaced with something that does the equivalent thing for a different testing platform, but it’d need some refactoring to take explicit parameters instead of relying on GitHub environment variables).
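
As a sketch of that “explicit parameters” refactoring (function names are illustrative, not the actual ci-actions interface): the platform-neutral step takes its inputs as arguments, and a thin GitHub adapter maps the GitHub environment variables onto it.

```python
# Sketch: platform-neutral "fetch the PR base ref" step with a thin
# GitHub Actions adapter on top.
import os
import subprocess

def fetch_base(repo_dir: str, base_ref: str, remote: str = "origin") -> None:
    """Fetch the base ref of a pull request into the local checkout."""
    subprocess.run(
        ["git", "-C", repo_dir, "fetch", remote,
         f"{base_ref}:refs/remotes/{remote}/{base_ref}"],
        check=True,
    )

def fetch_base_github() -> None:
    """GitHub Actions adapter: read parameters from the environment
    variables GitHub provides and delegate to the neutral function."""
    fetch_base(
        repo_dir=os.environ.get("GITHUB_WORKSPACE", "."),
        base_ref=os.environ["GITHUB_BASE_REF"],
    )
```

A different CI platform would then only need its own small adapter around fetch_base.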

What I usually try to avoid is a powerful robot system with unclear cloud dependencies that is not even open source. Currently we have things running nicely with Jenkins scripted pipelines, which is quite transparent and flexible (and sometimes limited and painful). I would not completely rule out setting up another build system for parts of the build/test orchestration, especially if there is a clear benefit and it does not eventually put us at the mercy of some vendor. It should just fit into the building-blocks approach where a GUI is just an add-on.
Another thing is that at the core of all steps there should be scripts (bash, python, …) that can also run stand-alone if chained together properly by some custom solution. Then a developer can easily set up a local pipeline that does a “one-command-build-and-test”, so working fully offline with a board is still conveniently possible.

Multi-repo build pipelines are not really an issue; the repo tool or git submodules, for example, can be used to link all of them. It just needs to be done and maintained by somebody. The more delicate issue here is reproducible builds – especially for older release branches months or years later. This can get tricky, and I think taking complete infrastructure snapshots is still the best option if you don’t have people taking care of keeping older branches working (building, testing). The core thing here seems to be noticing quickly when something in the infrastructure breaks compatibility, so people can work on handling it one way or the other. If there is no requirement for legacy support, things get much easier.

The master branch on GitHub would essentially always be consistent, as it would get updated when sel4test passes and would be pushed at the same time as the manifests. The master at D61 also always had a linear history with all changes being added to the head, so if a particular commit was failing, any fixes would be added on top.

This setup enabled more flexibility in getting things merged and consistent without as much time spent waiting to order merges together. I think it would be useful to retain this and utilise it for external PRs on GitHub. We could achieve this by having a master/main head that acts as the current D61 master, to which all merges are added, and a passing branch that acts as master currently does on GitHub: it strictly follows master and is updated to the latest passing commit when a combination of tests for a sel4test/sel4bench manifest passes.
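
A minimal sketch of how CI could maintain such a passing branch (the branch names and how “all tests passed” is established are assumptions): once the full test set for a manifest succeeds on a master commit, fast-forward passing to that commit, and refuse anything that is not a fast-forward so passing strictly trails master.

```python
# Sketch: advance the `passing` pointer to a tested master commit,
# fast-forward only.
import subprocess

def advance_passing(repo_dir: str, tested_commit: str,
                    branch: str = "passing") -> None:
    git = ["git", "-C", repo_dir]
    # make sure we see the remote's current `passing` pointer
    subprocess.run(git + ["fetch", "origin", branch], check=True)
    # only ever fast-forward: fail if `passing` is not an ancestor
    subprocess.run(git + ["merge-base", "--is-ancestor",
                          f"origin/{branch}", tested_commit], check=True)
    subprocess.run(git + ["push", "origin",
                          f"{tested_commit}:refs/heads/{branch}"], check=True)
```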

So this would be eventual consistency for master and strong consistency for passing. We could do that. This is basically the same as the devel branch proposal, only the default target for PRs is master, not devel. Which is probably less confusing.

What do others think?

We’ll still have the problem that it would be possible to diverge too far on master, but we’ve always had that.


On Github I’d like to see two things:

  • a “stable cutting edge” version that I can check out. Likely that would be master/main.
  • the next candidate in the pipeline, so it’s clear what will come next. This is a kind of “staging”.

PRs should always target master/main; I think that is common practice.

Ideally, master is something that can also be re-tagged as a release later, after more qualification. Or there is a branch trailing off then to make this even “better”. That is why I’m not sure about the passing branch/label. Having that would depend on how things really work at the moment and how much qualification is happening in the background for things that are considered to become master. How long does it take a PR from entering the Data61-internal pipeline to becoming visible in master on GitHub? And how long will this take in the future?

Also, we should not mix passing with a list of “non-standard” things that have been verified to work, e.g. a rarely used board, a long-running test, an obscure configuration. Here we will never have one “passing” stamp, but a lot of stamps that show something got tested somewhere – or flagged as failing somewhere else.

As a side note, what I do not really like are personal development branches on the “official seL4 Foundation repos”. There should only be “staging”, “master” and the past releases. Official feature branches can also exist, but it should be clear that they are still active and not long-obsolete development dead ends or forgotten intermediate branches. For everything else, repo forks should be used and PRs should be filed from them.

This sounds contradictory to me, but maybe I’m misinterpreting “stable”. PRs cannot go to a branch that I would consider stable (having a globally consistent head), that’s the whole problem. If we could have that, we wouldn’t need a distinction.

If we introduced a staging branch, PRs would need to go there first, but I do agree that the common practice is that PRs go to the master branch, which means a staging branch would be confusing for people who have missed that bit of information.

The nice thing about passing is that it would always be a strict postfix of master (as GitHub master is currently a strict postfix of bitbucket master), and nobody but CI would be able to commit to it directly. Releases can then always be made from passing at any time (maybe we should just call it release, but that’s easily changeable).

It currently takes anywhere from 2 hours to multiple weeks, depending on what breaks and whether there are other infrastructure problems (there are plenty of hardware and infrastructure problems right now because of a forced, hurried office move).

The idea is that merges to master would in the future be immediate, and the passing postfix branch, whatever we call it, would update in hours, maybe days (if there are greater changes between multiple repos).

My take is that passing should be for the list of officially supported platforms. If it is not officially supported, we should see a test result (if we can), but it won’t stop anything. If it is officially supported by the foundation or other platform owners, then it’ll have to pass first, before the passing branch pointer is advanced.
