We remove the global rootlog in favor of instantiating the logger as
required in the __init__.py and passing it down as a parameter (of our
AbstractLogger type).
Previously, the XML logging was always present and only created an
output file if a special environment variable was present. We now only
create the XML logger if the environment variable is present, saving us
from logging to XML internally if it is not required.
We add a new logger that allows generating a junit-xml compatible report
listing the subtests used in the nixos integration test. Junit-xml is a
widely used standard for test reports. The report can be used for quick
evaluation of which subtest failed.
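
As a rough illustration of what such a report looks like, here is a
minimal sketch that renders subtest results as JUnit-style XML using
only the Python standard library. The function name `render_junit` and
the shape of its input are assumptions made for this example; they are
not the logger added by this commit.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def render_junit(suite_name: str, subtests: Dict[str, Optional[str]]) -> str:
    """Render subtests as a JUnit-style XML string.

    `subtests` maps a subtest name to None (passed) or a failure message.
    This helper is hypothetical and only shows the report structure.
    """
    failures = sum(1 for message in subtests.values() if message is not None)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(subtests)),
        failures=str(failures),
    )
    for name, failure in subtests.items():
        case = ET.SubElement(suite, "testcase", name=name)
        if failure is not None:
            # A failed subtest carries a <failure> child element.
            ET.SubElement(case, "failure", message=failure)
    return ET.tostring(suite, encoding="unicode")


report = render_junit("demo", {"boot": None, "login": "timed out"})
```

A CI system that understands JUnit reports can then show which
`testcase` carries a `failure` child without anyone reading the full log.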
We use the new AbstractLogger class and separate the XML and Terminal
logging that is currently mixed into one class. We restore the old
behavior by introducing a CompositeLogger that takes care of logging
both to terminal and XML.
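
The split can be pictured with the following minimal sketch. The
method set, the in-memory recording, and the variable name
`EXAMPLE_XML_LOGFILE` are assumptions made for illustration; the real
classes and the real environment variable may differ.

```python
from abc import ABC, abstractmethod
from typing import List, Mapping
from xml.sax.saxutils import escape


class AbstractLogger(ABC):
    """Interface for logger backends (method set is illustrative)."""

    @abstractmethod
    def log(self, message: str) -> None: ...


class TerminalLogger(AbstractLogger):
    def __init__(self) -> None:
        self.lines: List[str] = []

    def log(self, message: str) -> None:
        # A real terminal logger would print; we record to keep this testable.
        self.lines.append(message)


class XMLLogger(AbstractLogger):
    def __init__(self) -> None:
        self.elements: List[str] = []

    def log(self, message: str) -> None:
        self.elements.append(f"<line>{escape(message)}</line>")


class CompositeLogger(AbstractLogger):
    """Fans every call out to all wrapped loggers."""

    def __init__(self, loggers: List[AbstractLogger]) -> None:
        self.loggers = loggers

    def log(self, message: str) -> None:
        for logger in self.loggers:
            logger.log(message)


def build_logger(env: Mapping[str, str]) -> CompositeLogger:
    loggers: List[AbstractLogger] = [TerminalLogger()]
    # Hypothetical variable name: only create the XML logger when requested.
    if "EXAMPLE_XML_LOGFILE" in env:
        loggers.append(XMLLogger())
    return CompositeLogger(loggers)
```

Callers only ever see the `AbstractLogger` interface, so adding or
dropping a backend does not touch the code that emits log messages.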
We do not use the generic "nested" function but introduce a separate
subtest log call. This will later allow us to track subtests and
attribute logs to specific subtests.
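
One way a dedicated subtest call could make this tracking possible is
sketched below. The class `SubtestLogger` and its attributes are
hypothetical names chosen for this example, not the driver's API.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class SubtestLogger:
    """Hypothetical logger that remembers which subtest is active."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.logs_by_subtest: Dict[str, List[str]] = {}
        self.unattributed: List[str] = []

    @contextmanager
    def subtest(self, name: str) -> Iterator[None]:
        previous = self.current
        self.current = name
        self.logs_by_subtest.setdefault(name, [])
        try:
            yield
        finally:
            # Restore the enclosing subtest (or None) on exit.
            self.current = previous

    def log(self, message: str) -> None:
        if self.current is None:
            self.unattributed.append(message)
        else:
            self.logs_by_subtest[self.current].append(message)


logger = SubtestLogger()
logger.log("before any subtest")
with logger.subtest("boot"):
    logger.log("kernel started")
with logger.subtest("login"):
    logger.log("prompt shown")
```

A generic "nested" call carries no such identity, which is why a
separate entry point is needed before logs can be grouped per subtest.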
As the TODO says, this is already included by the script.
If adding a device, including this again here would result in either
two devices being added, or, if they were explicitly named, an error
due to reuse of the name.
Adds a function to wait for a new QMP event with a model filter
so that you can expect specific types of events with specific payloads,
e.g. a guest-reset-induced shutdown event.
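
The idea of a filtered wait can be sketched as follows. The function
name `wait_for_event`, the iterable standing in for the QMP socket, and
the exact payload fields in the example are assumptions made for
illustration.

```python
from typing import Any, Callable, Dict, Iterable, Optional

Event = Dict[str, Any]


def wait_for_event(
    events: Iterable[Event], model_filter: Callable[[Event], bool]
) -> Optional[Event]:
    """Return the first event accepted by `model_filter`, or None.

    `events` stands in for the stream of decoded QMP messages.
    """
    for event in events:
        if model_filter(event):
            return event
    return None


def is_guest_reset_shutdown(event: Event) -> bool:
    # Illustrative payload check; field names are assumed for the example.
    return (
        event.get("event") == "SHUTDOWN"
        and event.get("data", {}).get("reason") == "guest-reset"
    )


stream = [
    {"event": "RESUME", "data": {}},
    {"event": "SHUTDOWN", "data": {"guest": True, "reason": "guest-reset"}},
]
found = wait_for_event(stream, is_guest_reset_shutdown)
```

Passing the filter as a callable keeps the waiting logic independent
of any particular event type.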
From now on, we will aim to ensure that the test driver
gets tested by OfBorg using all our available tests.
This commit adds the driver timeout test to the driver.
Since the debut of the test driver, we have never had a timer racing
the test execution to ensure that tests don't run beyond
a certain amount of time.
This is particularly important when you are running into hanging tests
which cannot be detected by current facilities (detecting them requires more pvpanic wiring,
QMP API work, etc.).
Two easy examples:
- Some QEMU tests may get stuck in some situation and run for more than 24 hours → we default to 1 hour max.
- Some QEMU tests may panic in the wrong place, e.g. UEFI firmware or worse → end users can set a "reasonable" amount of time
And then, we should let the retry logic retest them until they succeed and adjust
their global timeouts.
Of course, this does not help with the fact that the timeout may need to be
a function of the actual busyness of the machine running the tests.
This is only one step towards increased reliability.
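
A timer racing the test execution can be sketched like this. The
helper name `run_with_global_timeout` and the `on_timeout` callback are
assumptions for this example; the driver's actual mechanism for
stopping a hung test may differ.

```python
import threading
from typing import Callable


def run_with_global_timeout(
    test: Callable[[], None],
    timeout: float,
    on_timeout: Callable[[], None],
) -> bool:
    """Run `test` while a timer races it.

    Returns True if the test finished before the timer fired. When the
    timer fires first, `on_timeout` is called and is expected to make
    the test stop (for example by terminating the VMs).
    """
    fired = threading.Event()

    def expire() -> None:
        fired.set()
        on_timeout()

    timer = threading.Timer(timeout, expire)
    timer.start()
    try:
        test()
    finally:
        # Stop the timer if the test won the race.
        timer.cancel()
    return not fired.is_set()


# A test that would hang for 5 seconds is interrupted after 0.05 seconds.
stop = threading.Event()
finished_in_time = run_with_global_timeout(lambda: stop.wait(5.0), 0.05, stop.set)
```

The timer only signals; it is up to the callback to actually unblock
or kill whatever the test is waiting on.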
Now that we have a QMP client, we can wire it up in the test driver.
For now, it is almost completely useless because it needs a constant "event loop", especially
for event listening.
In the next commits, we will slowly enable more and more use cases.
When listening on unix sockets, it doesn't make sense to specify a port
for nginx's listen directive.
Since nginx defaults to port 80 when the port isn't specified (but the
address is), we can change the default for the option to null as well
without changing any behaviour.
This also makes configuration available if you just run those tools locally.
Also use ruff instead of pylint because it's faster and more
comprehensive.
Since 008f9f0cd4
("nixos/test-driver: actually use the backdoor message to wait for backdoor"),
while boot is still in progress, we can get tons of empty strings in response from the shell.
These are not useful to print and waste disk space on any CI system that logs them.
We stop logging chunks whenever they are empty.
While working on #192270, I noticed that only some wait_for_* helper
functions make the timeout configurable. I think we should be able to
customize it in all cases.
New EDK2 sets up the backdoor port as a serial console, which feeds the test driver
a bunch of boot logs it can safely ignore. Do so by waiting for the message the
backdoor shell prints before doing anything else.
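
The waiting step can be sketched as below. The marker string
`EXAMPLE BACKDOOR READY` and the helper name `skip_until_marker` are
placeholders invented for this example; the real message printed by
the backdoor shell is not reproduced here.

```python
from typing import Iterator, List


def skip_until_marker(lines: Iterator[str], marker: str) -> List[str]:
    """Consume lines until one contains `marker`; return what was skipped.

    Lines after the marker stay in the iterator for the caller.
    """
    skipped: List[str] = []
    for line in lines:
        if marker in line:
            return skipped
        skipped.append(line)
    raise EOFError("stream ended before the backdoor marker appeared")


stream = iter(
    ["UEFI boot log", "more firmware noise", "EXAMPLE BACKDOOR READY", "uid=0"]
)
ignored = skip_until_marker(stream, "EXAMPLE BACKDOOR READY")
```

Everything the firmware wrote before the marker is discarded, so the
first command's output is not mixed with boot logs.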
By some miracle, before, it was possible to reconnect to the `node1` without
doing any relevant dance.
But now that we are direct booting (?), it seems like we need to do the right things.
This introduces a `check_output` flag for `execute` because we do not want to steal the
messages from the backdoor service as we might execute the kexec too fast compared
to when we will reconnect.
Therefore, we will leave the message in the pipe if needed.
- `wait_until_fails` was not passing through its `timeout` argument to
the internal `retry` function, hence was always using 900 seconds (the
default timeout for `retry`) rather than the user-specified value.
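
The fix amounts to forwarding the argument, as in this simplified
sketch. Both functions are stand-ins written for illustration: the
injectable `sleep` parameter and the callable-based signature are
assumptions, and the real helpers take different arguments.

```python
import time
from typing import Callable


def retry(
    fn: Callable[[], bool],
    timeout: int = 900,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `fn` once per second until it returns True or time runs out."""
    for _ in range(timeout):
        if fn():
            return
        sleep(1)
    raise TimeoutError(f"action timed out after {timeout} seconds")


def wait_until_fails(
    command_succeeds: Callable[[], bool],
    timeout: int = 900,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    # The bug was the equivalent of omitting `timeout=timeout` here,
    # which silently fell back to the default of `retry`.
    retry(lambda: not command_succeeds(), timeout=timeout, sleep=sleep)


naps = []
try:
    wait_until_fails(lambda: True, timeout=3, sleep=naps.append)
    raised = False
except TimeoutError:
    raised = True
```

With the argument forwarded, a command that never fails gives up
after three attempts here instead of 900.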
Previously, `wait_for_console_text` would block indefinitely until there were lines
shown in the buffer.
This is highly annoying when testing for things that can just hang for some reason.
This introduces a classical timeout mechanism via non-blocking get on the Queue.
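
A minimal sketch of such a timeout, assuming console lines arrive on a
`queue.Queue` filled by another thread. The standalone function
signature and the `poll_interval` parameter are assumptions for this
example.

```python
import queue
import re
import time


def wait_for_console_text(
    lines: "queue.Queue[str]",
    regex: str,
    timeout: float,
    poll_interval: float = 0.01,
) -> str:
    """Return the first queued line matching `regex`, or raise on timeout."""
    matcher = re.compile(regex)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Non-blocking get: never wait inside the queue itself.
            line = lines.get(block=False)
        except queue.Empty:
            time.sleep(poll_interval)
            continue
        if matcher.search(line):
            return line
    raise TimeoutError(f"no console line matched {regex!r} within {timeout}s")


console = queue.Queue()
console.put("booting")
console.put("login: ")
matched = wait_for_console_text(console, r"login:", timeout=1.0)
```

Because the queue is polled rather than blocked on, a silent console
leads to a `TimeoutError` instead of a test that never returns.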
This is useful whenever you want to diagnose the current state of UEFI
variables, to assert that bootloaders or boot programs (systemd-stub)
did their job correctly and set their variables accordingly.
In the future, it could also enable inspecting SecureBoot keys.
This warning was added a year and a half ago, but still no test in
NixOS directly instantiates the machine class, presumably because it's
not actually possible for a test to do so without losing
functionality. For example, there's no way for a NixOS test to access
the output directory that create_machine passes to the Machine
constructor.
This warning is therefore just contributing to alert fatigue for
users, who are unable to follow its advice. Once it's actually
possible to do what it suggests, the warning can be reintroduced.
What the code was trying to do was helpfully add a directory and
extension if none were specified, but it did this by checking whether
the filename was composed of a very limited character set that didn't
even include dashes.
With this change, the intention of the code is clearer, and I can put
dashes in my screenshot names.
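
The intended behaviour can be sketched as follows. The helper name
`resolve_screenshot_path` and the exact checks are assumptions made
for illustration, not a copy of the changed code.

```python
import os


def resolve_screenshot_path(filename: str, out_dir: str) -> str:
    """Add a default extension and directory only when they are missing."""
    if "." not in filename:
        filename += ".png"
    if "/" not in filename:
        filename = os.path.join(out_dir, filename)
    return filename


path = resolve_screenshot_path("login-screen", "/out")
```

Checking for the absence of an extension or a directory separator
places no restriction on the other characters in the name, so dashes
are accepted.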
The output of a command is not guaranteed to be valid UTF-8, so the
decoding can fail, raising UnicodeDecodeError. If this happens during a
`succeeds` the check will be erroneously marked failed.
This changes the error handling to the "replace" mode, where invalid
codepoints are replaced with � (REPLACEMENT CHARACTER U+FFFD) and the
decoding can go on.
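
The difference between the two error modes can be seen in this small
example. The helper name `decode_output` is made up for illustration.

```python
def decode_output(raw: bytes) -> str:
    """Decode command output, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


raw = b"ok \xff done"  # 0xFF can never appear in valid UTF-8

try:
    raw.decode("utf-8")  # strict mode, the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_output(raw)
```

Strict decoding aborts on the first bad byte, while "replace" keeps
the rest of the output readable.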
`shell_interact()` is currently not nice to use. If you try to cancel
the socat process, it will also break the nixos test. Furthermore
ptpython creates its own terminal that subprocesses are running in,
which breaks some of the terminal features of socat.
Hence this commit extends `shell_interact` to also allow connecting to
arbitrary servers, i.e. TCP servers started by socat.
checkInputs used to be added to nativeBuildInputs. Now we have
nativeCheckInputs to do that instead. Doing this treewide change allows
us to keep hashes identical to before the introduction of
nativeCheckInputs.
A few places used Unicode U+2018/U+2019 left/right single quotes (but
not always correctly balanced). Let's just use plain ASCII single quotes
everywhere.
For example, the wait_for_unit() call in the Moodle test times out for
myself and others[1], so it would be good to be able to increase it to
something less likely to be hit by a test that would otherwise pass.
[1]: https://github.com/NixOS/nixpkgs/pull/177052#issue-1266336706
Within a dual VM test-setup a strange behaviour was observed.
The two VMs are connected via one vde_switch instance
(virtualisation.vlans = [ 1 ]; IMO a bad attribute name for
switch instances, has nothing to do with VLANs in sense of 802.1Q).
A ping on the base interface (eth1) works, but not on VLAN
subinterfaces (vlan1@eth1). A tcpdump of eth1 includes the ARP requests
tagged with the subinterface's VLAN ID, but responses seem not to pass
the vde_switch. This works fine if performed on the base interface.
Putting the vde_switch in hub mode results in flooding
traffic to all vde_switch ports. This results in the expected behaviour
and a ping on a VLAN subinterface works as expected.
Signed-off-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Without this fix, setting the shellopts in `machine.execute` is
inconsistent. When no timeout is used, shellopts `set -euo pipefail` are
applied to the command as expected. When a timeout is specified, the
shellopts are not applied to the command itself (which is called inside
a `sh -c` that doesn't inherit the shellopts) but rather to the
`timeout` command, leading to the following full command:
```bash
(set -euo pipefail; timeout 900 sh -c 'cmd') | (base64 --wrap 0; echo)\n
```
With this fix, this is the command we get:
```bash
(timeout 900 sh -c 'set -euo pipefail; false | true') | (base64 --wrap 0; echo)\n
```
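
How the command string could be assembled so that the shellopts end
up inside the `sh -c` invocation is sketched below. The function name
`build_command` is hypothetical and the code is a simplified
reconstruction, not the driver's implementation.

```python
import shlex
from typing import Optional


def build_command(command: str, timeout: Optional[int] = 900) -> str:
    """Prefix the shellopts first, then wrap the result in `timeout`."""
    command = f"set -euo pipefail; {command}"
    if timeout is not None:
        # Quoting after the prefix keeps the shellopts inside `sh -c`.
        command = f"timeout {timeout} sh -c {shlex.quote(command)}"
    return f"({command}) | (base64 --wrap 0; echo)\n"


with_timeout = build_command("false | true", timeout=900)
without_timeout = build_command("cmd", timeout=None)
```

Applying the prefix before the `timeout` wrapper is what makes both
code paths run the user's command under the same shell options.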