feat: Allow initrd configuration to be skipped #165

Merged
merged 3 commits into from
Jul 1, 2024


TheMysteriousX
Contributor

Enhancement:
Allow the initrd and NetworkManager/dracut flush module mechanisms to be disabled.

Reason:
We have volumes that are unlocked by clevis-luks-askpass late in the boot process after NetworkManager has put the system on the network, so no changes to the initrd are needed.

The systems that use this arrangement have complicated network setups (bonds, MACsec, static addressing, IPv6), so the role actually breaks the boot process for them, as it does not account for anything except a single NIC with DHCP and IPv4.

Result:
Users can disable initrd configuration if required, which allows advanced network configurations to be used or decryption to occur late in the boot process.

Issue Tracker Tickets (Jira or BZ if any):

@richm
Contributor

richm commented Jun 26, 2024

@TheMysteriousX can you give me an example of specific use case for this?

@sergio-correia does this make sense to you?

@sergio-correia
Member

@TheMysteriousX can you give me an example of specific use case for this?

@richm: I will let @TheMysteriousX elaborate on the use cases, but basically unlocking may happen at two different moments. The first is during early boot, in which case we need to amend the initrd to, e.g., set up networking when required. The second is in late boot, after the system has switched from the initrd to the actual root filesystem (the switch-root phase). At this point, the system uses its "regular" configuration.

For some devices, such as the root device (/) or swap, unlocking needs to happen in early boot so that the system can continue to boot; for other devices, e.g. an encrypted /opt, unlocking happens in late boot. If one only has devices that unlock in late boot, there is no need to change the initrd, which in this case seems to be causing issues with the regular system configuration after the switch-root phase.
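
To make the early-boot versus late-boot distinction concrete, here is a minimal Python sketch of the decision described above. It is only an illustration, not the role's actual logic: the names EncryptedDevice, EARLY_BOOT_TARGETS, and needs_initrd_changes are hypothetical, and the set of early-boot targets is an assumption taken from the two examples given (the root device and swap). A real system may need other volumes in early boot as well.

```python
from dataclasses import dataclass

# Assumption: targets that must be unlocked in early boot so that boot can
# continue. The discussion names "/" and swap as examples; this set is
# illustrative, not exhaustive.
EARLY_BOOT_TARGETS = frozenset({"/", "swap"})


@dataclass(frozen=True)
class EncryptedDevice:
    """Hypothetical description of one encrypted volume."""
    name: str         # illustrative label, e.g. "luks-opt"
    mount_point: str  # where the unlocked volume is used, or "swap"


def unlocks_in_early_boot(device: EncryptedDevice) -> bool:
    """True if the device has to be unlocked before switch-root."""
    return device.mount_point in EARLY_BOOT_TARGETS


def needs_initrd_changes(devices: list[EncryptedDevice]) -> bool:
    """The initrd only needs amending if some device unlocks in early boot."""
    return any(unlocks_in_early_boot(d) for d in devices)
```

With only an encrypted /opt in the list, needs_initrd_changes returns False, which is the case this PR targets: every device unlocks after switch-root, so the initrd can be left alone.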

@sergio-correia does this make sense to you?

Yep, it does.

@richm richm changed the title Allow initrd configuration to be skipped feat: Allow initrd configuration to be skipped Jun 26, 2024
@richm
Contributor

richm commented Jun 26, 2024

If you rebase on top of the latest main branch, that will fix the ansible-lint issue.

@TheMysteriousX
Contributor Author

Thanks for the review, all the proposed changes look good - I'll implement them and rebase as suggested.

can you give me an example of specific use case for this?

Sure - we keep the base OS volume unencrypted and attach additional encrypted data volumes separately to VMs because:

  • it eases deployment: we don't need to figure out how to get a VM booted from an initial key that is then overwritten later
  • we have a deduplicating storage backend with SEDs, so we do not need full disk encryption, just encryption of sensitive data
  • full disk encryption is really hard to debug and fix when it goes wrong, whereas additional volumes can just be disabled if they fail to decrypt, leaving the system bootable

So to keep things simple, we modified the nbde role to not configure the initrd and flush service - the encrypted volume gets unlocked and mounted after NetworkManager has started instead of before. Systemd supports some additional directives for making sure processes don't get started before a required volume is online, which we set with linux-system-roles/storage.
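
The comment above does not say which systemd directives are used, so the following Python sketch rests on an assumption: that the dependencies are expressed as mount options. It builds an option string from `_netdev` (marks the mount as needing the network) and `x-systemd.before=` (orders the mount before a named unit), both of which are documented in systemd.mount(5). The function name build_mount_options and the unit name myapp.service are hypothetical.

```python
def build_mount_options(dependent_units: list[str], network_backed: bool = True) -> str:
    """Return a comma-separated mount option string (illustrative only).

    dependent_units: units that should be ordered after the mount.
    network_backed:  add _netdev so the mount waits for the network, which
                     matters when unlocking has to reach a tang server first.
    """
    options = ["defaults"]
    if network_backed:
        options.append("_netdev")
    for unit in dependent_units:
        # Ordering only: x-systemd.before= adds Before=<unit> to the mount unit.
        options.append(f"x-systemd.before={unit}")
    return ",".join(options)


if __name__ == "__main__":
    print(build_mount_options(["myapp.service"]))
```

Note that `x-systemd.before=` only orders the two units. If the service must not run at all when the volume is missing, a hard dependency such as `RequiresMountsFor=` in the service unit would be the stricter choice.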

This is also advantageous because the role does not support static addressing, as additional parameters are required for that. Any host that contacts the tang hosts via e.g. Wi-Fi, cellular, bonded interfaces, dynamic routing, IPsec, WireGuard, or PPPoE is similarly not supported - I'm not aware of any simple way to support these, other than not decrypting in the initrd.

Ideally when using the initrd to decrypt volumes, you also need to run an sshd so that the host can be manually decrypted remotely if the tang hosts were to fail - but I don't think this is supported on RHEL at all.

TheMysteriousX and others added 2 commits June 27, 2024 23:32
@richm
Contributor

richm commented Jun 27, 2024

[citest]

@richm
Contributor

richm commented Jun 28, 2024

[citest]

@richm
Contributor

richm commented Jul 1, 2024

[citest]

@richm
Contributor

richm commented Jul 1, 2024

@sergio-correia how can we have automated tests for this new feature? is it possible?

@sergio-correia
Member

@sergio-correia how can we have automated tests for this new feature? is it possible?

Yeah, it is possible, but not very simple. Basically, we can provision a VM with an encrypted /data device, use the role to set up clevis, and then reboot the machine to verify that it unlocked the device successfully.

Long ago, when there was Travis CI (and it supported nested virt), we had a test in the clevis upstream repository that did something along those lines for each PR: provision a VM with a kickstart, set up clevis, reboot, and verify the expected outcome. When we moved to GitHub Actions, I remember it did not support nested virt, and we ended up removing that test.

@richm richm merged commit 744cd4d into linux-system-roles:main Jul 1, 2024
24 of 26 checks passed