Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: btrfs multidisk volume support #838

Draft
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

tbzatek
Copy link
Member

@tbzatek tbzatek commented Jan 25, 2021

This adds basic support for btrfs multidisk volumes. Changes are done in a general approach, supporting primary identifiers other than the traditional unix block device numbers (dev_t, major:minor). For btrfs such primary identifier becomes a udev ID_FS_UUID string, other filesystems are TBD but relatively easy to add (e.g. f2fs). Decision which approach to use is currently based on filesystem type. As a result this forms a virtual group of block devices carrying the particular identifier inside the daemon that are updated together. For simplicity this grouping is denoted a volume or a filesystem volume.

This however brings some caveats for this limited set of filesystems:

  • device numbers are not used anymore even for single device filesystems (as it's often difficult or impossible to distiguish how many components are needed prior to mounting)
  • block devices carrying matching identifiers will be treated as part of the volume even if they don't belong to it anymore (think e.g. about a precise bit copy - duplicate identifiers, underlying RAID array split-brain situation, etc.)

As this primarily affects mount points and mount operations only (and the UDisksState machinery in the background), the rest of the functionality remains available without major changes. I.e. block object representation doesn't change, operations on the org.freedesktop.UDisks2.Block interfaces are still available for the particular volume components, etc. As long as the ID_FS_USAGE=filesystem udev property is exposed on such block devices there's no point for breaking any existing functionality. And I certainly believe filesystem authors were fully aware of consequences by presenting this attribute.


This partly fixes #802 by denying multiple mounts for a given volume, no matter on which actual block object the org.freedesktop.UDisks2.Filesystem.Mount() method is called. The org.freedesktop.UDisks2.Filesystem interface properties should now be updated consistently across all objects for a given volume.

btrfs subvolume mounting (#768) is still broken at this point, this needs to be handled along with allowing multiple mounts in general, for any filesystem.

I'm still not 100% sure this is the right approach and whether further iterations of this design should follow or not. My attempt was to fix https://gitlab.gnome.org/GNOME/gvfs/-/issues/519 in a sane way to prevent creating new mounts as a result of non-matching device number against an active mount.

Resolves #802

@tbzatek
Copy link
Member Author

tbzatek commented Jan 25, 2021

(the first two commits are from #819, need to merge and rebase, along with an additional change I was planning)

@cmurf
Copy link

cmurf commented Jan 26, 2021

device numbers are not used anymore even for single device filesystems (as it's often difficult or impossible to distiguish how many components are needed prior to mounting)

Each member device has a superblock which contains num_devices. This can be parsed from btrfs inspect-internal dump-super output, but it might be preferable to enhance libbtrfsutil C API and python bindings to do this; or maybe also libblockdev.

block devices carrying matching identifiers will be treated as part of the volume even if they don't belong to it anymore (think e.g. about a precise bit copy - duplicate identifiers, underlying RAID array split-brain situation, etc.)

If the effective mount command is by UUID, i.e. /dev/disk/by-uuid/ then I think this problem can be ignored because the kernel is expected to deconflict, with btrfs:harden agaist duplicate fsid on scanned devices.

@tbzatek
Copy link
Member Author

tbzatek commented Jan 27, 2021

Each member device has a superblock which contains num_devices. This can be parsed from btrfs inspect-internal dump-super output, but it might be preferable to enhance libbtrfsutil C API and python bindings to do this; or maybe also libblockdev.

That's exactly the kind of information that should be exposed via udev attribute in the first place.

block devices carrying matching identifiers will be treated as part of the volume even if they don't belong to it anymore (think e.g. about a precise bit copy - duplicate identifiers, underlying RAID array split-brain situation, etc.)

If the effective mount command is by UUID, i.e. /dev/disk/by-uuid/ then I think this problem can be ignored because the kernel is expected to deconflict, with btrfs:harden agaist duplicate fsid on scanned devices.

The /dev/disk/ symlinks are just pointing to real device nodes. Mounting itself is a pretty complex operation, see e.g.
https://cdn.kernel.org/pub/linux/utils/util-linux/v2.36/libmount-docs/libmount-Library-high-level-context.html

But yes, I think the mnt_context_set_source() that is called through bd_fs_mount() can accept strings like UUID=xxxxxxxxxxx instead of a device file. Good idea, added as a card into https://github.com/storaged-project/udisks/projects/3

@tbzatek
Copy link
Member Author

tbzatek commented Jan 29, 2021

Pushing this back to udisks-2.10 as this will require careful security audit. Blindly adding any btrfs filesystem with matching UUID to an internal mountpoint group is vulnerable to external attacks (forged USB stick, gaining seat privileges automatically). As UDisks operates on block devices, exposing block objects, an exact list of block devices involved in an active btrfs mount is a requirement to ensure reasonable level of security.

@tbzatek
Copy link
Member Author

tbzatek commented Feb 2, 2021

Structure of an active mount looks to be exposed on /sys/fs/btrfs/<FS_UUID>/devices which is exactly I was looking for. This will solve the grouping concern.

@cmurf
Copy link

cmurf commented Feb 2, 2021

Structure of an active mount looks to be exposed on /sys/fs/btrfs/<FS_UUID>/devices which is exactly I was looking for.

I forgot to mention the sysfs stuff, sorry. The problem is this info isn't available in sysfs until the file system is mounted. I'm not sure how this works for mdadm raid, whether or how udev figures out how many devices there are, but maybe the kernel can be enhanced to expose limited info like "number of devices" in sysfs even if they aren't all yet available or mounted? Is that useful?

Related:
https://lists.freedesktop.org/archives/systemd-devel/2021-January/045922.html

@tbzatek tbzatek changed the title btrfs multidisk volume support WIP: btrfs multidisk volume support Feb 3, 2021
@tbzatek
Copy link
Member Author

tbzatek commented Feb 3, 2021

FYI the test_btrfs.UdisksBtrfsTest.test_add_remove_device test is failing on F33 + 5.10.11-200.fc33, looks like triggering uevent on a removed volume and the parent one after device removal (btrfs device delete) wouldn't make btrfs update its internal counters since btrfs filesystem show would show outdated number of devices. Adding 1 sec. sleep solves the issue. Of course we're not adding such lame workaround, this needs to be looked at in detail and reported to btrfs devs. TODO. This used to be working fine on F32 (probably 5.9 kernel).

Perhaps rewriting bd_btrfs_filesystem_info() and the whole libblockdev plugin to libbtrfsutil (storaged-project/libblockdev#552) or even reading the information from sysfs could improve things here. The struct BDBtrfsFilesystemInfo is fairly simple. TODO#2.

@tbzatek
Copy link
Member Author

tbzatek commented Feb 3, 2021

I forgot to mention the sysfs stuff, sorry. The problem is this info isn't available in sysfs until the file system is mounted. I'm not sure how this works for mdadm raid, whether or how udev figures out how many devices there are, but maybe the kernel can be enhanced to expose limited info like "number of devices" in sysfs even if they aren't all yet available or mounted? Is that useful?

I must've overlooked the sysfs stuff the first time I was looking for any suitable information there. Or perhaps I didn't have any active mounts back at that time...

I guess for the purposes of identifying exact block devices involved in a mounted filesystem this is sufficient, for the non-mounted scenario not all information are necessary at that point and the bd_btrfs_filesystem_info() calls reading the superblock is just fine.

The purpose for the mounted scenario is to deny potentially destructive operations on the org.freedesktop.UDisks2.Block interface on block devices that are part of a mounted filesystem while allowing that for others even though they carry the same (duplicate) ID_FS_UUID yet are either kicked out from the filesystem/volume or otherwise not activated. And for this purpose we needed exact kernel block device numbers.

For the non-mounted scenario callers are free to destroy any component of a multidisk btrfs filesystem as they wish
(still gated by Polkit auth) so no runtime info is needed.

Related:
https://lists.freedesktop.org/archives/systemd-devel/2021-January/045922.html

Thanks for pushing this upstream. I think my former kdave/btrfs-progs#302 request looks obsolete now, still need to check whether we have all the information we need before closing that one.

@tbzatek tbzatek removed this from the udisks-2.10.0 milestone Nov 8, 2021
@tbzatek tbzatek mentioned this pull request Jan 5, 2022
This should be working properly now with uevent synthetic tagging in use.
This should be a common function to check whether the filesystem
is a traditional one, device number based, or whether it is a filesystem
of a different concept, based e.g. on 'volumes' such as btrfs, and
to indicate that any filesystem matching should be done some different
way, e.g. by filesystem UUID.
…ed filesystems

For filesystems with concept of 'volumes' the device number of an active
mount often doesn't match the real backing block device. A different
approach must be taken to match the mount with a block device. The
generic ID_FS_UUID udev attribute looks like a good candidate and
it seems to be widely used nowadays. For the moment this approach is
only used for btrfs devices.

The downside of such approach is grouping of all devices carrying
the particular attribute value under one volume, no matter whether
they're actually attached to an active mount or if it's just a copy
(duplicate UUIDs caused by full byte copy of a device).
This extends the on-disk mount state file structure with filesystem UUID
if available or requested. This is an initial support that might be
further extended to consider other block devices in the group
(filesystem volume or an active mountpoint). For now the cleanup takes
a more conservative approach in such case.
This returns a list of UDisksObjects that are part of the multidisk volume
the current UDisksFilesystem interface points to.
… multidisk volume

Wired for the Mount() and Unmount() methods so far this will trigger additional
uevents to the rest of the block devices that form the particular "volume".
This will ensure the properties on all block objects that form
a multi-disk btrfs volume are properly updated.
@tbzatek
Copy link
Member Author

tbzatek commented Dec 7, 2023

Some more ideas:

  • introduce new abstraction UDisksDeviceIdentifier or similar - a simple struct that would hold traditional major:minor numbers as well as volume UUID or any arbitrary unique identifiers. Should be future-proof in case another logical-volume-manager-like filesystem is invented.

@Managor
Copy link

Managor commented Aug 24, 2024

What is the status of this PR?

@cheako
Copy link

cheako commented Aug 24, 2024

What is the status of this PR?

I'm not sure, but I do think because of it's size(is it a dozen pages?) it should be scrapped. Also this PR doesn't address the parent issue leaving every other user of libblkid(util-linux) to also implement a solution leading to duplication of code... So not only is the patch large it causes an even greater burden to FOSS as a whole. There is a patch, it's about 200 lines, to libblkid that fully addressed the issue for me. util-linux/util-linux#1382 (comment) It's so small I won't bother with a PR, the commit messages alone would delete more time than the patch.

@tbzatek
Copy link
Member Author

tbzatek commented Aug 26, 2024

What is the status of this PR?

Not entirely abandoned yet no free time to work on this.

With the invention of bcachefs along with established multi-volume capable filesystems like btrfs, f2fs and zfs, this will need to be taken more holistically to bring good enough abstraction for unique device identifiers. Lots of work, not enough time to work on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

how to handle Btrfs multiple devices on the desktop
4 participants