So, 2024-11-19 BALUG meeting, one of our discussion topics: device mapper - and dmsetup(8) and related.
So, device mapper - it's used to create/manage block devices, where each such device has a specification (a table) of how its blocks are mapped to zero or more other device(s). It operates in units of traditional 512 byte blocks (sectors), and can handle quite a wide variety of possible ways of doing the mapping. Its target types include at least: cache, clone, dust, crypt, delay, ebs, error, flakey, linear, mirror, multipath, raid, snapshot, striped, thin, zero
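As a trivial (hedged) illustration of the mechanics - the device name dm-demo-zero here is just made up for the example - each table line is of the form '<start sector> <number of sectors> <target type> <target args...>', so a 1GiB all-zeros device could be created and torn down like:
# dmsetup create dm-demo-zero --table '0 2097152 zero'
# blockdev --getsize64 /dev/mapper/dm-demo-zero
// should report 1073741824, i.e. 2097152 512 byte sectors = 1GiB
# dmsetup remove dm-demo-zero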
Those are some of the bits I mentioned. It's relatively low level, but many other tools and subsystems leverage the device mapper. E.g. LVM uses device mapper for its lower level configuration, as does cryptsetup, e.g. for LUKS.
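If curious how those use it on a given system, one can poke at what they've set up (output of course varies per system; this is just an illustrative peek):
# dmsetup ls --tree
# dmsetup table
# dmsetup info -c
// LVM logical volumes typically show up as linear (and related) targets,
// and LUKS volumes as crypt targets.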
Sometimes using device mapper more directly can be quite useful. E.g., to test I/O issues, it has dust, error, and flakey, so one can set up a device that gives errors on I/O based upon various criteria as specified. One can also have data read back that differs from what was written, etc.
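E.g., a hedged sketch (the sizes and the backing /dev/loop0 are just placeholders): splice an error target into the middle of an otherwise linear mapping, so I/O to those sectors fails, or use flakey so the device works for a while and then misbehaves:
# dmsetup create dm-demo-err << __EOT__
0 2048 linear /dev/loop0 0
2048 2048 error
4096 2043904 linear /dev/loop0 4096
__EOT__
// Or flakey: behaves like linear for 59 seconds out of every 60,
// then errors all I/O for the remaining 1 second:
# dmsetup create dm-demo-flakey --table '0 2048000 flakey /dev/loop0 0 59 1'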
Device mapper also has RAID capabilities - which brings me to an example. Some while back, I had occasion to assist someone in coming up with a solution to something they wanted to do. They had a fair sized hardware RAID array, RAID-5 with 4x12TB drives. They wanted to migrate to md raid5, with 4x12TB (new) drives. And they wanted to minimize downtime to the extent feasible. Well, the easiest way to do that would be to effectively layer RAID-1 - at least temporarily - atop the old and new, let it sync, then split that mirror. But LVM or md, etc. RAID-1 is not so great for that - as they'd all generally want to write their relevant headers on the devices themselves, and it's at best non-trivial to get them to write that data, block status tracking data, etc., somewhere else - if they even support that at all.
So, solution for that? Use device mapper. Can then just directly raid1 mirror the blocks of the two devices - the original RAID-5 hardware RAID device, and the newer replacement md raid5 software RAID device.
So, first of all, documentation, etc. There's the dmsetup(8) man page. Pretty good, but it leaves a lot to be desired - notably it doesn't well cover a lot of details that are or may be necessary. So, next stop - and what it quite directly refers one to - the kernel documentation. E.g.: file:///usr/share/doc/linux-doc-6.1/html/admin-guide/device-mapper/ etc. Pretty good ... but not so great on, e.g., more complete examples, and even some relevant information turned out to be quite missing from the kernel documentation (though hey, could probably read the relevant source ... but that generally wouldn't be the easiest way). So next, some bits of Internet searching ... and found some good resources, e.g.: https://wiki.gentoo.org/wiki/Device-mapper#Mirror_and_RAID1 So, at least taken together, there was sufficient information. So ... I earlier did a test demo run, to show how it could be done ... but didn't save all my information/notes on that, so let me repeat that. And a bit better this time - including metadata devices - that way even if, e.g., the system were to crash while the sync was in progress, it could be cleanly resumed.
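For reference - this is my hedged paraphrase of the kernel's dm-raid documentation, so double-check there - a "raid" target table line is roughly of the form:
<start> <length> raid <raid_type> <#raid_params> <raid_params...> <#raid_devs> <metadata_dev0> <data_dev0> [.. <metadata_devN> <data_devN>]
// So for the two-way raid1 used in the demo below:
// raid1          - the RAID level
// 5              - count of the raid parameters which follow
// 0              - chunk size in sectors (not meaningful for raid1)
// region_size 32 - sectors per region tracked in the sync bitmap
// rebuild 1      - device index 1 is to be (re)built, i.e. synced from the other leg
// 2              - number of data devices, each given as <metadata_dev> <data_dev>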
So, I don't have 8x14TB of spare drives sitting around, so I'll do this much smaller, with some space on /tmp, some files there, and losetup(8) to create block devices suitable for use. So, below, my comments are on the lines starting with //:
// So, don't have 8x14TB to work with, but on /tmp at present, can
// easily spare 64GiB. So will instead do ~24GiB to emulate the old and
// 4x8GiB to emulate the new. Since the "old" source is hardware
// RAID-5, will just do appropriately sized storage for that, as I don't
// have hardware RAID-5 available for this. First let's create the
// backing files and loop devices for our first 4 devices, to represent
// our target drives:
# mkdir /tmp/dmr1 && cd /tmp/dmr1
# truncate -s $(expr 8 \* 1024 \* 1024 \* 1024) f{1,2,3,4} && (for n in 1 2 3 4; do losetup -f --show f"$n"; done)
/dev/loop2
/dev/loop3
/dev/loop4
/dev/loop5
# // And now create our software md raid5 device:
# mdadm --create --level=raid5 --raid-devices=4 /dev/md24 /dev/loop[2-5]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md24 started.
# // And now let's get its exact size:
$ cat /sys/block/md24/size
50276352
$ // That's in 512 byte blocks. As for bytes:
$ expr 50276352 \* 512
25741492224
$ // So let's create our source device of exactly that size:
# truncate -s 25741492224 f0 && losetup -f --show f0
/dev/loop6
# // In our actual case, the source would need to be no larger than the
// target. If the source were larger, we'd need to shrink the (relevant)
// data before copying, e.g. reduce the size of the filesystem a bit,
// possibly repartition slightly, etc.
// And let's confirm our sizes match:
$ cmp /sys/block/md24/size /sys/block/loop6/size && echo MATCHED
MATCHED
$ // And let's create some data on our source device:
# mkfs -t ext3 -L 24gr5 -m 0 /dev/loop6 && mount -o nosuid,nodev /dev/loop6 /mnt && { dd if=/dev/urandom of=/mnt/urandom bs=1048576 status=none; < /mnt/urandom sha512sum && umount /mnt; }
// ...
dd: error writing '/mnt/urandom': No space left on device
6e0487bed425a7bb667d169e001415a4a18c6413ee5be56f032ebac7ea827dae9caee9ab0d0801e1d1b537eabc75cee9842de00a1089fd0afd7e7630752128aa -
# // Now let's set up our device mapper device. We'll do this with metadata
// devices to track state, so, e.g. if interrupted (e.g. system crash,
// abrupt power down, drives disconnected, etc.) it can still safely be
// resumed after (we'll essentially ignore that our backing store is on
// the volatile /tmp - this is still just a demo after all).
// Unfortunately the kernel documentation doesn't say how large these
// metadata devices need to be. Probably a relatively small % of the total
// space; I'll give them more than ample space on sparse files, then we can
// look at actual block usage after. So, our devices are 50276352 512 byte
// blocks; let's say 5% of that - that ought to be much more than enough -
// but with sparse files, it won't much matter.
$ echo '50276352*512/20' | bc -l
1287074611.20000000000000000000
$ // And let's round that up to a 4KiB boundary:
$ echo '1287074611.2/4/1024' | bc -l
314227.20000000000000000000
$ expr 314228 \* 4 \* 1024
1287077888
$
# truncate -s 1287077888 m{0,1}
# (for n in 0 1; do losetup -f --show m"$n"; done)
/dev/loop7
/dev/loop8
# // We'll also look at the status right after creation and 30 seconds later,
// and will mount it right after we create it too:
# dmsetup create dmr1 --table '0 50276352 raid raid1 5 0 region_size 32 rebuild 1 2 /dev/loop7 /dev/loop6 /dev/loop8 /dev/md24' && dmsetup status dmr1 && mount -o nosuid,nodev /dev/mapper/dmr1 /mnt && sleep 30 && dmsetup status dmr1
0 50276352 raid raid1 2 Aa 0/50276352 recover 0 0 -
0 50276352 raid raid1 2 Aa 2046848/50276352 recover 0 0 -
# // And a while later we have:
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# // That field of all uppercase "A" characters tells us the RAID-1
// devices are fully synced up.
// Let's deconstruct our RAID-1 device and compare the files on the
// filesystems - which have now been copied via mirroring. We'll also
// read and recompute the hash of one of the files to check that it still
// matches. Also, so the filesystems don't conflict, we'll change the
// label and UUID on the "old" one, so the originals remain on the "new"
// target one.
# umount /mnt
// Check again that we're synced before removal:
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# dmsetup remove dmr1
# tune2fs -L 24gr5.old -U random /dev/loop6
tune2fs 1.47.0 (5-Feb-2023)
# mkdir mnt-old mnt-new
# mount -o ro,nosuid,nodev /dev/loop6 mnt-old
# mount -o ro,nosuid,nodev /dev/md24 mnt-new
# cmp mnt-{old,new}/urandom && echo MATCHED
MATCHED
# < mnt-new/urandom sha512sum
6e0487bed425a7bb667d169e001415a4a18c6413ee5be56f032ebac7ea827dae9caee9ab0d0801e1d1b537eabc75cee9842de00a1089fd0afd7e7630752128aa -
# // And we can see that the files match and the hash matches our earlier one.
// Let's do it one more time, except this time with the filesystem very
// busy while it's mounted and doing the RAID-1 sync.
// And this time we mirror from the new to the old, as the new now has
// exactly the data we want.
# umount mnt-old && umount mnt-new
# dmsetup create dmr1 --table '0 50276352 raid raid1 5 0 region_size 32 rebuild 0 2 /dev/loop7 /dev/loop6 /dev/loop8 /dev/md24' && mount -o nosuid,nodev /dev/mapper/dmr1 /mnt && { dd if=/dev/urandom of=/mnt/urandom bs=1048576 status=none; < /mnt/urandom sha512sum; }
dd: error writing '/mnt/urandom': No space left on device
c818fb535d037b01868252a3f2464cc17fa70b8f4cb21436a0f7d3d9c85b4783ac7e7835f47d55c588287688013b687743886f5640ddfb91ff9e2f8177dd5b38 -
# // We check until we see it's synced:
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# // Then we unmount, and again reconfirm it's synced:
# umount /mnt
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# // Now we again deconstruct the RAID-1, update the label and UUID on the old,
// mount and compare, and also again compute the hash on one of the
// files to see that it also matches:
# dmsetup remove dmr1
# tune2fs -L 24gr5.old -U random /dev/loop6
tune2fs 1.47.0 (5-Feb-2023)
# mount -o ro,nosuid,nodev /dev/loop6 mnt-old
# mount -o ro,nosuid,nodev /dev/md24 mnt-new
# cmp mnt-{old,new}/urandom && echo MATCHED
MATCHED
# < mnt-old/urandom sha512sum
c818fb535d037b01868252a3f2464cc17fa70b8f4cb21436a0f7d3d9c85b4783ac7e7835f47d55c588287688013b687743886f5640ddfb91ff9e2f8177dd5b38 -
// So, the files again matched, and our hash again matches. Also, let's check
// how much space was actually used for the meta devices that track the RAID-1
// status:
# stat -c '%b %n' m[01]
400 m0
400 m1
# // So that's 400 512 byte blocks, i.e. 200 KiB, each.
// That size apparently depends upon the size of the device and region_size,
// and there apparently is also a limit on the total maximum size it will use
// for that meta device.
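As an aside (not something I ran above, just a hedged sketch), rather than re-running dmsetup status by hand, one could wait for the sync to finish with a small loop - the 7th whitespace-separated field of the status line is <synced>/<total> sectors:
# // Loop until the synced and total sector counts are equal:
# while set -- $(dmsetup status dmr1) && [ "${7%/*}" != "${7#*/}" ]; do sleep 60; done
Or just keep an eye on the device health field until it's all uppercase "A"s, as above.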
So, our original scenario to replicate the RAID-5 from hardware RAID to software (md) RAID while minimizing downtime would go about like this (see the command sketch after the list):
o create the new target md device, note its precise size
o stop all I/O on the old device (e.g. unmount it)
o note the size of the old - it must not be larger than the new; if necessary shrink and/or repartition, etc. or the like as appropriate
o create the dm device as a RAID-1 between the old and new, syncing to the new
o return I/O to service (using the dm device in place of the old device)
o after the sync has completed (may take hours to days for larger/slower storage), as before, stop all I/O - except now on the dm device instead of the old device; again check/wait until it's fully synced
o tear down the dm device
o adjust labels, UUIDs, etc. if/as applicable on the old so as not to conflict with the new
o return I/O to service, using the new in place of the old
o if/as desired, tear down or decommission the old
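Putting that together as a hedged command sketch - the device names, metadata devices, and mount point here are hypothetical placeholders, to be substituted with the real ones:
// OLD = the hardware RAID-5 LUN, NEW = the new md raid5 array,
// META0/META1 = two small spare block devices for the dm-raid metadata,
// /srv/data = wherever the filesystem lives.
# OLD=/dev/sdb NEW=/dev/md0 META0=/dev/sdc1 META1=/dev/sdc2
# SECTORS=$(cat /sys/block/md0/size)
// Stop I/O on the old device:
# umount /srv/data
# dmsetup create migr1 --table "0 $SECTORS raid raid1 5 0 region_size 32 rebuild 1 2 $META0 $OLD $META1 $NEW"
// Back in service, now via the dm device, while the sync runs:
# mount /dev/mapper/migr1 /srv/data
// ... hours to days later, once dmsetup status migr1 shows all "A"s and
// <n>/<n> sectors ...
# umount /srv/data
# dmsetup status migr1
# dmsetup remove migr1
// Adjust label/UUID/etc. on $OLD if it will remain attached, then:
# mount "$NEW" /srv/data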
The key advantage is the relative minimization of downtime - across the hours (or days or more) while the RAID-1 mirror is syncing to duplicate the storage data, no downtime is needed, and the storage can be used as usual. There's a bit of downtime for the shuffling about, notably switching from old to dm, and then from dm to new, but other than that, things generally remain online and actively available. Additionally, no headers or encapsulation need be written on the old or new storage itself - that's all handled externally, which keeps things clean and relatively simple.