So, 2024-11-19 BALUG meeting, one of our discussion topics: device mapper - and dmsetup(8) and related.
So, device mapper - it's used to create/manage block devices, where each such device has a specification (a table) of how its blocks are mapped to zero or more other device(s). It operates in units of traditional 512 byte blocks (sectors), and can handle quite a wide variety of possible ways of doing the mapping. Its target types include at least: cache, clone, dust, crypt, delay, ebs, error, flakey, linear, mirror, multipath, raid, snapshot, striped, thin, zero
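As a trivial (hedged) illustration of the mechanics - the device name dm-demo-zero here is just made up for the example - each table line is of the form '<start sector> <number of sectors> <target type> <target args...>', so a 1GiB all-zeros device could be created and torn down like:
# dmsetup create dm-demo-zero --table '0 2097152 zero'
# blockdev --getsize64 /dev/mapper/dm-demo-zero
// should report 1073741824, i.e. 2097152 512 byte sectors = 1GiB
# dmsetup remove dm-demo-zero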
Those are some of the bits I mentioned. It's relatively low level, but many other tools and subsystems leverage the device mapper. E.g. LVM uses device mapper for its lower level configuration, as does cryptsetup, e.g. for LUKS.
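If curious how those use it on a given system, one can poke at what they've set up (output of course varies per system; this is just an illustrative peek):
# dmsetup ls --tree
# dmsetup table
# dmsetup info -c
// LVM logical volumes typically show up as linear (and related) targets,
// and LUKS volumes as crypt targets.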
Sometimes using device mapper more directly can be quite useful. E.g., to test I/O issues, it has dust, error, and flakey, so one can set up a device that gives errors on I/O based upon various criteria as specified. One can also have data read back that differs from what was written, etc.
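E.g., a hedged sketch (the sizes and the backing /dev/loop0 are just placeholders): splice an error target into the middle of an otherwise linear mapping, so I/O to those sectors fails, or use flakey so the device works for a while and then misbehaves:
# dmsetup create dm-demo-err << __EOT__
0 2048 linear /dev/loop0 0
2048 2048 error
4096 2043904 linear /dev/loop0 4096
__EOT__
// Or flakey: behaves like linear for 59 seconds out of every 60,
// then errors all I/O for the remaining 1 second:
# dmsetup create dm-demo-flakey --table '0 2048000 flakey /dev/loop0 0 59 1'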
Device mapper also has RAID capabilities - which brings me to an example. Some while back, I had occasion to assist someone in coming up with a solution to something they wanted to do. They had a fair sized hardware RAID array, RAID-5 with 4x12TB drives. They wanted to migrate to md raid5, with 4x12TB (new) drives. And they wanted to minimize downtime to the extent feasible. Well, the easiest way to do that would be to effectively layer RAID-1 - at least temporarily - atop the old and new, let it sync, then split that mirror. But LVM or md, etc. RAID-1 is not so great for that - as they'd all generally want to write their relevant headers on the devices themselves, and it's at best non-trivial to get them to write that data, block status tracking data, etc., somewhere else - if they even support that at all.
So, solution for that? Use device mapper. Can then just directly raid1 mirror the blocks of the two devices - the original RAID-5 hardware RAID device, and the newer replacement md raid5 software RAID device.
So, first of all, documentation, etc. There's the dmsetup(8) man page. Pretty good, but it leaves a lot to be desired - notably it doesn't well cover a lot of details that are or may be necessary. So, next stop - and what it quite directly refers one to - the kernel documentation. E.g.: file:///usr/share/doc/linux-doc-6.1/html/admin-guide/device-mapper/ etc. Pretty good ... but not so great on, e.g., more complete examples, and even some relevant information turned out to be quite missing from the kernel documentation (though hey, could probably read the relevant source ... but that generally wouldn't be the easiest way). So next, some bits of Internet searching ... and found some good resources, e.g.: https://wiki.gentoo.org/wiki/Device-mapper#Mirror_and_RAID1 So, at least taken together, there was sufficient information. So ... I earlier did a test demo run, to show how it could be done ... but didn't save all my information/notes on that, so let me repeat that. And a bit better this time - including metadata devices - that way even if, e.g., the system were to crash while the sync was in progress, it could be cleanly resumed.
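For reference - this is my hedged paraphrase of the kernel's dm-raid documentation, so double-check there - a "raid" target table line is roughly of the form:
<start> <length> raid <raid_type> <#raid_params> <raid_params...> <#raid_devs> <metadata_dev0> <data_dev0> [.. <metadata_devN> <data_devN>]
// So for the two-way raid1 used in the demo below:
// raid1          - the RAID level
// 5              - count of the raid parameters which follow
// 0              - chunk size in sectors (not meaningful for raid1)
// region_size 32 - sectors per region tracked in the sync bitmap
// rebuild 1      - device index 1 is to be (re)built, i.e. synced from the other leg
// 2              - number of data devices, each given as <metadata_dev> <data_dev>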
So, I don't have 8x14TB of spare drives sitting around, so I'll do this much smaller, with some space on /tmp, some files there, and losetup(8) to create block devices suitable for use. So, below, my comments are on the lines starting with //:
// So, don't have 8x14TB to work with, but on /tmp at present, can
// easily spare 64GiB. So will instead do ~24GiB to emulate the old and
// 4x8GiB to emulate the new. Since the "old" source is hardware
// RAID-5, will just do appropriately sized storage for that, as I don't
// have hardware RAID-5 available for this. First let's create the
// backing files and loop devices for our first 4 devices, to represent
// our target drives:
# mkdir /tmp/dmr1 && cd /tmp/dmr1
# truncate -s $(expr 8 \* 1024 \* 1024 \* 1024) f{1,2,3,4} && (for n in 1 2 3 4; do losetup -f --show f"$n"; done)
/dev/loop2
/dev/loop3
/dev/loop4
/dev/loop5
# // And now create our software md raid5 device:
# mdadm --create --level=raid5 --raid-devices=4 /dev/md24 /dev/loop[2-5]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md24 started.
# // And now let's get its exact size:
$ cat /sys/block/md24/size
50276352
$ // That's in 512 byte blocks. As for bytes:
$ expr 50276352 \* 512
25741492224
$ // So let's create our source device of exactly that size:
# truncate -s 25741492224 f0 && losetup -f --show f0
/dev/loop6
# // In our actual case, the source would need to be no larger than the
// target. If the source were larger, we'd need to shrink the (relevant)
// data before copying, e.g. reduce the size of the filesystem a bit,
// possibly repartition slightly, etc.
// And let's confirm our sizes match:
$ cmp /sys/block/md24/size /sys/block/loop6/size && echo MATCHED
MATCHED
$ // And let's create some data on our source device:
# mkfs -t ext3 -L 24gr5 -m 0 /dev/loop6 && mount -o nosuid,nodev /dev/loop6 /mnt && { dd if=/dev/urandom of=/mnt/urandom bs=1048576 status=none; < /mnt/urandom sha512sum && umount /mnt; }
// ...
dd: error writing '/mnt/urandom': No space left on device
6e0487bed425a7bb667d169e001415a4a18c6413ee5be56f032ebac7ea827dae9caee9ab0d0801e1d1b537eabc75cee9842de00a1089fd0afd7e7630752128aa -
# // Now let's set up our device mapper device. We'll do this with metadata
// devices to track state, so, e.g. if interrupted (e.g. system crash,
// abrupt power down, drives disconnected, etc.) it can still safely be
// resumed after (we'll essentially ignore that our backing store is on
// the volatile /tmp - this is still just a demo after all).
// Unfortunately the kernel documentation doesn't say how large these
// metadata devices need to be. Probably a relatively small % of the total
// space; I'll give them more than ample space on sparse files, then we can
// look at actual block usage after. So, our devices are 50276352 512 byte
// blocks; let's say 5% of that - that ought to be much more than enough -
// but with sparse files, it won't much matter.
$ echo '50276352*512/20' | bc -l
1287074611.20000000000000000000
$ // And let's round that up to a 4KiB boundary:
$ echo '1287074611.2/4/1024' | bc -l
314227.20000000000000000000
$ expr 314228 \* 4 \* 1024
1287077888
$
# truncate -s 1287077888 m{0,1}
# (for n in 0 1; do losetup -f --show m"$n"; done)
/dev/loop7
/dev/loop8
# // We'll also look at the status right after creation and 30 seconds later,
// and will mount it right after we create it too:
# dmsetup create dmr1 --table '0 50276352 raid raid1 5 0 region_size 32 rebuild 1 2 /dev/loop7 /dev/loop6 /dev/loop8 /dev/md24' && dmsetup status dmr1 && mount -o nosuid,nodev /dev/mapper/dmr1 /mnt && sleep 30 && dmsetup status dmr1
0 50276352 raid raid1 2 Aa 0/50276352 recover 0 0 -
0 50276352 raid raid1 2 Aa 2046848/50276352 recover 0 0 -
# // And a while later we have:
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# // That field of all uppercase "A" characters tells us the RAID-1
// devices are fully synced up.
// Let's deconstruct our RAID-1 device and compare the files on the
// filesystems - which have now been copied via mirroring. We'll also
// read and recompute the hash of one of the files to check that it still
// matches. Also, so the filesystems don't conflict, we'll change the
// label and UUID on the "old" one, so the originals remain on the "new"
// target one.
# umount /mnt
// Check again that we're synced before removal:
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# dmsetup remove dmr1
# tune2fs -L 24gr5.old -U random /dev/loop6
tune2fs 1.47.0 (5-Feb-2023)
# mkdir mnt-old mnt-new
# mount -o ro,nosuid,nodev /dev/loop6 mnt-old
# mount -o ro,nosuid,nodev /dev/md24 mnt-new
# cmp mnt-{old,new}/urandom && echo MATCHED
MATCHED
# < mnt-new/urandom sha512sum
6e0487bed425a7bb667d169e001415a4a18c6413ee5be56f032ebac7ea827dae9caee9ab0d0801e1d1b537eabc75cee9842de00a1089fd0afd7e7630752128aa -
# // And we can see that the files match and the hash matches our earlier one.
// Let's do it one more time, except this time with the filesystem very
// busy while it's mounted and doing the RAID-1 sync.
// And this time we mirror from the new to the old, as the new now has
// exactly the data we want.
# umount mnt-old && umount mnt-new
# dmsetup create dmr1 --table '0 50276352 raid raid1 5 0 region_size 32 rebuild 0 2 /dev/loop7 /dev/loop6 /dev/loop8 /dev/md24' && mount -o nosuid,nodev /dev/mapper/dmr1 /mnt && { dd if=/dev/urandom of=/mnt/urandom bs=1048576 status=none; < /mnt/urandom sha512sum; }
dd: error writing '/mnt/urandom': No space left on device
c818fb535d037b01868252a3f2464cc17fa70b8f4cb21436a0f7d3d9c85b4783ac7e7835f47d55c588287688013b687743886f5640ddfb91ff9e2f8177dd5b38 -
# // We check until we see it's synced:
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# // Then we unmount, and again reconfirm it's synced:
# umount /mnt
# dmsetup status dmr1
0 50276352 raid raid1 2 AA 50276352/50276352 idle 0 0 -
# // Now we again deconstruct the RAID-1, update the label and UUID on the old,
// mount and compare, and also again compute the hash on one of the
// files to see that it also matches:
# dmsetup remove dmr1
# tune2fs -L 24gr5.old -U random /dev/loop6
tune2fs 1.47.0 (5-Feb-2023)
# mount -o ro,nosuid,nodev /dev/loop6 mnt-old
# mount -o ro,nosuid,nodev /dev/md24 mnt-new
# cmp mnt-{old,new}/urandom && echo MATCHED
MATCHED
# < mnt-old/urandom sha512sum
c818fb535d037b01868252a3f2464cc17fa70b8f4cb21436a0f7d3d9c85b4783ac7e7835f47d55c588287688013b687743886f5640ddfb91ff9e2f8177dd5b38 -
// So, the files again matched, and our hash again matches. Also, let's check
// how much space was actually used for the meta devices that track the RAID-1
// status:
# stat -c '%b %n' m[01]
400 m0
400 m1
# // So that's 400 512 byte blocks, i.e. 200 KiB, each.
// That size apparently depends upon the size of the device and region_size,
// and there apparently is also a limit on the total maximum size it will use
// for that meta device.
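As an aside (not something I ran above, just a hedged sketch), rather than re-running dmsetup status by hand, one could wait for the sync to finish with a small loop - the 7th whitespace-separated field of the status line is <synced>/<total> sectors:
# // Loop until the synced and total sector counts are equal:
# while set -- $(dmsetup status dmr1) && [ "${7%/*}" != "${7#*/}" ]; do sleep 60; done
Or just keep an eye on the device health field until it's all uppercase "A"s, as above.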
So, our original scenario to replicate the RAID-5 from hardware RAID to software (md) RAID while minimizing downtime would go about like this (see the command sketch after the list):
o create the new target md device, note its precise size
o stop all I/O on the old device (e.g. unmount it)
o note the size of the old - it must not be larger than the new; if necessary shrink and/or repartition, etc. or the like as appropriate
o create the dm device as a RAID-1 between the old and new, syncing to the new
o return I/O to service (using the dm device in place of the old device)
o after the sync has completed (may take hours to days for larger/slower storage), as before, stop all I/O - except now on the dm device instead of the old device; again check/wait until it's fully synced
o tear down the dm device
o adjust labels, UUIDs, etc. if/as applicable on the old so as not to conflict with the new
o return I/O to service, using the new in place of the old
o if/as desired, tear down or decommission the old
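Putting that together as a hedged command sketch - the device names, metadata devices, and mount point here are hypothetical placeholders, to be substituted with the real ones:
// OLD = the hardware RAID-5 LUN, NEW = the new md raid5 array,
// META0/META1 = two small spare block devices for the dm-raid metadata,
// /srv/data = wherever the filesystem lives.
# OLD=/dev/sdb NEW=/dev/md0 META0=/dev/sdc1 META1=/dev/sdc2
# SECTORS=$(cat /sys/block/md0/size)
// Stop I/O on the old device:
# umount /srv/data
# dmsetup create migr1 --table "0 $SECTORS raid raid1 5 0 region_size 32 rebuild 1 2 $META0 $OLD $META1 $NEW"
// Back in service, now via the dm device, while the sync runs:
# mount /dev/mapper/migr1 /srv/data
// ... hours to days later, once dmsetup status migr1 shows all "A"s and
// <n>/<n> sectors ...
# umount /srv/data
# dmsetup status migr1
# dmsetup remove migr1
// Adjust label/UUID/etc. on $OLD if it will remain attached, then:
# mount "$NEW" /srv/data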
The key advantage is the relative minimization of downtime - across the hours (or days or more) while the RAID-1 mirror is syncing to duplicate the storage data, no downtime is needed, and the storage can be used as usual. There's a bit of downtime for the shuffling about, notably switching from old to dm, and then from dm to new, but other than that, things generally remain online and actively available. Additionally, no headers or encapsulation need be written on the old or new storage itself - that's all handled externally, which keeps things clean and relatively simple.