Linux Software RAID, hands-on introduction
I've always wanted to get into Linux software RAID setups, but my laptop only has one hard disk, and my other machines are still packed in boxes from the recent move into the new flat.
So I had the idea of using several flash cards in my card reader, each appearing as a separate SCSI device in Linux. That failed because of a faulty SMC card, which left me with only two working "drives", and two drives aren't enough for some RAID types (RAID5, for example, needs at least three).
Last resort: install VMware, create a bunch of virtual disks and get cracking. And that's what I did. I created a VMware virtual machine with five disks and installed Linux on the first one:
# dmesg | grep SCSI | grep hdwr | sort | uniq
SCSI device sda: 1677721 512-byte hdwr sectors (859 MB)
SCSI device sdb: 209715 512-byte hdwr sectors (107 MB)
SCSI device sdc: 209715 512-byte hdwr sectors (107 MB)
SCSI device sdd: 209715 512-byte hdwr sectors (107 MB)
SCSI device sde: 209715 512-byte hdwr sectors (107 MB)
Preparation
First of all, I created a partition on each of the RAID disks (sdb, sdc, sdd and sde) with partition type 0xFD (Linux raid autodetect):
# sfdisk -l -uM /dev/sdb
Disk /dev/sdb: 102 cylinders, 64 heads, 32 sectors/track
Units = mebibytes of 1048576 bytes, blocks of 1024 bytes, counting from 0
Device Boot Start End MiB #blocks Id System
/dev/sdb1 0+ 101 102- 104432 fd Linux raid autodetect
/dev/sdb2 0 - 0 0 0 Empty
/dev/sdb3 0 - 0 0 0 Empty
/dev/sdb4 0 - 0 0 0 Empty
All four disks were partitioned exactly like this. This isn't a requirement, but with most RAID types you're limited by the size of the smallest disk (e.g., in mirroring mode, the resulting array is only as large as your smallest disk).
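To make the size rules concrete, here's a small shell sketch that computes the usable array size from the member sizes for the three levels covered in this entry. `raid_capacity` is a made-up helper, not part of mdadm, and it ignores metadata overhead and chunk rounding. So feed it the per-member size mdadm reports ("size set to 104320K" below) rather than the raw partition size. RAID0 gets the sum of all members because md's raid0 keeps striping across the larger disks once the smallest one is full. RAID1 and RAID5 are limited by the smallest member.

```sh
# Hypothetical helper: usable size (KB) for RAID level $1 and member sizes $2...
# Ignores superblock overhead and chunk-size rounding.
raid_capacity() {
  local level="$1"
  shift
  local n="$#" min="$1" sum=0 s
  for s in "$@"; do
    if [ "$s" -lt "$min" ]; then min="$s"; fi   # track the smallest member
    sum=$((sum + s))
  done
  case $level in
    0) echo "$sum" ;;                 # md raid0 stripes across all available space
    1) echo "$min" ;;                 # every member holds a full copy
    5)
      if [ "$n" -lt 3 ]; then
        echo "RAID5 needs at least three disks" >&2
        return 1
      fi
      echo $(( (n - 1) * min )) ;;    # one member's worth of space goes to parity
    *)
      echo "unsupported level: $level" >&2
      return 1 ;;
  esac
}
```

With the 104320K members used here, `raid_capacity 0 104320 104320 104320 104320` prints 417280 and `raid_capacity 5 104320 104320 104320` prints 208640. These match the block counts that show up in /proc/mdstat further down.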
RAID0 (Striping)
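Striping is used to increase performance when you are hitting the I/O limit of a single hard disk or controller. Instead of reading from or writing to only one drive, data is split into chunks (64K by default, as mdadm reports below) that are spread evenly across two or more drives. With two disks you get up to about twice the I/O bandwidth of a single disk, with four disks up to about four times, in the ideal case. The price is that RAID0 has no redundancy at all: if any member disk fails, the data on the whole array is lost.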
Setting this up in Linux works like this:
# mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
This line simply says: "Create a RAID array /dev/md0 with RAID level 0 (striping), consisting of the four devices /dev/sdb1, /dev/sdc1, /dev/sdd1 and /dev/sde1".
And the output of the command should look something like this:
mdadm: chunk size defaults to 64K
md: bind<sdb1>
md: bind<sdc1>
md: bind<sdd1>
md: bind<sde1>
md0: setting max_sectors to 128, segment boundary to 32767
raid0: looking at sde1
raid0: comparing sde1(104320) with sde1(104320)
raid0: END
raid0: ==> UNIQUE
raid0: 1 zones
raid0: looking at sdd1
raid0: comparing sdd1(104320) with sde1(104320)
raid0: EQUAL
raid0: looking at sdc1
raid0: comparing sdc1(104320) with sde1(104320)
raid0: EQUAL
raid0: looking at sdb1
raid0: comparing sdb1(104320) with sde1(104320)
raid0: EQUAL
raid0: FINAL 1 zones
raid0: done.
raid0 : md_size is 417280 blocks.
raid0 : conf->hash_spacing is 417280 blocks.
raid0 : nb_zone is 1.
raid0 : Allocating 4 bytes for hash.
mdadm: array /dev/md0 started.
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid0 sde1[3] sdd1[2] sdc1[1] sdb1[0]
417280 blocks 64k chunks
unused devices: <none>
There you have it: a RAID array with the combined size of all four disks (4 × 104320 = 417280 blocks) and higher I/O bandwidth, because data is distributed among the member disks of the array.
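If you want to picture how the data is laid out, the mapping for an array like this one boils down to a bit of integer arithmetic. The array has four equally sized members and 64k chunks, as shown in /proc/mdstat. This is a simplified sketch that assumes a single zone, i.e. equal-sized members. `raid0_locate` is a hypothetical helper, not an mdadm feature. Disk index N corresponds to the member shown as `[N]` in /proc/mdstat, so disk 0 is sdb1.

```sh
# Hypothetical helper: map a logical offset (KB) on a RAID0 array to
# (member index, offset within that member's data area), assuming equally
# sized members, i.e. a single raid0 zone.
raid0_locate() {
  local offset_kb="$1" ndisks="$2" chunk_kb="${3:-64}"   # 64K chunk by default
  local chunk=$((offset_kb / chunk_kb))                   # logical chunk number
  local disk=$((chunk % ndisks))                          # chunks go round-robin
  local member_offset=$(( (chunk / ndisks) * chunk_kb + offset_kb % chunk_kb ))
  echo "disk $disk offset ${member_offset}K"
}
```

`raid0_locate 300 4` prints `disk 0 offset 108K`. Offset 300K falls into chunk 4, which wraps around to the first disk as that disk's second chunk.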
RAID1 (mirroring)
Sometimes speed is not what you want, but safety is. That's where mirroring comes in.
All data is written not just to one disk, but to two (or more) disks simultaneously. This may hurt write performance, especially when both disks are on the same controller, which is also another single point of failure and thus means less safety. Reading from a mirror, on the other hand, is much like reading from a stripe set: reads can be spread across the disks, so you can use the I/O bandwidth of both disks to fetch data.
A mirror is set up very much like a stripe set, so I spiced this example up a bit by adding a hot spare. A hot spare disk is used whenever a disk in the RAID array fails. It simply takes the failed disk's place and is synchronized with the other disks, and the array is as healthy as before. The only difference is that you now have one spare fewer and should replace the failed disk.
mdadm, the Linux RAID administration tool, takes two options named "--raid-devices" and "--spare-devices". Each one expects the number of disks in that category, and the device names at the end of the command line are assigned accordingly, in order:
# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=3 --spare-devices=1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
In this example, we use three RAID disks and one hot spare. As stated above, the first three device names become RAID devices, and the remaining one becomes the hot spare.
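The assignment rule is purely positional. The following sketch just prints which role each trailing device name would get. `assign_devices` is a hypothetical helper that mimics the rule described above; it doesn't call mdadm. It also rejects a device list whose length doesn't match the two counts.

```sh
# Hypothetical helper: show how the device names after --raid-devices=N
# --spare-devices=M are assigned (first N are RAID devices, rest are spares).
assign_devices() {
  local raid_n="$1" spare_n="$2"
  shift 2
  if [ "$#" -ne $((raid_n + spare_n)) ]; then
    echo "expected $((raid_n + spare_n)) devices, got $#" >&2
    return 1
  fi
  local i=0 dev
  for dev in "$@"; do
    if [ "$i" -lt "$raid_n" ]; then echo "raid  $dev"; else echo "spare $dev"; fi
    i=$((i + 1))
  done
}
```

`assign_devices 3 1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1` lists sdb1, sdc1 and sdd1 as RAID devices and sde1 as the spare.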
The output of the command looks like this:
mdadm: size set to 104320K
md: bind
md: bind
md: bind
md: bind
md: md0: raid array is not clean -- starting background reconstruction
raid1: raid set md0 active with 3 out of 3 mirrors
md: syncing RAID array md0
md: minimum guaranteed reconstruction speed: 1000 KB/sec/disc.
md: using maximum idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 104320 blocks.
mdadm: array /dev/md0 started.
md: md0: sync done.
RAID1 conf printout:
--- wd:3 rd:3
disk 0, wo:0, o:1, dev:sdb1
disk 1, wo:0, o:1, dev:sdc1
disk 2, wo:0, o:1, dev:sdd1
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sde1[3](S) sdd1[2] sdc1[1] sdb1[0]
104320 blocks [3/3] [UUU]
unused devices: <none>
You can easily spot spare disks in /proc/mdstat by looking for devices with an "(S)" (for "spare") next to them.
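If you'd rather not eyeball this, a few lines of awk can pull the flagged members out of /proc/mdstat. This is a sketch that assumes the mdstat format shown in this entry, and `mdstat_special_members` is a made-up name. Besides "(S)", it also reports members flagged "(F)", which is how md marks a faulty disk in /proc/mdstat.

```sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "array device role" for every spare (S) or faulty (F) member.
mdstat_special_members() {
  awk '/^md[0-9]+ : / {
    for (i = 3; i <= NF; i++) {
      role = ""
      if ($i ~ /\(S\)$/) role = "spare"
      else if ($i ~ /\(F\)$/) role = "faulty"
      if (role != "") {
        dev = $i
        sub(/\[.*$/, "", dev)   # strip "[slot](flag)" from the device name
        print $1, dev, role
      }
    }
  }'
}
```

Run it as `mdstat_special_members < /proc/mdstat`. For the array above, it prints `md0 sde1 spare`.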
RAID5 (stripe set with distributed parity)
RAID5 is basically another form of RAID0: data is distributed across a bunch of disks. But unlike RAID0, RAID5 can survive the failure of one disk without losing access to the data. It does this by writing parity information in addition to the data. This type of array needs at least three disks, and three disks only yield the space of two of them (in general, N disks yield the space of N-1). That's because for every two chunks of data written, a third chunk holding parity information is written to the remaining disk, and the disk holding the parity rotates from stripe to stripe. So for the first stripe this might be "data, data, parity", for the second "data, parity, data", for the third "parity, data, data", and so on.
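The rotating parity pattern is easy to generate. This sketch only models where the parity chunk goes: it moves one disk to the left with every stripe, which is how the left-symmetric layout reported by mdadm further down places parity. It doesn't model md's placement of the individual data chunks. `raid5_parity_pattern` is a hypothetical helper.

```sh
# Hypothetical helper: print the parity position for the first $2 stripes
# of a RAID5 array with $1 disks (parity starts on the last disk and moves
# one disk to the left per stripe).
raid5_parity_pattern() {
  local n="$1" rows="$2" r=0 d pd line
  while [ "$r" -lt "$rows" ]; do
    pd=$((n - 1 - r % n))        # disk holding parity in stripe r
    line="" d=0
    while [ "$d" -lt "$n" ]; do
      if [ "$d" -eq "$pd" ]; then line="${line}parity"; else line="${line}data"; fi
      if [ "$d" -lt $((n - 1)) ]; then line="${line}, "; fi
      d=$((d + 1))
    done
    echo "$line"
    r=$((r + 1))
  done
}
```

`raid5_parity_pattern 3 3` prints exactly the three patterns quoted above, one stripe per line.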
What makes it worth sacrificing a third of the space (when using three disks) for parity information is that any chunk in a stripe can be recovered by a simple calculation: the parity chunk is the bitwise XOR of the data chunks. If you lose the disk with the parity chunk on it, no sweat, it can easily be recalculated from the two data chunks (just as it is on every write to the array). And if you lose one of the data chunks, it can be recreated by XORing the parity chunk with the remaining data chunk.
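To see why a single lost chunk can always be rebuilt, here's the XOR arithmetic in plain shell. The two four-byte "chunks" are made-up sample data (real md chunks are 64K here). `xor_chunks` is a hypothetical helper that XORs two equally long lists of hex bytes.

```sh
# Hypothetical helper: XOR two equally long, space-separated lists of hex
# bytes and print the result in the same format.
xor_chunks() {
  local a="$1" b="$2" out="" x y
  while [ -n "$a" ]; do
    x=${a%% *}                                  # first byte of chunk a
    y=${b%% *}                                  # first byte of chunk b
    out="$out $(printf '%02X' "$((0x$x ^ 0x$y))")"
    case $a in *" "*) a=${a#* } ;; *) a="" ;; esac   # drop the processed byte
    case $b in *" "*) b=${b#* } ;; *) b="" ;; esac
  done
  echo "${out# }"
}

data0="48 65 6C 6C"   # made-up chunk on disk 0 (the bytes of "Hell")
data1="6F 21 0A 00"   # made-up chunk on disk 1 (the bytes of "o!", newline, NUL)
parity=$(xor_chunks "$data0" "$data1")     # written to disk 2
rebuilt=$(xor_chunks "$data0" "$parity")   # disk 1 died: recompute its chunk
echo "parity:  $parity"    # parity:  27 44 66 6C
echo "rebuilt: $rebuilt"   # rebuilt: 6F 21 0A 00
```

XORing the parity with data1 instead gives back data0, so it doesn't matter which of the three disks goes missing.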
Building a RAID 5 array with mdadm is done like this:
# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
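As in the previous example, we add one hot spare to the three-disk array. Don't be alarmed that the array first comes up with only 2 out of 3 devices. When creating a RAID5 array, mdadm deliberately starts it degraded and then rebuilds the last RAID device (sdd1 here), which is generally faster than resyncing the parity of a fresh, non-degraded array.
The output from mdadm looks like this: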
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 64K
mdadm: size set to 104320K
md: bind
md: bind
md: bind
md: bind
raid5: device sdc1 operational as raid disk 1
raid5: device sdb1 operational as raid disk 0
raid5: allocated 3163kB for md0
raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
RAID5 conf printout:
--- rd:3 wd:2 fd:1
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sdc1
RAID5 conf printout:
--- rd:3 wd:2 fd:1
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sdc1
disk 2, o:1, dev:sdd1
RAID5 conf printout:
--- rd:3 wd:2 fd:1
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sdc1
disk 2, o:1, dev:sdd1
md: syncing RAID array md0
md: minimum guaranteed reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 104320 blocks.
mdadm: array /dev/md0 started.
md: md0: sync done.
RAID5 conf printout:
--- rd:3 wd:3 fd:0
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sdc1
disk 2, o:1, dev:sdd1
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[2] sde1[3](S) sdc1[1] sdb1[0]
208640 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
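The `[3/3] [UUU]` part is the array's health at a glance. It shows the number of devices the array should have, the number currently working, and one letter per slot ("U" for up, "_" for missing). Below is a sketch of a check that flags degraded arrays, assuming the mdstat format shown here. `mdstat_health` is a hypothetical helper; for real monitoring, `mdadm --monitor` is the tool built for the job.

```sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print
# "array healthy|degraded working/total" for arrays that report [total/working].
mdstat_health() {
  awk '
    /^md[0-9]+ : / { array = $1; next }
    array != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
      split(substr($0, RSTART + 1, RLENGTH - 2), c, "/")   # c[1]=total, c[2]=working
      status = (c[2] + 0 < c[1] + 0) ? "degraded" : "healthy"
      print array, status, c[2] "/" c[1]
      array = ""
    }'
}
```

`mdstat_health < /proc/mdstat` prints `md0 healthy 3/3` for the array above. RAID0 arrays have no such counter and are skipped.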
Hot spare usage
Last but not least, here's what you see when a disk fails (I simulated the failure by marking the disk as faulty with mdadm):
# mdadm --manage --set-faulty /dev/md0 /dev/sdc1
raid5: Disk failure on sdc1, disabling device. Operation continuing on 2 devices
RAID5 conf printout:
--- rd:3 wd:2 fd:1
disk 0, o:1, dev:sdb1
disk 1, o:0, dev:sdc1
disk 2, o:1, dev:sdd1
mdadm: set /dev/sdc1 faulty in /dev/md0
RAID5 conf printout:
--- rd:3 wd:2 fd:1
disk 0, o:1, dev:sdb1
disk 2, o:1, dev:sdd1
RAID5 conf printout:
--- rd:3 wd:2 fd:1
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sde1
disk 2, o:1, dev:sdd1
md: syncing RAID array md0
md: minimum guaranteed reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 104320 blocks.
md: md0: sync done.
RAID5 conf printout:
--- rd:3 wd:3 fd:0
disk 0, o:1, dev:sdb1
disk 1, o:1, dev:sde1
disk 2, o:1, dev:sdd1
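As the last printout shows, the spare sde1 has taken over sdc1's slot as disk 1 and the array is back to three working disks (wd:3), but there's no spare left. The usual next step is to remove the failed member and add a replacement, which then becomes the new hot spare. Here's a sketch of that step. `replace_member` is a hypothetical wrapper around two real mdadm manage-mode operations (--remove and --add). Set `MDADM=echo` to print the commands instead of running them. In this VMware setup sdc1 isn't actually broken, so you could simply add it back.

```sh
# Hypothetical wrapper: retire a failed member and add a new device, which md
# uses as a hot spare (or rebuilds onto, if the array is still degraded).
# MDADM=echo prints the commands instead of executing them.
replace_member() {
  local array="$1" failed="$2" new="$3" mdadm="${MDADM:-mdadm}"
  "$mdadm" --manage "$array" --remove "$failed" || return
  "$mdadm" --manage "$array" --add "$new"
}
```

For the failure above, `replace_member /dev/md0 /dev/sdc1 /dev/sdc1` would put sdc1 back in as the spare.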
And this concludes this entry about creating RAID arrays on Linux.

