Swap RAID Drives

(Adapted from: https://community.spiceworks.com/how_to/36066-replacing-a-failed-drive-in-a-linux-software-raid1-configuration-mdraid)


Scenario:

A drive has failed in your Linux RAID1 configuration and you need to replace it.

Solution:

Use mdadm to fail the drive partition(s) and remove it from the RAID array.

Physically replace the drive in the system.

Create the same partition table on the new drive that existed on the old drive.

Add the drive partition(s) back into the RAID array.

In this example there are two drives, /dev/sdi and /dev/sdj. Each drive has three partitions, and each pair of matching partitions is mirrored into its own RAID1 array, denoted md#. We will assume that /dev/sdi has failed and needs to be replaced.

Note that Linux software RAID builds arrays from partitions, so you mirror individual partitions rather than entire disks.

4 Steps total

Step 1: Identify the faulty drive and array

Identify which RAID arrays have failed:

To see whether a RAID array has failed, look at the status string such as [UU]. Each "U" represents a healthy member of the array: [UU] means the array is healthy, while a missing "U", such as [U_], means the array is degraded or has a faulty member.

# cat /proc/mdstat

Personalities : [raid1]

md0 : active raid1 sdj1[0]

102336 blocks super 1.0 [2/1] [U_]


md2 : active raid1 sdj3[0]

233147200 blocks super 1.1 [2/1] [U_]

bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active (auto-read-only) raid1 sdj2[0]

1048000 blocks super 1.1 [2/1] [U_]

From the above output we can see that RAID arrays md0, md1, and md2 are missing a "U" and are degraded or faulty.
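
For more detail on a specific array, including which member device is marked faulty, you can also query it directly with mdadm:

# mdadm --detail /dev/md0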

Step 2: Remove the failed partition(s) and drive

Before we can physically remove the hard drive from the system, we must first "fail" its partition(s) in every RAID array the drive belongs to. In our example /dev/sdi is a member of all three RAID arrays, so even if only one array showed as degraded, the drive's partitions would still have to be failed in all three arrays before removing the drive.

To fail the partitions we issue the following command:

# mdadm --manage /dev/md0 --fail /dev/sdi1

# mdadm --manage /dev/md1 --fail /dev/sdi2

# mdadm --manage /dev/md2 --fail /dev/sdi3
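
If you want to confirm the partitions are now marked faulty, /proc/mdstat should show each failed member flagged with (F), for example sdi1[1](F):

# grep sdi /proc/mdstat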

To remove the partitions from the RAID array:

# mdadm --manage /dev/md0 --remove /dev/sdi1

# mdadm --manage /dev/md1 --remove /dev/sdi2

# mdadm --manage /dev/md2 --remove /dev/sdi3
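
As a side note, mdadm also accepts both operations in one invocation per array, so the fail and remove steps can be combined if preferred, for example:

# mdadm /dev/md0 --fail /dev/sdi1 --remove /dev/sdi1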

Now you can power off the system and physically replace the defective drive:

# shutdown -h now
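
If you are not sure which physical disk corresponds to /dev/sdi, note its serial number before shutting down so you pull the right drive. One quick way, assuming the smartmontools package is installed:

# smartctl -i /dev/sdi | grep -i serial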

Step 3: Partition the new disk

Now that the new hard drive is installed, it needs the exact same partition table structure that was on the old drive before it can rejoin the arrays. We can mirror the surviving drive's partition table onto the new drive with a single command:

# sfdisk -d /dev/sdj | sfdisk /dev/sdi
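
This dump-and-restore approach works for MBR disks and for GPT disks with a reasonably recent sfdisk. If the drives use GPT and your sfdisk does not understand it, the sgdisk tool from the gdisk package can copy the partition table instead, after which the GUIDs should be randomized so the two disks do not clash:

# sgdisk -R=/dev/sdi /dev/sdj

# sgdisk -G /dev/sdi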

Note that when drives are removed and replaced, device names can change. Before copying the partition table, confirm that the replacement drive really is /dev/sdi by issuing "fdisk -l /dev/sdi" and verifying that no partitions exist on it.
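
For example, the brand-new drive should list its size and geometry but no partition entries:

# fdisk -l /dev/sdi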

Step 4: Add the new partitions to the RAID arrays

Now that the partitions are configured on the newly installed hard drive, we can add them to the RAID arrays.

# mdadm --manage /dev/md0 --add /dev/sdi1

mdadm: added /dev/sdi1

Repeat this command for each remaining partition, changing /dev/md# and /dev/sdi# accordingly:

# mdadm --manage /dev/md1 --add /dev/sdi2

mdadm: added /dev/sdi2

# mdadm --manage /dev/md2 --add /dev/sdi3

mdadm: added /dev/sdi3

Now we can check that the partitions are being synchronized by issuing:

# cat /proc/mdstat

Personalities : [raid1]

md0 : active raid1 sdi1[2] sdj1[0]

102336 blocks super 1.0 [2/2] [UU]


md2 : active raid1 sdi3[2] sdj3[0]

233147200 blocks super 1.1 [2/1] [U_]

[>....................] recovery = 0.4% (968576/233147200) finish=15.9min speed=242144K/sec

bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid1 sdi2[2] sdj2[0]

1048000 blocks super 1.1 [2/2] [UU]
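
To follow the rebuild live, watch (typically provided by the procps package) re-runs the command every two seconds:

# watch cat /proc/mdstat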

Once all arrays have finished synchronizing, your RAID configuration will be back to normal.