It's plugging hot!

After vlad died last week, I rebuilt him with a new hard drive in the main system RAID array. This drive was twice the size of the old one – 160GB, not 80GB – so I had a bunch of spare space not being used. Yesterday, I bought another 160GB drive, and decided to test the whole SATA hotplug thing…

It works. Beautifully.

However, I wouldn’t recommend trying it without LVM on your side, and probably the RAID subsystem too. Here’s what I did.

First, check that I really do know what my RAID configuration is:

$ cat /proc/mdstat
Personalities : [raid1] 
md0 : active raid1 sda2[0] sdb2[1]
      78019584 blocks [2/2] [UU]
      
unused devices: <none>

$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun May 14 18:37:29 2006
     Raid Level : raid1
     Array Size : 78019584 (74.41 GiB 79.89 GB)
    Device Size : 78019584 (74.41 GiB 79.89 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sun Apr 20 18:47:27 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 368229f2:0e38f898:b369f97a:73d67d7e
         Events : 0.14487488

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2

I have sda2 and sdb2 in the array, and it’s fully working (so even if I do pull the wrong drive out, I’ll still have my data safe). Then make sure that I know which drive is which:

$ cat /sys/block/sda/device/model
ST380815AS
$ cat /sys/block/sdb/device/model
ST3160815AS

So, sdb is the new, larger drive and sda is the older, smaller one. The first job is to ensure that the old drive isn’t expected to be there. I suspect that I could have just yanked out the drive and let the RAID layer deal with the sudden non-existence of one of its drives, as it’s meant to do, but I was already about to do something that my 25+ years’ experience of computers said was dangerous, so I didn’t want to invite more disaster.

Therefore, I told the RAID system to drop the old drive from the array:

$ sudo mdadm --manage /dev/md0 --fail /dev/sda2

That gave me a rude email in my inbox telling me that I’d lost a drive in my RAID array.
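
That email comes from mdadm’s monitor mode, which the Debian mdadm package runs as a daemon out of the box; the recipient is whatever the MAILADDR line in /etc/mdadm/mdadm.conf says (something like “MAILADDR sysadmin@example.org” – address made up, obviously). If you want to be sure the alerts actually get through before you need them, mdadm can fire off a test message for each array:

$ sudo mdadm --monitor --scan --oneshot --test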

Then… the moment of truth. I opened up the case, found the drive I wanted¹, and pulled out the data cable. I got a bunch of scary-looking messages in the syslog. Everything still seemed to be working. No sparks. No blue smoke. Mahler 5 played on.

Something of an anticlimax, really.

After plugging in the new drive, I got another bunch of messages in the syslog telling me that the new drive was /dev/sdd. Still no blue smoke. Comets failed to pass overhead. No two-headed lambs were reported in the village outside the castle.
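
Before doing anything to it, it’s worth confirming that sdd really is the new drive, using the same trick as before – the model string reported should be the new 160GB Seagate’s, not one of the drives already in the array:

$ cat /sys/block/sdd/device/model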

So now for putting it all back together. First, create some partitions on the new, virgin disk. (Aha… that’s where the blood came from):

$ sudo cfdisk /dev/sdd
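
cfdisk is interactive, so there’s nothing to paste here. An alternative, if you want the new disk partitioned identically to the old one rather than laid out by hand, is to copy the partition table across with sfdisk:

$ sudo sfdisk -d /dev/sdb | sudo sfdisk /dev/sdd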

Then, in quick succession, add the new partition to the RAID array, and remove the old drive completely (note the path for the old drive below – /dev/sda2 had already vanished from udev’s /dev once the drive was unplugged, so it has to be named via Debian’s static device tree instead):

$ sudo mdadm --manage /dev/md0 --add /dev/sdd2
$ sudo mdadm --manage /dev/md0 --remove /dev/.static/dev/sda2
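
The rebuild takes a while; the easiest way to keep an eye on it is something like:

$ watch -n 10 cat /proc/mdstat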

Finally, wait until it’s rebuilt, and check that it’s all OK:

$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun May 14 18:37:29 2006
     Raid Level : raid1
     Array Size : 78019584 (74.41 GiB 79.89 GB)
    Device Size : 78019584 (74.41 GiB 79.89 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sun Apr 20 18:47:27 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 368229f2:0e38f898:b369f97a:73d67d7e
         Events : 0.14487488

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync   /dev/sdd2
       1       8       18        1      active sync   /dev/sdb2

That looks like what I was expecting. The one remaining problem is that the array is still only 74GiB in size, despite the partitions underneath it now having considerably more space than that. This calls for some enlargement. First, grow the RAID volume to the maximum size allowed by the partitions it’s sitting on:

$ sudo mdadm --grow /dev/md0 --size max

This starts a resync process:

$ cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdd2[0] sdb2[1]
      156039232 blocks [2/2] [UU]
      [==========>..........]  resync = 50.2% (78459776/156039232) finish=20.5min speed=62884K/sec
      
unused devices: <none>

Secondly, tell LVM that the physical volume contained in the RAID array should be made bigger:

$ sudo pvresize /dev/md0
  Physical volume "/dev/md0" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized

$ sudo vgdisplay primary
  --- Volume group ---
  VG Name               primary
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  29
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                12
  Open LV               9
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               148.81 GB
  PE Size               4.00 MB
  Total PE              38095
  Alloc PE / Size       17305 / 67.60 GB
  Free  PE / Size       20790 / 81.21 GB
  VG UUID               VlBSZF-p0DK-Gm7I-sZjE-LBA0-TW5q-EgGmQV

That’s up to the size I was expecting… I’d say that’s all done (once my RAID array finishes syncing in 15 minutes’ time or so). Total server downtime from all this: nil.
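
When one of the logical volumes eventually wants some of that free space, it should just be a matter of lvextend followed by a filesystem resize – something like this, with a made-up volume name, assuming an ext3 filesystem that a recent kernel can grow while mounted:

$ sudo lvextend -L +10G /dev/primary/home
$ sudo resize2fs /dev/primary/home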


1. OK, I’ll admit it: I got the wrong drive, and pulled the data cable out of it. A few seconds later, after the audio buffer emptied, xmms stopped playing Mahler 5. However, plugging it back in and restarting the LVM volume group that the drive was in, plus NFS, got it all back.