Recovery After a Partial Hard Disk Failure

Recently a colleague called and asked whether we could help him. His Gentoo system would not boot and only showed "BOOT DRIVE FAILURE". Since he had not set up the system himself (it was his IMAP server, fax server, phone-management server and, on top of that, the web server handling the customer frontend), he was pretty much clueless about what to do. By the time I arrived at his house he had already taken the computer apart and pulled the drive that seemed to have died. Connecting this drive to my laptop via an external USB adapter showed exactly nothing: the drive did not show up, although I could hear it physically spinning up. I decided not to waste any time on it and asked him about backups. He didn’t know the details but gave me the phone number of the guy who had set up the system. After a call it was clear that there were indeed backups, but only of the application data and none of the system.

Gasping at the thought of setting up his fax server (I hate doing this on Linux) and his mail server within 3 hours (he needed the system running a.s.a.p.), I decided to have a quick look at the other drive built into the machine, hoping to find at least a few config files. So I took out the other drive, connected it to my laptop and was happy to see a working filesystem with /bin, /boot and the other usual folders, which gave me a good feeling about getting the system running quickly. However, I found out that the original maintainer of the system had decided to take backups … onto the same drive. Doh!

After I put the drive back into the server, hoping the system would start up, that second drive was reported as failing in the POST messages during boot. Oh boy! Disconnecting the drive and reconnecting it to my laptop showed that it was indeed crapping out as well and that I could not access the data anymore. What a great day. On top of that, the drive was making very loud and weird noises, indicating that it was about to say goodbye completely. Gasping even more… I started the download of an Ubuntu 8.04 server CD and went to lunch, since I was behind a 2 Mbps DSL line and downloading a CD would take about 45 minutes.

When I came back from lunch and started to burn the CD, I decided to give the first drive another chance. I connected it and – damn! – it started and I could access the data on the drive. Making sure not to waste any time, I started imaging that drive right away (who knew how many minutes the drive would keep working?).

In a VM I had Knoppix (a live distro) running. I had mounted a share from my Windows laptop in Knoppix:

mount -t cifs -o username=administrator //192.168.222.1/laptopShare /mnt/smb/

Then I started dd_rescue, writing into an image file on that share:

dd_rescue -b 4M /dev/sdb /mnt/smb/backup/sdb_image

A quick analysis showed that the disk had a capacity of 80GB, split into two partitions: partition 1 with 50GB and partition 2 with 30GB. While the drive was being imaged I started setting up the bare system from scratch, just in case I could not restore any of the system or the drive crapped out again.

After two hours the imaging process slowed down severely and the drive was making funny noises, but I had been able to read about 57GB, so chances were good that at least the first partition was rescued. The next step was to check what kind of data I had just rescued by mounting the image file (as mentioned, I didn’t want to take any chances, so I did not interrupt or delay the imaging process). Since I had not copied each partition separately, I had to find out where the partitions began, using fdisk:

debian:~# fdisk -u -l /mnt/smb/backup/sdb
You must set cylinders.
You can do this from the extra functions menu.

Disk /mnt/smb/backup/sdb: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders, total 0 sectors
Units = sectors of 1 * 512 = 512 bytes

              Device Boot      Start         End      Blocks   Id  System
/mnt/smb/backup/sdb1              63   100020689    50010313+  83  Linux
Partition 1 has different physical/logical endings:
     phys=(1023, 254, 63) logical=(6225, 254, 63)
/mnt/smb/backup/sdb2       100020690   160826714    30403012+  83  Linux
Partition 2 has different physical/logical beginnings (non-Linux?):
     phys=(1023, 254, 63) logical=(6226, 0, 1)
Partition 2 has different physical/logical endings:
     phys=(1023, 254, 63) logical=(10010, 254, 63)

In this case I wanted to know the offset of partition 1, which starts at sector 63. Multiply this by the sector size (512 bytes per sector) and you get an offset of 32256 bytes.
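The same arithmetic can be sketched in a few lines of shell. This is only an illustration: the start sectors are the ones from the fdisk listing above, while the helper name byte_offset and the variable names are made up for this example.

```shell
#!/bin/sh
# Sector size as reported by fdisk ("Units = sectors of 1 * 512 = 512 bytes").
SECTOR_SIZE=512

# byte_offset START_SECTOR
# Hypothetical helper: print the byte offset at which a partition starts.
byte_offset() {
    echo $(( $1 * SECTOR_SIZE ))
}

# Start sectors copied from the "Start" column of the fdisk listing.
P1_OFFSET=$(byte_offset 63)
P2_OFFSET=$(byte_offset 100020690)

echo "partition 1 starts at byte $P1_OFFSET"
echo "partition 2 starts at byte $P2_OFFSET"
```

The two values printed are the ones passed to the offset= option of the loop mounts in this post.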

debian:~# mount -o loop,offset=32256 /mnt/smb/backup/sdb /mnt/loop
debian:~# cd /mnt/loop
debian:/mnt/loop# ls
bin   dev  home  lost+found  opt   root  sys  usr
boot  etc  lib   mnt         proc  sbin  tmp  var

Excellent. How about partition 2? It starts at sector 100020690, which gives an offset of 100020690 × 512 = 51210593280 bytes. Let’s have a look:

debian:/mnt/loop# mount -o loop,offset=51210593280 /mnt/smb/backup/sdb /mnt/loop2
debian:/mnt/loop# ls -al /mnt/loop2
total 637964
drwxr-xr-x  6 root root      4096 2009-06-13 14:35 .
drwxr-xr-x 16 root root      4096 2009-06-18 15:51 ..
-rw-r--r--  1 root root   2937770 2009-06-13 14:06 bin.tar.bz2
-rw-r--r--  1 root root    525861 2009-06-13 14:06 etc.tar.bz2
-rw-r--r--  1 root root   6510707 2009-06-13 14:07 home.tar.bz2
-rw-r--r--  1 root root   8406356 2009-06-13 14:07 lib.tar.bz2
drwx------  2 root root     16384 2009-05-01 20:59 lost+found
?---------  ? ?    ?            ?                ? /mnt/loop2/mailing
?---------  ? ?    ?            ?                ? /mnt/loop2/mails
?---------  ? ?    ?            ?                ? /mnt/loop2/test
-rw-r--r--  1 root root       182 2009-06-13 14:07 opt.tar.bz2
-rw-r--r--  1 root root  13524123 2009-06-13 14:08 root.tar.bz2
-rw-r--r--  1 root root   1883672 2009-06-13 14:08 sbin.tar.bz2
-rw-r--r--  1 root root 447615119 2009-06-13 14:35 usr.tar.bz2
-rw-r--r--  1 root root 171154246 2009-06-13 14:53 var.tar.bz2

Ah! There are the backups… on the same drive.

Since I had the image on my laptop and all I had was a rather slow USB adapter, I decided to partly install Ubuntu on the server, add a new drive to that machine and copy the image back over the network (my laptop and the server both had gigabit NICs) to save some time. There are nice ways to do this with netcat (described here, if you prefer ssh), but I decided to use the simpler approach of mapping the share from my laptop on the server.

The next issue I ran into was that the dd’ed drive did not work properly. The filesystem showed all kinds of errors, and while mounting the device I got some “attempt to access beyond end of device” errors (or similar). cfdisk showed the proper sizes, but it seemed like something else was screwed up. Even when I tried to create the filesystem again on that partition, it showed a far too small partition size. I decided to create a larger partition manually (>50GB) and then copy back just the first partition (the second partition had no value to me anyway since it was incomplete). I calculated the offset on the target drive (see above) and started another dd, supplying the source and target offsets. That worked fine and the data was consistent and accessible afterwards.
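I did not keep the exact dd command line, so the following is only a sketch of how such a copy with source and target offsets can look, assuming dd's skip, seek and count operands with a 512-byte block size. The helper name copy_partition is made up, and the demo runs on small scratch files rather than on real disks so that it is safe to try.

```shell
#!/bin/sh
# copy_partition SRC DST SRC_START DST_START COUNT
# Hypothetical helper: copy COUNT 512-byte sectors, starting at sector
# SRC_START of SRC, to sector DST_START of DST. conv=notrunc keeps the
# rest of DST untouched when DST is a regular file.
copy_partition() {
    dd if="$1" of="$2" bs=512 skip="$3" seek="$4" count="$5" \
       conv=notrunc 2>/dev/null
}

# Demo on scratch files (stand-ins for the image and the new drive).
WORK=$(mktemp -d)
dd if=/dev/urandom of="$WORK/image"  bs=512 count=8  2>/dev/null  # fake image
dd if=/dev/zero    of="$WORK/target" bs=512 count=16 2>/dev/null  # fake disk

# Copy 4 sectors from sector 2 of the image to sector 5 of the target.
copy_partition "$WORK/image" "$WORK/target" 2 5 4

# Check that both regions now hold the same bytes.
dd if="$WORK/image"  of="$WORK/src_part" bs=512 skip=2 count=4 2>/dev/null
dd if="$WORK/target" of="$WORK/dst_part" bs=512 skip=5 count=4 2>/dev/null
cmp "$WORK/src_part" "$WORK/dst_part" && echo "regions match"
```

For the image in this post, the source start would be sector 63 and the count 100020627 sectors (100020689 - 63 + 1, from the fdisk listing); the target start is whatever sector the newly created partition begins at. A 512-byte block size keeps the numbers identical to the fdisk output but is slow for 50GB.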

Now let’s get this drive booted. grub! It’s been ages since I last used grub, so I had to get it done by reading, trial and error. Basically these were my steps:

  1. Boot the server from a Knoppix live system (so that drive names are not mixed up; for some reason the Ubuntu distro showed sda’s instead of hda’s)
  2. Copy over the grub bootloader files (stage1 etc.) from /usr/lib/grub/i386-pc/ to /boot/grub/
  3. If you are using Knoppix you have to get around the "/dev/null: Permission denied" error.
  4. chroot into the path where you have mounted the partition you want to boot from (more details here)
  5. invoke a proper grub-install command
  6. edit the grub menu, it may look something like
    title           Ubuntu, kernel 2.6.15-25-k7 (recovery mode)
    root            (hd0,0)
    kernel          /boot/vmlinuz-2.6.15-25-k7 root=/dev/sda1 ro single
    initrd          /boot/initrd.img-2.6.15-25-k7
  7. boot
  8. reboot the system
  9. you should be done.
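Steps 3 to 5 can be sketched roughly as follows. This is not the exact command sequence I typed. It assumes that the Permission denied error comes from the partition being mounted without device-node support and that remounting with the dev option helps; the mount point /mnt/hda1, the disk /dev/hda and the helper names are placeholders. By default the script only prints the commands.

```shell
#!/bin/sh
# Print the commands by default; set DRY_RUN=0 (as root) to execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# install_grub ROOT_MOUNT BOOT_DISK (hypothetical helper)
install_grub() {
    root_mnt=$1
    boot_disk=$2
    # Assumption: allow device nodes on the mounted partition, which
    # should avoid "/dev/null: Permission denied" inside the chroot.
    run mount -o remount,dev "$root_mnt"
    # Make the live system's /dev and a fresh /proc visible in the chroot.
    run mount --bind /dev "$root_mnt/dev"
    run mount -t proc proc "$root_mnt/proc"
    # Run grub-install from inside the restored system.
    run chroot "$root_mnt" grub-install "$boot_disk"
}

# Placeholder mount point and disk name; adjust to your system.
PLAN=$(install_grub /mnt/hda1 /dev/hda)
echo "$PLAN"
```

Printing the plan first makes it easy to double-check the device names before anything is written to a boot sector.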

After that I could boot into the system… and encountered a freeze, which happened because I had forgotten to edit the fstab. Correcting it made the system boot up properly, and all of the services were accessible afterwards.
