tech notes

Recovering from a Partial Hard Disk Failure

Recently a colleague called and asked whether we could help him. His Gentoo system would not boot and only showed "BOOT DRIVE FAILURE". Since he had not set up the system himself (it was his IMAP server, fax server, phone-management server and, on top of that, the web server for his customer frontend), he was pretty much clueless about what to do.

When I arrived at his house he had already taken the computer apart and pulled the drive that seemed to have died. Connecting this drive to my laptop via an external USB adapter showed exactly nothing: the drive did not show up, although I could hear it spinning up. I decided not to waste any time on this drive and asked him about backups. He didn’t know the details but gave me the phone number of the guy who had set up the system. After a call it was clear that there were indeed backups, but only of the application data and none of the system.

Gasping at the thought of setting up his fax server (I hate doing this in Linux) and his mail server within 3 hours (he needed the system running a.s.a.p.), I decided to have a quick look at the other drive in the machine, hoping to find at least a few config files. So I took out the other drive, connected it to my laptop and was happy to see a working filesystem with /bin, /boot etc., which gave me a good feeling about getting the system running quickly. However, I found out that the original maintainer had decided to take the backups … onto the same drive. Doh!

I put the drive back into the server, hoping that the system would start up, but now this second drive was reported as failing in the POST messages during boot. Oh boy! Disconnecting the drive and reconnecting it to my laptop showed that this drive was indeed dying as well and that I could not access the data anymore. What a great day. On top of that, the drive was making very loud and weird noises, indicating that it was about to say goodbye completely. Gasping even more… I started downloading an Ubuntu 8.04 Server CD and went to lunch, since I was behind a 2 Mbps DSL line and downloading a CD would take about 45 minutes.

When I came back from lunch and started burning the CD, I decided to give the first drive another chance. I connected it and, damn, it spun up and I could access the data on it. Not wanting to waste any time, I immediately started imaging the drive (who knew how many minutes it would keep working?).

I had Knoppix (a live distro) running in a VM and had mounted a share from my Windows laptop in it:

mount -t cifs -o username=administrator //192.168.222.1/laptopShare /mnt/smb/

Then I started dd_rescue, writing into an image file:

dd_rescue -b 4M /dev/sdb /mnt/smb/backup/sdb_image

A quick analysis showed that the disk had a capacity of 80GB, split into two partitions: partition 1 with 50GB and partition 2 with 30GB. While the drive was being imaged I started setting up a bare system from scratch, just in case I could not restore any of the old system or the drive died again.

After two hours the imaging process slowed down severely and the drive was making funny noises, but I had been able to read about 57GB, so chances were good that at least the first partition was rescued. The next step was to mount the image file and check what kind of data I had just rescued (as mentioned, I didn’t want to take any chances, so I did not interrupt or delay the imaging process). Since I had not copied each partition separately, I had to find out where the partitions began, using fdisk:

debian:~# fdisk -u -l /mnt/smb/backup/sdb
You must set cylinders.
You can do this from the extra functions menu.

Disk /mnt/smb/backup/sdb: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders, total 0 sectors
Units = sectors of 1 * 512 = 512 bytes

              Device Boot      Start         End      Blocks   Id  System
/mnt/smb/backup/sdb1              63   100020689    50010313+  83  Linux
Partition 1 has different physical/logical endings:
     phys=(1023, 254, 63) logical=(6225, 254, 63)
/mnt/smb/backup/sdb2       100020690   160826714    30403012+  83  Linux
Partition 2 has different physical/logical beginnings (non-Linux?):
     phys=(1023, 254, 63) logical=(6226, 0, 1)
Partition 2 has different physical/logical endings:
     phys=(1023, 254, 63) logical=(10010, 254, 63)

In this case I wanted to know the byte offset of partition 1, which starts at sector 63. Multiply this by the sector size (512 bytes/sector) and you get an offset of 32256 bytes.
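As a quick sanity check, the offset arithmetic can be sketched in the shell. The start sectors and the 512-byte sector size come from the fdisk output above; the variable names are only for illustration.

```shell
# Values taken from the "fdisk -u -l" output above.
sector_size=512
part1_start=63
part2_start=100020690

# Byte offset = start sector * sector size.
part1_offset=$(( part1_start * sector_size ))
part2_offset=$(( part2_start * sector_size ))

echo "partition 1 offset: $part1_offset"
echo "partition 2 offset: $part2_offset"
```

The resulting numbers are what goes into the offset= option of mount -o loop.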

debian:~# mount -o loop,offset=32256 /mnt/smb/backup/sdb /mnt/loop
debian:~# cd /mnt/loop
debian:/mnt/loop# ls
bin   dev  home  lost+found  opt   root  sys  usr
boot  etc  lib   mnt         proc  sbin  tmp  var

Excellent. How about partition 2? Let’s have a look:

debian:/mnt/loop# mount -o loop,offset=51210593280 /mnt/smb/backup/sdb /mnt/loop2
debian:/mnt/loop# ls -al /mnt/loop2
total 637964
drwxr-xr-x  6 root root      4096 2009-06-13 14:35 .
drwxr-xr-x 16 root root      4096 2009-06-18 15:51 ..
-rw-r--r--  1 root root   2937770 2009-06-13 14:06 bin.tar.bz2
-rw-r--r--  1 root root    525861 2009-06-13 14:06 etc.tar.bz2
-rw-r--r--  1 root root   6510707 2009-06-13 14:07 home.tar.bz2
-rw-r--r--  1 root root   8406356 2009-06-13 14:07 lib.tar.bz2
drwx------  2 root root     16384 2009-05-01 20:59 lost+found
?---------  ? ?    ?            ?                ? /mnt/loop2/mailing
?---------  ? ?    ?            ?                ? /mnt/loop2/mails
?---------  ? ?    ?            ?                ? /mnt/loop2/test
-rw-r--r--  1 root root       182 2009-06-13 14:07 opt.tar.bz2
-rw-r--r--  1 root root  13524123 2009-06-13 14:08 root.tar.bz2
-rw-r--r--  1 root root   1883672 2009-06-13 14:08 sbin.tar.bz2
-rw-r--r--  1 root root 447615119 2009-06-13 14:35 usr.tar.bz2
-rw-r--r--  1 root root 171154246 2009-06-13 14:53 var.tar.bz2

Ah! There are the backups… on the same drive.
Since I had the image on my laptop and all I had was a rather slow USB adapter, I decided to do a partial Ubuntu install on the server, add a new drive to that machine and copy the image back over the network (both my laptop and the server had a gigabit NIC) to save some time. There are nice ways to do this with netcat (described here, if you prefer ssh), but I decided to take the simpler approach and mount the share from my laptop on the server.

The next issue I ran into was that the dd’ed drive did not work properly. The filesystem showed all kinds of errors, and while mounting the device I got “attempt to access beyond end of device” errors (or similar). cfdisk showed the proper sizes, but something else seemed to be screwed up. Even when I tried to create the filesystem again on that partition, it reported a far too small partition size. I decided to create a larger partition manually (>50GB) and then copy back only the first partition (the second partition had no value to me anyway, since it was incomplete). I calculated the offset on the target drive (see above) and started another dd, supplying the source and target offsets. That worked fine, and the data was consistent and accessible afterwards.
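The exact dd invocation is not shown here. As a minimal sketch of copying one partition with separate source and target offsets, assuming plain dd and small scratch files instead of real disks (all file names and sector numbers below are made up for the demo):

```shell
# Scratch files standing in for the image and the new drive (hypothetical).
workdir=$(mktemp -d)
src="$workdir/source.img"
dst="$workdir/target.img"

sector=512
src_start=4   # partition start inside the source, in sectors (made up)
dst_start=2   # partition start on the target, in sectors (made up)
count=8       # partition length in sectors (made up)

# Fake source: some padding, then random "partition" data.
dd if=/dev/zero of="$src" bs=$sector count=$src_start 2>/dev/null
head -c $(( count * sector )) /dev/urandom >> "$src"
# Fake target: zero-filled and large enough to hold the partition.
dd if=/dev/zero of="$dst" bs=$sector count=$(( dst_start + count )) 2>/dev/null

# skip= is the offset in the source, seek= the offset in the target;
# conv=notrunc keeps everything else on the target intact.
dd if="$src" of="$dst" bs=$sector skip=$src_start seek=$dst_start \
   count=$count conv=notrunc 2>/dev/null

# Compare checksums of the two regions.
src_sum=$(dd if="$src" bs=$sector skip=$src_start count=$count 2>/dev/null | cksum)
dst_sum=$(dd if="$dst" bs=$sector skip=$dst_start count=$count 2>/dev/null | cksum)
echo "source region: $src_sum"
echo "target region: $dst_sum"
```

On real hardware the source would be the image file and the target the new drive, with skip and seek taken from the respective partition tables.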

Now let’s get this drive booted. grub! It has been ages since I last used grub, so I had to get it done by reading and by trial and error. Basically these were my steps:

  1. Boot the server from a Knoppix live system (so that the drive names do not get mixed up; for some reason the Ubuntu distro showed sda’s instead of hda’s).
  2. Copy over the grub bootloader files from /usr/lib/grub/i386-pc/ (stage1 etc.) to /boot/grub/.
  3. If you are using Knoppix, you have to get around the “/dev/null: Permission denied” error.
  4. chroot into the path where you have mounted the partition you want to boot from (more details here).
  5. Invoke a proper grub-install command.
  6. Edit the grub menu; an entry may look something like this:
    title           Ubuntu, kernel 2.6.15-25-k7 (recovery mode)
    root            (hd0,0)
    kernel          /boot/vmlinuz-2.6.15-25-k7 root=/dev/sda1 ro single
    initrd          /boot/initrd.img-2.6.15-25-k7
    boot
  7. Reboot the system.
  8. You should be done.

After that I could boot into the system… and ran into a freeze, because I had forgotten to edit the fstab. After correcting it, the system booted properly and all of the services were accessible again.

Raw ext2 Recovery (Part 1)

Recovering an ext2 partition which has been partly overwritten

What a stupid mistake that was. I had my ext2 drive in my Windows box. While trying out a few things I wanted to mount a drive via iSCSI, and I accidentally formatted the local drive instead of the iSCSI drive (that explains why I got a whopping 80Mb/s when writing to that drive ;-) ). So sh** happens: I formatted that drive and copied a file of about 2GB onto it before realizing what was happening.

My first step was to add a second drive of similar size so that I could clone the drive. This would give me the freedom to try out a few recovery options without messing things up any further. The original drive was /dev/sdc, the added one /dev/sdd (the first primary partition being the one I wanted to recover):

dd_rescue -b 4M /dev/sdc /dev/sdd

This took about 3 hours for the 1.5TB.

After that my first thought was to try fixing things with e2fsck, specifying an alternate superblock since the original had been wiped out:

e2fsck -v -y -b 20480000 /dev/sdd1

Finding an alternative superblock was easy:

mkfs.ext2 -n /dev/sdd1

“-n” tells mkfs.ext2 not to actually create the filesystem, but to print what it would do if it were creating one:

1 root@grml ~ # mkfs.ext3 -n /dev/sdc1
mke2fs 1.41.6 (30-May-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
91578368 inodes, 366284000 blocks
18314200 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
11179 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848
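The backup locations are not random. Assuming the usual sparse_super layout, backups live in block group 1 and in every group whose number is a power of 3, 5 or 7, so the list can be reproduced from the two numbers in the output above (block groups and blocks per group, with the first data block at 0). A rough sketch, with variable names chosen only for illustration:

```shell
# Values taken from the mkfs output above.
blocks_per_group=32768
block_groups=11179

# Group 1 plus all powers of 3, 5 and 7 below the number of groups,
# each multiplied by the group size to get the block number.
backups=$(
  {
    echo 1
    for base in 3 5 7; do
      g=$base
      while [ "$g" -lt "$block_groups" ]; do
        echo "$g"
        g=$(( g * base ))
      done
    done
  } | sort -n | while read -r group; do
    echo $(( group * blocks_per_group ))
  done
)
echo "$backups"
```

Any of these block numbers is a candidate for the -b option of e2fsck, as long as that copy has not been overwritten too.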

Back to e2fsck. It gave me a lot of “dtime” errors, which do not seem to be too dangerous. What was quite worrying, though, was that it showed a lot of messages like this:

Multiply-claimed block(s) in inode 40198333: 163480071

After rerunning all previous steps, I made sure that the output of e2fsck was saved for further analysis. It showed me a list of files which looked like this:

File /Images/_SP/2009-01.iso (Inode #27640092, mod time Wed Mar 18 18:38:07 2009)
  has 7 Multiply-claimed block(s), shared with 8 file(s):

	... (Inode #11720837, mod time Mon May 27 04:08:51 1912)
	... (Inode #11687481, mod time Mon Mar 31 03:23:00 1924)
	... (Inode #11743113, mod time Sat Aug 26 17:41:05 2000)
	... (Inode #11700725, mod time Thu Nov 17 04:16:01 1921)
	... (Inode #11807290, mod time Mon Jan  9 08:12:45 1989)
	... (Inode #11781593, mod time Tue Jun  1 19:26:38 2004)
	... (Inode #11711346, mod time Fri Jan 16 00:54:26 1925)
Multiply-claimed blocks already reassigned or cloned.

So I had a list of files which supposedly shared some blocks. At first I thought I could get away with this and that it was just some kind of bug. Further analysis showed that almost all of those files were indeed corrupted (I checked the md5 hashes). :-(

The next step was to try e2salvage. I could not compile it (the source has not been maintained for years), but I found a rescue CD called “PLD Rescue CD” which had a binary of e2salvage. Unfortunately e2salvage refused to run, complaining about the missing superblock. Even supplying an alternative superblock did not help. It started a few things but then got stuck (I copied over the superblock manually, which did not have any real effect), so I scrapped the idea of using e2salvage.

Then I tried a Windows ext recovery tool, called FIXME. It was able to recover a lot of files while maintaining file integrity. So there must be some hope of achieving this in Linux as well!

Again back to e2fsck. Maybe e2fsck was confused by all the (random) data now found in the inodes, so maybe wiping the areas that had been overwritten in my first mishap would make things easier for e2fsck?

Next step: identify the blocks that the faulty inodes are mapped to.
First filter out the inode numbers:

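# Keep only the log lines that mention an inode
grep "Inode" e2fsck.output.log > inodes
# Extract the inode number (with or without a leading "#")
sed -n 's/.*node #\{0,1\}\([0-9][0-9]*\).*/\1/p' inodes > inodes.filtered
# Drop anything non-numeric and de-duplicate
grep -E '^[0-9]+$' inodes.filtered | sort -n -u > inode.nums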
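To see what such a filter does, here is a small self-contained sketch that runs on a few sample lines copied from the e2fsck output shown earlier. The sed pattern is my own variant that accepts inode numbers with or without a leading "#"; the temporary file is just scaffolding for the demo.

```shell
# Sample lines copied from the e2fsck output shown earlier.
log=$(mktemp)
cat > "$log" <<'EOF'
File /Images/_SP/2009-01.iso (Inode #27640092, mod time Wed Mar 18 18:38:07 2009)
	... (Inode #11720837, mod time Mon May 27 04:08:51 1912)
	... (Inode #11687481, mod time Mon Mar 31 03:23:00 1924)
Multiply-claimed block(s) in inode 40198333: 163480071
Multiply-claimed blocks already reassigned or cloned.
EOF

# Print only the number that follows "Inode"/"inode", optionally after "#";
# lines without an inode number produce no output at all.
inode_nums=$(sed -n 's/.*[Ii]node #\{0,1\}\([0-9][0-9]*\).*/\1/p' "$log" | sort -n -u)
echo "$inode_nums"
rm -f "$log"
```

Using sed -n with the p flag means non-matching lines are dropped directly, so no second filtering pass is needed in this variant.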

Then ask debugfs to find the corresponding blocks:

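# Build one "imap <inode>" request per inode number
while read -r l; do echo "imap <$l>"; done < inode.nums > debugfs.todo
# Run the requests on the clone (-c: catastrophic mode, -s/-b: backup superblock and block size)
debugfs -c -b 4096 -s 229376 /dev/sdd1 -f debugfs.todo > debugfs.out
# Keep only the block numbers from the "located at block N" lines
grep located debugfs.out | sed "s#.*block \([0-9]*\).*#\1#g" | sort -n | uniq > blocks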

Voilà, a nice list of blocks covering all affected inodes. I had a look at the list and decided which areas had been overwritten in the first place. It was more or less one big chunk of blocks that were close to each other. I then zeroed out those blocks:

cat blocks | while read l; do dd_rescue -m 1024 -s $(( $l * 4096 )) /dev/zero /dev/sdd1; done
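Before pointing something like this at a real partition, the loop can be rehearsed on a scratch file. The sketch below uses plain dd instead of dd_rescue and a made-up block list, so it only illustrates the seek-and-overwrite idea, not the exact command above.

```shell
# Scratch file standing in for the partition: 16 blocks of 4096 bytes (made up).
block_size=4096
image=$(mktemp)
head -c $(( 16 * block_size )) /dev/urandom > "$image"

# Made-up block list, one block number per line (like the "blocks" file).
blocks=$(mktemp)
printf '%s\n' 3 4 9 > "$blocks"

# Overwrite each listed block with zeros; conv=notrunc leaves the rest untouched.
while read -r blk; do
  dd if=/dev/zero of="$image" bs=$block_size seek="$blk" count=1 conv=notrunc 2>/dev/null
done < "$blocks"

# Check: block 3 must now equal a block of zeros, and the size must be unchanged.
zero_sum=$(head -c $block_size /dev/zero | cksum)
blk3_sum=$(dd if="$image" bs=$block_size skip=3 count=1 2>/dev/null | cksum)
image_size=$(wc -c < "$image" | tr -d ' ')
echo "zero block:  $zero_sum"
echo "block 3:     $blk3_sum"
echo "image size:  $image_size"
rm -f "$blocks"
```

Working on a clone (as done here with /dev/sdd) is what makes a destructive step like this tolerable in the first place.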

Then I started e2fsck again…
