Wednesday, November 11, 2015

Simple Linux boot repair by loading GRUB stage 1 from a file.

When booting on a PC with a legacy BIOS and MBR partitioning, Linux has a problem. At the start of the boot process, the BIOS loads only the first sector of the drive. That's just 512 bytes, and because the same sector also holds the partition table, only 446 bytes are available for code that could continue boot-up. The standard MBR code loads the first sector of the active partition, which is once again only 512 bytes.
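To make the byte counts concrete, here is a minimal Python sketch that splits a 512-byte MBR sector into its parts. The 446-byte code area comes from the text above. The four 16-byte partition entries and the 0x55 0xAA signature at the end are the classic MBR layout. The function name split_mbr is just something I made up for illustration.

```python
SECTOR_SIZE = 512
CODE_SIZE = 446        # room for boot code, as described above
ENTRY_SIZE = 16        # one partition table entry
NUM_ENTRIES = 4        # classic MBR has four primary entries
TABLE_END = CODE_SIZE + ENTRY_SIZE * NUM_ENTRIES   # 510


def split_mbr(sector: bytes):
    """Split an MBR sector into (boot code, partition table, signature)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR sector is exactly 512 bytes")
    code = sector[:CODE_SIZE]
    table = sector[CODE_SIZE:TABLE_END]
    signature = sector[TABLE_END:]      # normally b"\x55\xaa"
    return code, table, signature
```

446 + 4 * 16 + 2 = 512, which is why so little room is left for a boot loader.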

That's not enough space for GRUB, and it's not even enough space for code which could load a file from ext4 or other complex file systems used by Linux. The default solution is to put some more code right after the first sector, in the free space between it and the first partition. The BIOS loads the stage 1 code in the MBR, that code loads the stage 1.5 code which follows, and then finally it can load stage 2 from the Linux file system.

The problem is that this scheme depends on GRUB stage 1 being in the MBR sector. When Windows and other operating systems are installed, they can overwrite stage 1 with their own code. The MBR code installed by Windows is reasonable: it boots whatever partition is marked active. However, if you installed GRUB in the MBR, the Linux partition has no boot code of its own, so marking it active won't help you.
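If you want to see which partition a standard MBR would try to boot, you can look at the first byte of each partition table entry. In the classic MBR layout, 0x80 in that byte marks the entry as active. This is only a sketch; the function name active_partitions is hypothetical, and it expects the raw 512 bytes of the MBR sector.

```python
def active_partitions(sector: bytes) -> list:
    """Return the 1-based numbers of partition entries flagged active."""
    if len(sector) != 512:
        raise ValueError("an MBR sector is exactly 512 bytes")
    table = sector[446:510]             # four 16-byte entries
    active = []
    for i in range(4):
        status = table[i * 16]          # first byte of the entry
        if status == 0x80:              # 0x80 means active (bootable)
            active.append(i + 1)
    return active
```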

There is plenty of information online on how to fix this by booting from removable media and re-installing GRUB. I'm proposing a much simpler fix: load GRUB stage 1 from a file, or write it to the MBR sector yourself. If you're running Windows, you can take GRUB's stage1 file (GRUB Legacy) or boot.img file (GRUB 2), put it somewhere on the Windows partition, and set up the Windows boot loader to boot it. In Windows Vista and later, you can do this via bcdedit. In earlier versions, it's just a single line of text in C:\boot.ini, like C:\stage1="GRUB stage 1".

If you want to write stage 1 to the MBR sector yourself, you can do that with a disk editor or with dd in Cygwin. It's critical to overwrite only the first 446 bytes, so that you don't overwrite the partition table! With dd, that means using the arguments bs=446 count=1. (If you do overwrite the partition table, all is not lost. TestDisk should be able to recover it.) GRUB stage 1 has some code after the first 446 bytes which doesn't get copied, but that code is only needed when booting from floppy. It's also critical to understand where you're writing, because writing to the wrong place could corrupt a file system! It is easier and safer to boot stage 1 from a file and then, once booted into Linux, use grub-install to write the MBR.
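The same idea as the dd arguments can be sketched in Python: take the first 446 bytes from the stage 1 file and keep everything from byte 446 onward (partition table and signature) from the existing sector. The names patch_boot_code and patch_image, and any paths you pass in, are hypothetical. I'd only try this on a disk image file first, not on a real drive. One thing I'm assuming you can live with: Windows keeps a 4-byte disk signature at offset 440, inside those 446 bytes, and this sketch overwrites it just like dd with bs=446 would.

```python
CODE_SIZE = 446     # boot code area; the partition table starts right after it
SECTOR_SIZE = 512


def patch_boot_code(mbr: bytes, stage1: bytes) -> bytes:
    """Return a new MBR sector: boot code from stage1, the rest from mbr."""
    if len(mbr) != SECTOR_SIZE:
        raise ValueError("expected a 512-byte MBR sector")
    if len(stage1) < CODE_SIZE:
        raise ValueError("stage 1 image is shorter than 446 bytes")
    return stage1[:CODE_SIZE] + mbr[CODE_SIZE:]


def patch_image(image_path: str, stage1_path: str) -> None:
    """Patch the first sector of a disk image file in place."""
    with open(stage1_path, "rb") as f:
        stage1 = f.read(SECTOR_SIZE)
    with open(image_path, "r+b") as disk:
        mbr = disk.read(SECTOR_SIZE)
        disk.seek(0)                    # go back to the start of the image
        disk.write(patch_boot_code(mbr, stage1))
```

For example, patch_image("disk.img", "boot.img") would update a copy of the drive's first sector saved as disk.img, leaving its partition table untouched.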

On a system with multiple operating systems, I prefer installing GRUB in the Linux partition. In that situation, grub-install will complain about blocklists, and --force will be needed to make it install. The problem here is that the location of the file to load next is hard-coded in the boot sector, and if that file moves you can't boot anymore. It hasn't been a problem in practice when booting normally from a Linux partition. It is however a problem when using EasyBCD, because it makes a copy of the Linux boot sector which becomes out of date. My solution there is to chain load the Linux boot sector.