Last updated: Sep 17, 2002
NEW: RPM packages for RedHat Linux 7.3!
It seemed to me that the best way to run a diskless node was to put the root file system (which must be mounted read-write) in a RAM disk, and to use NFS to mount the remaining file systems read-only. Nonetheless, after extensive searching I was not able to find any single article that completely described this technique, although all the information in this article was gathered from various disparate sources on the Internet, in the Linux kernel documentation and in print publications. I have tried to provide pointers to the original sources where I could.
In fact, what my searches revealed to me is that there is a plethora of different ways to network boot, and once booted to run diskless. There are even compelling reasons for network booting if you do have local hard drives. The following table shows some of the possibilities.
Function | this article describes | some alternatives |
---|---|---|
network boot ROM | PXE | Etherboot, Netboot |
network boot loader | pxelinux | bpbatch |
root file system | RAM disk | NFS root, local hard drive |
If you choose one option from each row of the table above, there are at least eighteen different combinations. Even though only a subset of these will actually work (nine of them, I think), that is still far more than I want to cover in this article, so I have settled on the options shown in the second column. I'll try to provide pointers to the others in the resources section below. An excellent article by Richard Ferri, "Remote Linux Explained" which appeared in the January, 2002 issue of the Linux Journal, covers some of the other possibilities, notably NFS roots and Etherboot ROMs.
Most of the discussion in this article is aimed at the Beowulf community (e.g. I refer to the network booting computers as "nodes" in what follows), although the technique is described in enough generality that I hope it will find wider applications. One of my personal favorites is to turn old PCs into X terminals, which can usually be done for the cost of a NIC with boot ROMs and some additional memory for a RAM disk to hold the root file system.
Back in the Good Old Days of low-numbered RFCs, companies like NCD worked out that the best way to boot a diskless machine (in the case of NCD, an X-Terminal) was to have it run BOOTP to obtain an IP address and the name of a boot image file from a BOOTP server, then use TFTP to download the boot image from a possibly different server and launch itself from there. Not a whole lot has changed since then except that BOOTP has been expanded and renamed DHCP, and the process of getting a bootable image to the client is done in two steps on PCs: first a boot loader is downloaded and started, which will then download and start the operating system image.
In the case of a PC, the important thing is that the NIC must be able to identify itself as a bootable device to the motherboard BIOS, and if chosen as the boot device, it must be able to download the boot loader and start it. This is not typically something that a run-of-the-mill NIC can do (see the resources section below for a list of some that do). In particular, if you want to boot from the network, you must choose a NIC that has some sort of boot ROM built in, burn your own boot ROM and install it on the NIC, or boot from a floppy or a local hard disk which contains the boot ROM image.
This article discusses network booting using a PXE boot ROM. PXE stands for "Preboot eXecution Environment", and is a result of Intel's "Wired for Management" initiative. It is a boot ROM standard that is becoming increasingly popular, and several vendors, including Intel and 3Com, are offering products which implement it on their NICs. There is also a project to develop an open-source PXE implementation called NILO, for "Network Interface Loader".
PXE works as follows: if the NIC is chosen by the motherboard BIOS as the boot device, it broadcasts DHCP requests and waits for a response from a server that contains PXE extensions. If it receives such a response, then the NIC assumes that the boot loader file specified in the response can run under PXE, and it will download the boot loader and start it. Before transferring control to the boot loader, the PXE ROM will also put the network parameters from the DHCP response into a known location in memory where the boot loader has access to them. The boot loader will use this information to start a second round of TFTP to download a bootable image from the server and start it. If the bootable image is a Linux kernel, then you have successfully booted from the network.
There are actually four different ways the kernel can configure its interface. Three of them are the familiar autoconfiguration protocols, namely DHCP, BOOTP and RARP. The fourth is to have the boot loader pass the IP parameters directly to the kernel as a kernel parameter. It should be noted that the kernel knows absolutely nothing about PXE, and therefore it does not have access to the data structures in memory where the PXE ROM stashed its network configuration parameters from the DHCP response (those were lost when the boot loader passed control to the kernel proper). Therefore, if you choose an autoconfiguration protocol such as DHCP for kernel-level configuration, it will cause the kernel to start a second round of DHCP requests (which should get the same response as the first since the MAC address is the same for the kernel as it was for the PXE ROM).
The way you get the kernel to use a RAM disk root file system is to exploit the "initial RAM disk" feature of the kernel. This feature is fully described in /usr/src/linux/Documentation/initrd.txt. Briefly, the way it works is the boot loader hands the kernel a compressed file system image which the kernel expands and mounts as root. Then the kernel looks for a script called /linuxrc and will run it if it exists. This script would normally be used to load kernel modules, although it could be used for anything. Once the script finishes, the kernel would unmount the RAM disk and then proceed with the normal boot up process. If this script is missing, the RAM disk will remain mounted as root and the kernel will continue with the boot up procedure from there. This is how you can get your box to run out of a RAM disk: if there is no /linuxrc script in the initial RAM disk then it will become a permanent RAM disk.
To start with, you will need a server. The server doesn't boot from the network, but it provides services that allow other computers to do so. These services are a PXE-extended DHCP and a TFTP that understands the TSIZE option.
In addition, you must configure dhcpd to use the PXE extensions and pass the boot loader to the clients. Here's an example /etc/dhcpd.conf configuration file that does exactly this:

    # DHCP configuration file for DHCP ISC 3.0

    ddns-update-style none;

    # Definition of PXE-specific options
    # Code 1: Multicast IP address of boot file server
    # Code 2: UDP port that client should monitor for MTFTP responses
    # Code 3: UDP port that MTFTP servers are using to listen for MTFTP requests
    # Code 4: Number of seconds a client must listen for activity before trying
    #         to start a new MTFTP transfer
    # Code 5: Number of seconds a client must listen before trying to restart
    #         a MTFTP transfer

    option space PXE;
    option PXE.mtftp-ip               code 1 = ip-address;
    option PXE.mtftp-cport            code 2 = unsigned integer 16;
    option PXE.mtftp-sport            code 3 = unsigned integer 16;
    option PXE.mtftp-tmout            code 4 = unsigned integer 8;
    option PXE.mtftp-delay            code 5 = unsigned integer 8;
    option PXE.discovery-control      code 6 = unsigned integer 8;
    option PXE.discovery-mcast-addr   code 7 = ip-address;

    subnet 192.168.1.0 netmask 255.255.255.0 {

      class "pxeclients" {
        match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
        option vendor-class-identifier "PXEClient";
        vendor-option-space PXE;

        # At least one of the vendor-specific PXE options must be set in
        # order for the client boot ROMs to realize that we are a PXE-compliant
        # server.  We set the MCAST IP address to 0.0.0.0 to tell the boot ROM
        # that we can't provide multicast TFTP (address 0.0.0.0 means no
        # address).
        option PXE.mtftp-ip 0.0.0.0;

        # This is the name of the file the boot ROMs should download.
        filename "pxelinux.0";

        # This is the name of the server they should get it from.
        next-server 192.168.1.1;
      }

      pool {
        max-lease-time 86400;
        default-lease-time 86400;
        range 192.168.1.2 192.168.1.254;
        deny unknown clients;
      }

      host node1 {
        hardware ethernet fe:ed:fa:ce:de:ad;
        fixed-address 192.168.1.2;
      }

      host node2 {
        hardware ethernet be:ef:fe:ed:fa:ce;
        fixed-address 192.168.1.3;
      }

      [...]
    }

The above configuration assumes that you want your computers always to come up with the same IP address. If you are completely indifferent (e.g. you have a cluster of identical nodes and you just don't care which is which), you can replace the line "deny unknown clients" with "allow unknown clients" and remove all of the "host" entries. In this case, you should make sure that the "range" of IP addresses is somewhat larger than the number of nodes, since nodes may go down without releasing their leases, and the server won't reap them again until they expire (after one day in the configuration above). In addition, all the nodes must run a daemon to manage the network interface (e.g. pump) so that their DHCP leases will be renewed before they expire.
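With more than a handful of nodes, typing the "host" entries by hand gets tedious. The following is only a sketch of one way to generate them: the function name gen_host_entries and the three-column input format (name, MAC address, IP address) are my own assumptions, not part of dhcpd. Its output is meant to be reviewed and pasted into the subnet block of /etc/dhcpd.conf.

```shell
#!/bin/sh
# Hypothetical helper (not part of dhcpd): print one dhcpd.conf "host"
# stanza per input line. Each input line holds three whitespace-separated
# fields: node name, MAC address, IP address.
gen_host_entries() {
    while read -r name mac ip; do
        # Skip blank lines and comment lines.
        case "$name" in
            ''|'#'*) continue ;;
        esac
        printf 'host %s {\n' "$name"
        printf '  hardware ethernet %s;\n' "$mac"
        printf '  fixed-address %s;\n' "$ip"
        printf '}\n'
    done
}

# Example: the two nodes used in this article.
gen_host_entries <<'EOF'
node1 fe:ed:fa:ce:de:ad 192.168.1.2
node2 be:ef:fe:ed:fa:ce 192.168.1.3
EOF
```

Keeping the node list in a plain text file and regenerating the stanzas from it also gives you a single place to record which MAC address belongs to which node.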
tftp-hpa, a modified version of the standard BSD TFTP daemon, works just fine (the "-hpa" suffix stands for H. Peter Anvin, who is also the author of the syslinux and pxelinux programs). The latest version is probably the best. Installing it is the usual routine: download the compressed tarball, uncompress and untar it, run the configure script, run make and then make install. In addition, in.tftpd is always run by a meta-daemon, either inetd (e.g. RedHat Linux versions 6.2 and earlier) or xinetd (e.g. RedHat Linux versions 7.0 and later). You will need to make sure that your meta-daemon is configured to provide the TFTP service and to use the right version of the in.tftpd daemon (i.e. the one you just compiled). The relevant files are /etc/inetd.conf for inetd, and /etc/xinetd.d/tftp for xinetd. Whichever meta-daemon you use, it should be configured to invoke the TFTP daemon as follows

    in.tftpd -s /tftpboot

which tells in.tftpd to look in the /tftpboot directory for the files that clients try to download.
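For xinetd, a service entry along the following lines should produce that invocation. This is a sketch rather than a tested recipe: the function name tftp_xinetd_entry is made up for illustration, and the daemon path is an assumption that depends on where make install put in.tftpd on your system. The function only prints the entry, so you can inspect it before saving it as /etc/xinetd.d/tftp.

```shell
#!/bin/sh
# Print an xinetd service entry that runs in.tftpd with "-s <directory>".
# ASSUMPTIONS: $1 is the full path to the in.tftpd you compiled (the
# default below is only a guess); $2 is the directory to serve, which
# defaults to /tftpboot as in this article.
tftp_xinetd_entry() {
    server=${1:-/usr/sbin/in.tftpd}
    root=${2:-/tftpboot}
    cat <<EOF
service tftp
{
    socket_type = dgram
    protocol    = udp
    wait        = yes
    user        = root
    server      = $server
    server_args = -s $root
    disable     = no
}
EOF
}

# Example: a daemon installed under /usr/local.
tftp_xinetd_entry /usr/local/sbin/in.tftpd
```

After saving the entry, xinetd has to be restarted or told to reload its configuration before the change takes effect.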
The PXE boot ROM passes control to pxelinux after it has already obtained an IP address for itself and the boot server (otherwise, how could it have downloaded pxelinux in the first place?). After it is started by the boot ROM, pxelinux has access to these values in a data structure left in a known location in memory by the boot ROM. Therefore, the first thing pxelinux tries to do is download a configuration file corresponding to the boot client's IP address from the boot server. This configuration file contains the name of the boot image (i.e. the Linux kernel) that pxelinux should download and any kernel parameters that pxelinux should give to it. In addition, if one of these kernel parameters specifies an initial RAM disk for the kernel, pxelinux will download the compressed file system image before starting the kernel.
The pxelinux configuration file for a boot client is found in the directory /tftpboot/pxelinux.cfg and given a name which is the client's IP address in hexadecimal. If the file does not exist when pxelinux tries to download it, it will remove the last octet and try again, repeating until it runs out of octets. For example, if the client was assigned address 192.168.1.2, then it will try to download the following configuration files from the boot server

    /tftpboot/pxelinux.cfg/C0A80102
    /tftpboot/pxelinux.cfg/C0A801
    /tftpboot/pxelinux.cfg/C0A8
    /tftpboot/pxelinux.cfg/C0

stopping at the first one that succeeds, and giving up if the last one fails. In the case of a network of identical nodes, this allows you to set up a single configuration file for the whole lot of them.
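Working out the hexadecimal file names by hand is error-prone, so here is a small sketch that does the conversion. The function names ip_to_hex and pxelinux_candidates are hypothetical, and the search order simply follows the description above (one octet dropped per step); the version of pxelinux you run may search differently, so check its documentation before relying on the shorter names.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to eight uppercase hex digits.
ip_to_hex() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

# List the configuration file names in the order described in this
# article: the full address first, then one octet shorter each time.
pxelinux_candidates() {
    hex=$(ip_to_hex "$1")
    while [ -n "$hex" ]; do
        printf '/tftpboot/pxelinux.cfg/%s\n' "$hex"
        hex=${hex%??}    # drop the last octet (two hex digits)
    done
}

pxelinux_candidates 192.168.1.2
```

For the address 192.168.1.2 this prints the same four paths listed above.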
The contents of the pxelinux configuration files look something like the following:

    DEFAULT linux
    APPEND initrd=rootfs.gz root=/dev/ram rw ip=192.168.1.2:192.168.1.1:192.168.1.1:255.255.255.0:node1:eth0:off

The DEFAULT line gives the name of the bootable image file; in this case the client will expect the file /tftpboot/linux on the server to contain a compressed Linux kernel image. The APPEND line is a list of parameters passed to the kernel when it boots. The example above deserves some scrutiny.

    rdev linux /dev/ram0

on the kernel to set the default root file system appropriately.
Alternatively, you can replace the whole thing with ip=dhcp. Then the kernel will begin a second round of DHCP (the first was done by the PXE ROMs before the kernel was downloaded) to obtain its network parameters. It only really makes sense to do this when you want to use the same configuration file for all the nodes on a network (i.e. a file which has a three-octet name such as C0A801). Conversely, if you have a unique file for every host, then the name of the file is the IP address of the host in hexadecimal (see above), so you might as well include the IP address in the contents of the file and save the boot client some trouble.
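To make the two choices concrete, the sketch below builds either form of the ip= parameter and prints a complete pxelinux configuration around it. The function names static_ip_param and node_config are invented for this example; the field order (client, server, gateway, netmask, host name, device, autoconfiguration) is the one used in the APPEND line shown earlier, and the eth0 and off defaults are taken from that same line.

```shell
#!/bin/sh
# Build a fully specified ip= kernel parameter. Arguments, in order:
# client IP, server IP, gateway IP, netmask, host name, and optionally
# the device (default eth0) and autoconfiguration method (default off).
static_ip_param() {
    printf 'ip=%s:%s:%s:%s:%s:%s:%s\n' \
        "$1" "$2" "$3" "$4" "$5" "${6:-eth0}" "${7:-off}"
}

# Print a pxelinux configuration file that uses the given ip= parameter.
node_config() {
    printf 'DEFAULT linux\n'
    printf 'APPEND initrd=rootfs.gz root=/dev/ram rw %s\n' "$1"
}

# Per-host file with everything spelled out:
node_config "$(static_ip_param 192.168.1.2 192.168.1.1 192.168.1.1 255.255.255.0 node1)"

# Shared file for a whole subnet, leaving the details to DHCP:
node_config ip=dhcp
```

The first call reproduces the example configuration for node1, and the second yields a file suitable for a shared name such as C0A801.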

    APPEND root=/dev/nfs nfsroot=192.168.1.1:/export/root/node1,rw ip=[...]

ought to do it (the ip= parameter is just the same as above). Warning: make sure the NFS server isn't "root squashing" on this volume or else it will be worthless as a root file system.
To configure the kernel-level configuration facility, the "menuconfig" option to select is in the "Networking options" submenu and is called "IP: kernel-level configuration support". You also get to select which protocols to support (DHCP, BOOTP and RARP). If you plan to use the ip= kernel parameter to specify an autoconfiguration protocol then make sure you have enabled the same protocol in the kernel configuration before you compile. If you plan to use the ip= kernel parameter to fully specify the networking parameters, then you can choose to enable none of the autoconfiguration protocols.
To use a RAM disk root file system, you must enable "RAM disk support" in "Block devices", and you must increase the default RAM disk size to something reasonable like 64 megabytes (remember, it has to hold the whole root file system). An alternative to increasing the default RAM disk size is to add the ramdisk= or ramdisk_size= kernel parameter to the APPEND line, giving it a value which is the actual size of the RAM disk in kilobytes. Also, be sure to enable "Initial RAM disk (initrd) support". This option only becomes visible after you have selected RAM disk support to be compiled into the kernel.
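Since the parameter wants kilobytes while most of us think in megabytes, here is a trivial sketch of the conversion. The function name ramdisk_size_param is hypothetical, and it assumes 1 megabyte means 1024 kilobytes.

```shell
#!/bin/sh
# Print a ramdisk_size= kernel parameter for a RAM disk of $1 megabytes.
# The kernel parameter is given in kilobytes, so multiply by 1024.
ramdisk_size_param() {
    printf 'ramdisk_size=%d\n' "$(( $1 * 1024 ))"
}

# A 64 megabyte RAM disk, as suggested above.
ramdisk_size_param 64
```

For 64 megabytes this gives ramdisk_size=65536, which can be added to the APPEND line of the pxelinux configuration file.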
If you plan to use an NFS root file system, then you should also enable "Root file system on NFS" in the "Network File Systems" menu under the "File Systems" menu. This option only becomes visible after you have enabled "IP: kernel-level configuration support".
By and large, I figure you might as well turn off loadable module support completely, since there really won't be room in the tiny root file system that fits in the RAM disk for a lot of them anyway. But I'm sure there are folks out there just waiting to prove me wrong, so I won't tempt them by making a stronger statement on the subject. Suffice it to say that when I boot a node from the network, it gets a bare bones kernel with just the drivers it needs and no loadable module support.
Once you have your kernel configured, build the compressed bootable image with make bzImage and copy this file into /tftpboot. In the examples above I have assumed this file is called /tftpboot/linux.

    rpm --root /loop -e cruft-0.95.647-4.7pl1407beta

will remove the package named "cruft" from the file system rooted at /loop.
To make life easier for myself, I wrote a makefile as an aid for developing a bare bones root file system:

    mount: rootfs
    	mount -o loop rootfs /loop || true

    newfs:
    	dd if=/dev/zero of=newfs bs=1k count=65536
    	mke2fs -F -L ROOT newfs < /dev/null

    copyfs: newfs mount
    	mount -o loop newfs /loop1
    	cp -a /loop/* /loop1
    	umount /loop1

    umount:
    	umount /loop

    check: copyfs umount
    	mv newfs rootfs
    	e2fsck -f rootfs

    compress: check
    	gzip -c rootfs | dd of=/tftpboot/rootfs.gz

This makefile assumes the directories /loop and /loop1 exist, and /loop will be the mount point of the new root file system during development. The makefile implements the following strategy:

    make newfs
    mv newfs rootfs
    make

and you will have a pristine 64 megabyte ext2 file system mounted on /loop. You should cd into it and make a few directories

    cd /loop
    mkdir bin dev etc lib mnt proc root sbin tmp usr var

and then start copying in the files you will need. Determining exactly which files these are is a bit of an art and depends a lot on your application. Some guidelines can be found in the Linux Bootdisk HOWTO. You should plan on mounting /usr read-only from an NFS server (e.g. the cluster frontend). Since most of the important executables are located there, this considerably reduces the size required for the root file system. I have quite usable cluster nodes and X terminals with 64 megabyte root file systems that are half empty.
Once the root file system is fully populated with everything you absolutely need and nothing else (or you're just ready to try another iteration), cd back to the directory containing the makefile and run

    make compress

which will create the file /tftpboot/rootfs.gz. If you want to modify the root file system, just run make mount again and it will reappear under /loop.
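Before rebooting a node it can save time to confirm that the compressed image is intact and that its uncompressed size does not exceed the RAM disk size the kernel was given. The following is a sketch under those assumptions; the function name image_fits is made up, and the 65536 kilobyte limit in the example is the 64 megabyte size used throughout this article.

```shell
#!/bin/sh
# Succeed if the gzip-compressed image $1 passes gzip's integrity test
# and expands to at most $2 kilobytes; fail otherwise.
image_fits() {
    gzip -t "$1" || return 1
    # Arithmetic expansion strips any padding that wc adds to its output.
    bytes=$(( $(gzip -dc "$1" | wc -c) ))
    [ "$bytes" -le $(( $2 * 1024 )) ]
}

# Example: check the image produced by "make compress", if it exists.
if [ -f /tftpboot/rootfs.gz ]; then
    if image_fits /tftpboot/rootfs.gz 65536; then
        echo "rootfs.gz fits in a 64 megabyte RAM disk"
    else
        echo "rootfs.gz is damaged or too large" >&2
    fi
fi
```

An image larger than the RAM disk will generally fail only once the node is booting, where the error is harder to see, so a check on the server is the cheaper place to catch it.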
For example, on RedHat derived distros, the script /etc/rc.d/rc.sysinit will want to start swapping and fsck all the file systems before mounting them. Since we have no disk drives attached, we have no swapping and nothing to fsck. All of these operations need to be removed from this script before you put it in your RAM disk root file system. It's a chance to really read these scripts and understand everything that happens on boot up. Here's a brief outline of what RedHat runs on startup:
Multicast TFTP is an attempt to allow several clients to download the same file simultaneously through the use of multicast packets. Instead of forking a process for every TFTP client, a multicast server would accept requests from many clients and send file blocks to all of them at once by grouping them into a single multicast address. PXE includes support for multicast TFTP to allow for the possibility of many nodes booting simultaneously; however, tftp-hpa does not.
To avoid overloading the server in the absence of multicast TFTP, one can automate the process of booting nodes sequentially. The etherwake program by Donald Becker can be very helpful by allowing you to use the "Wakeup On LAN" feature available in most modern NICs.
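As a rough sketch of what such automation might look like, the function below wakes the nodes one at a time with a pause in between. Everything about it is an assumption to adapt: the function name wake_nodes, the WAKE_CMD and BOOT_DELAY variables, the 30 second default delay, and the idea that the wake-up tool is invoked as a command taking a MAC address as its only argument (the binary may be installed under a different name on your system, and it typically needs root privileges).

```shell
#!/bin/sh
# Wake the nodes whose MAC addresses are given as arguments, one at a
# time, sleeping between them so that only a few clients download from
# the TFTP server at once.
# ASSUMPTIONS: WAKE_CMD is a Wake-on-LAN tool called as "<cmd> <MAC>"
# (default: etherwake); BOOT_DELAY is the pause in seconds (default: 30).
wake_nodes() {
    for mac in "$@"; do
        "${WAKE_CMD:-etherwake}" "$mac" || echo "failed to wake $mac" >&2
        sleep "${BOOT_DELAY:-30}"
    done
}

# Dry run: substitute echo for the real tool to see what would be sent.
WAKE_CMD=echo BOOT_DELAY=0 wake_nodes fe:ed:fa:ce:de:ad be:ef:fe:ed:fa:ce
```

The right delay depends on how long one node takes to download its kernel and RAM disk image, so it is worth timing a single boot first.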
Package | Source RPM | i386 binary |
---|---|---|
ISC DHCP server v3.0 | dhcp-3.0pl1-1.src.rpm | dhcp-3.0pl1-1.i386.rpm |
tftp-hpa | tftp-hpa-0.30-1.src.rpm | tftp-hpa-0.30-1.i386.rpm |