
4. Designing The Cluster

Although the purpose of this document is to cover Beowulf cluster installation and administration, it is important to look at some design aspects and make a few design decisions before installing a production system. You will have to choose a CPU family and speed, the amount of memory for the server and client nodes, and the disk size and storage configuration. I will cover what I think is the best choice in detail, and briefly look at alternative designs.

4.1 Disk

There are at least four methods of configuring disk storage in a Beowulf cluster. These configurations differ in price, performance, and ease of administration. This HOWTO will mainly cover the disk-less client configuration, as it is the author's preferred disk storage configuration.

Disk-less clients

In this configuration the server serves all files to disk-less clients. The main advantage of a disk-less client system is the flexibility it gives you when adding new nodes and administering the cluster. Since the client nodes do not store any information locally, adding a new node only requires modifying a few files on the server, or running a script which will do so for you. You do not have to install the operating system or any other software on any node but the server. The disadvantages are increased NFS traffic and a slightly more complex initial setup, although I have written scripts which automate this task. If you choose a disk-less client configuration you will need to boot the clients either off a floppy disk or off a boot-ROM Ethernet card. Of the disk storage configurations described in this section, the disk-less client configuration is the closest to the Beowulf architecture as defined in the Beowulf HOWTO. If you do have spare cash and want to buy a disk for each client node, you can still use the disk-less client disk storage configuration, but use scripts to replicate the OS onto the local client disks. This gives you the flexibility of the disk-less client configuration, reduced NFS traffic, and a local swap area.
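
As a rough sketch of what the server side of this looks like (the paths, host names, and options below are examples only, not the scripts mentioned above, and the exact option syntax depends on your NFS server version), the server's /etc/exports might export a per-node root file system plus the shared directories like this:

     # /etc/exports on the server (example paths and host names)
     /tftpboot/node2   node2(rw,no_root_squash)   # root file system for node2
     /tftpboot/node3   node3(rw,no_root_squash)   # root file system for node3
     /usr              node2(ro) node3(ro)        # shared, read-only /usr
     /home             node2(rw) node3(rw)        # users' home directories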

How do the disk-less client nodes boot?

In a disk-less client configuration the client nodes do not know anything about themselves, so how do they run and know what to do? Let us look at an example where a new node is being booted in a Beowulf cluster. When the power is turned on, the client node boots from either a floppy disk or an EPROM on the Ethernet card and broadcasts a RARP (Reverse Address Resolution Protocol) request asking for its IP address, or "Who am I?". The server replies with the IP address, or "Your name is node64 and your address is 10.0.0.64." The new client continues booting, configures its Ethernet interface, and then mounts its NFS file systems using the paths provided by the server. The root file system can be mounted on a RAM disk, but in most cases it too is mounted as an NFS file system. One of the last things the client node does during its boot procedure is tell the server node that it is ready to do work. The server node records this information and can use the new client node for computations. From then on the client node is controlled by the server and does what it is told to do.
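
For the server to answer these RARP requests, it needs to know which IP address belongs to which Ethernet card. A minimal sketch, assuming the in-kernel RARP support of 2.0 series kernels and made-up hardware addresses, would add static mappings with the rarp command (the host names must already resolve, for example via /etc/hosts):

     # add static entries to the kernel RARP table (example hardware addresses)
     rarp -s node2 00:60:08:AA:BB:02
     rarp -s node3 00:60:08:AA:BB:03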

Fully local install

The other extreme is to have everything stored locally on each client. With this configuration you will have to install the operating system and all the software on each of the clients. The advantage of this setup is that there is no NFS traffic; the disadvantage is a very complicated installation and maintenance. Maintenance of such a configuration can be made easier with shell scripts and utilities such as rsync, which can update all the file systems, as in the sketch below.
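
As a sketch of such an update (the host name and exclude list are examples only), the master copy of the installation could be pushed from the server to one client with rsync:

     # push the master copy to node2 over the network, preserving permissions
     # and ownership, deleting files which no longer exist on the server, and
     # leaving machine-specific areas alone
     rsync -a --delete --exclude /proc --exclude /etc/sysconfig/network \
           / node2:/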

Standard NFS Configuration

The third choice is a halfway point between the disk-less client and fully local install configurations. In this setup clients have their own disks with the operating system and swap space stored locally, and only mount /home and /usr/local off the server. This is the most commonly used configuration of Beowulf clusters. There are a few methods of installing the operating system on the client nodes, and these will be described in more detail later.
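
On each client the shared directories can then be mounted at boot time from /etc/fstab; a minimal sketch (the device names and server name are illustrative) might look like this:

     # /etc/fstab on a client node (example devices and server name)
     /dev/hda1          /            ext2    defaults    1 1
     /dev/hda2          swap         swap    defaults    0 0
     node1:/home        /home        nfs     defaults    0 0
     node1:/usr/local   /usr/local   nfs     defaults    0 0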

Distributed File Systems

Distributed File Systems are file systems which exist on multiple nodes. There are many different types of Distributed File Systems, and some are being ported to Linux. Work in this area is still quite experimental, and therefore I shall not talk about it in detail. If you think that you might want to use a Distributed File System in your Beowulf, you could start by reading Implementation and Performance of a Parallel File System for High Performance Distributed Applications http://ece.clemson.edu/parl/pvfs/pvfshpdc.ps

4.2 Memory

Amount

Choosing the right amount of memory is one of the most important tasks during the design of a Beowulf system. If you don't have enough memory to hold the jobs you are going to run, you will lose performance due to extensive swapping. Swapping is something you don't want to see! Every page which has to be read from the hard disk costs you valuable execution time, because reading from a hard disk is much, much slower than reading from RAM. I have seen huge jobs running on one of our Sparc servers which were spending 99.5% of the wall clock time swapping pages in and out of the hard drive, and only 0.5% doing actual calculations. Ideally you don't want to swap at all, but you should reserve some swap space just in case you need to run something bigger than planned.
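
A simple way to check whether your jobs fit in memory is to watch a node while a typical job runs, for example with:

     # show total, used, and free memory and swap on this node
     free

     # report memory and swap activity every 5 seconds; sustained non-zero
     # values in the "si" and "so" (swap in/out) columns mean the job is paging
     vmstat 5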

Speed

The speed of your memory is also very important. If you choose a fast CPU running on a fast bus, it is more than likely that memory will be the bottleneck within the nodes. I recommend using something like 16ns SDRAM.

4.3 CPU

Type

The choice of CPU should be made from two families: Intel x86 compatible and DEC Alpha systems. Other CPUs are supported by Linux, but I don't know of any Beowulf systems using an architecture other than Intel or Alpha. In general, Intel based systems are considered commodity systems because there are multiple sources (Intel, AMD, Cyrix) and they are obviously ubiquitous. The DEC Alpha, on the other hand, is a clear performance winner, but it is a single source part and can therefore be a bit hard to find at a good price.

It can be argued that Intel "slot based" systems are single source parts, but how this plays in the market is yet to be determined.

Within the Intel based systems, the Pentium Pro and PII have the best floating point performance and they are the only ones that support SMP motherboards. A current debate, which will be moot by the end of 1998, is whether to use the Pentium Pro or the PII CPU - the argument being that the PII cache runs at 1/2 clock while the Pentium Pro cache runs at full clock. The rule of thumb seems to be that the PII with SDRAM is about the same as the Pentium Pro at the same clock frequency. Your application may vary, but since the PII is now approaching 333 MHz the edge seems to be going to the PII. For a detailed discussion, see:

http://www.tomshardware.com/iroadmap.html

http://www.compaq.com/support/techpubs/whitepapers/436a0597.html

SMP

Symmetric Multiprocessor boards are commonly used in Beowulf clusters. The main advantages are a lower price per unit of performance, and greater communication speed between the two processors on the same board. The first advantage is very important if you are building a large cluster. By going dual CPU across the whole cluster, you will need only half as many network cards, cases, power supplies, and motherboards. The only cost increase is the more expensive SMP motherboard, but the savings far outweigh this increase.

Even if you decide to go for a single CPU per board, it might be worthwhile buying an SMP server. Our Topcat system is used by three users and the master node is where they edit, compile, and test their code. It is quite common for our master node to run at a load of 2 or higher, fully utilising both CPUs. Because the master node serves file systems to the client nodes, it is important that the NFS server has enough CPU cycles left to perform its duties. One trick you can use is to lower the nice level of nfsd, as shown below, but having spare CPU cycles is also important. If you think that your server node will be loaded by users, you should probably consider going for a fast SMP system.
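
As a sketch of the nfsd trick (the priority value is only an example), the running nfsd processes can be given a higher scheduling priority by lowering their nice value as root:

     # give all running nfsd processes a higher priority (a lower nice value)
     renice -5 -p `pidof nfsd`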

4.4 Network

Hypercube

A hypercube is a network topology in which the vertices of a hypercube are represented by nodes and the edges by network connections. Because of the ever-dropping prices of 100 Mbps switches, the hypercube is no longer an economical network topology.

10/100 Mbps Switched Ethernet

100 Mbps switched, full duplex Ethernet is the most commonly used network in Beowulf systems, and gives almost the same performance as a fully meshed network. Unlike standard Ethernet, where all computers connected to the network compete for the cable and cause packet collisions, switched Ethernet provides dedicated bandwidth between any two nodes connected to the switch. Before you purchase Fast Ethernet network cards for your cluster, you should check the Linux Network Drivers page at http://cesdis.gsfc.nasa.gov/linux/drivers/index.html

In most cases, nodes in a Beowulf cluster use private IP addresses. The only node which has a "real" IP address, visible from the outside world, is the server node. All other nodes (clients) can only see nodes within the Beowulf cluster. An example of a five-node Beowulf cluster is shown below. As you can see, node1 has two network interfaces, one for the cluster and one for the outside world. I use the 10/8 private IP range, but others such as 172.16/12 and 192.168/16 can also be used (please see RFC 1918 http://www.alternic.net/nic/rfcs/1900/rfc1918.txt.html)

A simple, five-node Beowulf cluster might look something like this:

     Your LAN       
                   |        
                   | eth0 123.45.67.89       
                   | 
                [node1]
                   | 
                   | eth1 10.0.0.1
      Cluster      |
                   |
                   |
                 ______
   10.0.0.2     /      \      10.0.0.5
[node2]---------|SWITCH|---------[node5]
                \______/
                 |    |
                 |    |
        10.0.0.3 |    | 10.0.0.4
            [node3]  [node4]
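
The /etc/hosts file on the server for the cluster shown above might then look something like this (the external address and domain name are only placeholders):

     # /etc/hosts on node1 (example addresses and domain)
     127.0.0.1      localhost
     123.45.67.89   node1.your.domain    # eth0 - outside world
     10.0.0.1       node1                # eth1 - cluster
     10.0.0.2       node2
     10.0.0.3       node3
     10.0.0.4       node4
     10.0.0.5       node5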

4.5 Which Distribution ?

The most commonly used Linux distribution on Beowulf systems is Red Hat Linux. It is simple to install and is available via FTP from the Red Hat FTP server ( ftp://ftp.redhat.com) or any of its mirrors. All information which follows is based on the Red Hat Linux 5.2 distribution, which is current at the time of writing this HOWTO. If you are using Debian, Slackware, or any other Linux distribution you will not be able to follow my instructions verbatim.

One of the main advantages of Red Hat Linux is the fact that it uses RPM (Red Hat Package Manager) for installation, upgrade, and removal of all packages. It is therefore very easy to install, uninstall, or upgrade any package. Some of the Beowulf software like PVM, MPI, and the Beowulf kernel is also available in RPM format.
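
For example, a package can be installed, upgraded, or removed with a single command each (the package file name below is purely illustrative):

     # install a new package
     rpm -ivh mpich-1.1.2-1.i386.rpm

     # upgrade an already installed package to a newer version
     rpm -Uvh mpich-1.1.2-1.i386.rpm

     # remove (uninstall) a package
     rpm -e mpich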

NOTE: The Original Extreme Linux CD is now considered OBSOLETE. Other than documentation, the RPMs on the CD should not be used.

4.6 Buying Hardware

The following recommendations apply to any hardware purchase. Building a Beowulf, however, requires extra care, because hardware decisions are repeated across all your nodes and a mistake can be costly.

The Problem with commodity hardware:

Here is a great way to make money - buy a 166 MHz CPU, change the markings on it to read 233 MHz, and sell it for several hundred dollars more. Or, sell inferior DRAM as high quality DRAM. Because of the commodity nature of the business, users expect "plug and play" components. Unfortunately, the amount of money that can be made by "re-marking" components provides a large incentive for the dishonest hardware vendor. Another way to make money is to build motherboards with low quality electrolytic capacitors. Bargain basement "no name" motherboards are often built with these parts. The manufacturer can save $20-$30 per board, but in 1-2 years the board may be worthless.

Although CPUs can be over-clocked, slow DRAMs can sometimes work, and cheap motherboards will work for a while, these parts are not operating within the manufacturer's specifications and can create problems. Purchasing these types of components may eventually lead you to "re-stocking fees" (a percentage of the total cost, at least 15%, kept by the hardware vendor if you return a product) and a whole plethora of hardware problems.

The Solution:

First, prices for hardware that seem too good to be true probably are. Stay away from small vendors with "too good to be true" prices unless they have been in business for several years and meet all of the other requirements. Second, insist on at least three things from your hardware vendor:

  1. At a minimum, a 3-4 year warranty on all CPUs and DRAM. A lifetime warranty is better, but in reality after 3-4 years the part will probably no longer be manufactured, or you will not really care. Any hardware vendor that sells good quality parts will offer a warranty.
  2. Do not use a vendor that has "re-stocking fees". These fees allow a vendor to sell you low quality parts and then charge you at least 15% of the cost if you have a problem!
  3. Ask the vendor if they are authorized dealers for any of the products they sell. (Authorized Intel dealers have an identification number.) If they are not, they are probably buying parts from wherever they can, which means they may have no idea about the quality or history of the parts.

Finally, stay away from "no-name" clone products like motherboards, video cards, network controllers, etc. The few dollars you save may cost you weeks of problems later on. Indeed, building a Beowulf makes the temptation greater, because now you are multiplying any small savings by the number of boxes you will buy (i.e. "Do I buy 16 of these no-name NIC cards for $55 each, or do I buy 16 of these Major Brand cards for $75 each?")

These are just a few things to consider when buying hardware. Although the PC market has produced "standardization" and thus competition, it has also allowed inferior parts to be misrepresented and sold to end users. Caveat Emptor (Buyer Beware).

