Dipartimento di Informatica e Scienze dell'Informazione

GAMMA: The Genoa Active Message MAchine

What's New:
GAMMA module for Linux 2.6.24, more debug (updated 15 April 2009)

Index

What is GAMMA?
Download and install MPI/GAMMA, a porting of MPICH to the GAMMA messaging system
The GAMMA maintainers
Hardware requirements
Are GAMMA communications reliable?
Performance
Download and install
Known Bugs
Acknowledgements
The API of GAMMA

What is GAMMA?

A low latency, high throughput communication system for clusters of PCs
Supports both single and dual CPU processing nodes (Intel IA-32 or amd64 x86-64)
Runs on Gigabit Ethernet
SPMD parallel processing with message passing
Good programmability thanks to fairly high abstraction level
Reliable thanks to mechanisms for retransmission of missing packets
Implemented as a network device driver for Linux 2.6, and released under GNU GPL

The Typical drawback of a Gigabit Ethernet cluster is the poor performance of inter-process communication over the interconnect. Current implementations of industry-standard communication primitives, APIs, and protocols, usually show high communication latencies and sub-optimal communication throughput.

The Genoa Active Message MAchine (GAMMA) is a low-latency protocol for Gigabit Ethernet clusters running Linux. GAMMA runs on IA-32 processors (Intel Pentium, AMD K6, and superior models) and their 64-bit extensions (AMD Athlon64, AMD Opteron, Intel EM64T). Multi-CPU and multi-core nodes are supported.

The core of GAMMA is a custom Linux network device driver, which operates the Network Interface Card (NIC). The GAMMA driver allows low latency, high throughput inter-process communications, using a mechanism derived from Active Messages. Both point-to-point and broadcast communications are provided. Broadcast communication exploits the Ethernet broadcast natively.

The GAMMA driver is unable to manage standard IP traffic. Therefore, all IP services of the cluster must be supported by an additional LAN (an additional NIC on each cluster node, plus an additional hub/switch).

The communication mechanisms implemented in the GAMMA driver are made available to application writers through the GAMMA user library. The GAMMA library provides support to application launch, process grouping, point-to-point/broadcast communications based on the Active Ports mechanisms, and some collective routines (barrier synchronization, and broadcast).

GAMMA provides two levels of QoS. The lower one, corresponding to the fastest communications, is a best-effort service. With this service, network congestion and ``hot spots'' may cause the receiver NIC or even the LAN switch to loose packets by overrun. The other QoS level provides flow-controlled communication, ensuring reliability up to hardware faults, at a negligible performance penalty.

GAMMA is compiled as a Linux module for Linux 2.6.24. Therefore, the 2.6.24 kernel must be installed on all cluster nodes before loading the GAMMA module.

A porting of MPICH-1 atop GAMMA is available. See the MPI/GAMMA web page.

The GAMMA maintainers

Giuseppe Ciaccio

Hardware requirements

You need a cluster with IA-32 processors (Intel Pentium, AMD K6, and superior models) or their 64-bit extensions (AMD Athlon64, AMD Opteron, Intel EM64T). Multi-CPU and multi-core nodes are allowed.

For the main interconnect (namely, the one where GAMMA communications take place), each PC should have a Gigabit Ethernet card chosen in the following list:

Various NICs equipped with one of the Intel 8254x or 8257x chipsets
Various NICs (by Broadcom, Dell, 3COM) equipped with one of the Broadcom ``Tigon3'' chipsets.

You also need a Gigabit Ethernet switch to connect the nodes (or a back-to-back cable, for just two nodes).

An additional LAN, of any kind, is required to allow running IP traffic separate from GAMMA traffic. Without such additional LAN, all IP services (and especially NFS) would be disabled. GAMMA uses IP to spawn the processes of a parallel job, so the additional LAN is really necessary.

Are GAMMA communications reliable?

GAMMA provides two levels of Quality of Service (QoS). The basic one is a ``best-effort'' transmission, with no flow control, implemented by functions gamma_send() , gamma_send_2p() , gamma_isend(), gamma_isend_2p() . The other one is a ``flow controlled'' transmission, implemented by functions gamma_send_flowctl() , gamma_send_2p_flowctl() , gamma_isend_flowctl(), gamma_isend_2p_flowctl() .

In all cases, however, the GAMMA protocol implements policies for detecting packet losses and is able to retransmit missing packets. GAMMA is not particularly efficient in recovering from packet losses; indeed, we argue that the probability of a packet loss due to a hardware error is negligible, and the possibility of loosing packets due to LAN congestion is better managed by preventing congestion from occurring (for instance, by using smart communication patterns in parallel jobs). In other words, GAMMA does provide packet retransmission because it is of great help indeed, but does not take care of efficiency when retransmission occurs because this is supposed to be (made) rare.

Performance

Application-level performance comparisons: OpenFOAM 1.4.

And here, for the common folks, some latency-bandwidth numbers (taken at user level).

Definitions:

Message delay, D(S), is half the round-trip time with a message of size S bytes, as measured by running a simple ping-pong GAMMA microbenchmark.

Latency is the message delay D(Smin) where Smin is the smallest message size allowed by the communication system. GAMMA allows Smin = 0, TCP allows Smin = 1.

End-to-end throughput, T(S), is the transfer rate of the whole communication path, from the sender to the receiver:
T(S) = S/D(S).
Note: This definition implies that the throughput is measured using the ping-pong microbenchmark. Other techniques for throughput measurement are based on the transmission of long streams of messages from sender to receiver without any data flowing back; such techniques will measure a different thing, namely the transmission throughput (see below).

Asymptotic bandwidth, B, is the value of the following ``bandwidth'' function B(S) as the message size S > 0 approaches infinity: B(S) = (S - Smin)/(D(S) - D(Smin)). Since B(S) - T(S) approaches zero as S approaches infinity, B is commonly evaluated by measuring the end-to-end throughput T(S) with S very large.

Transmission throughput is the transfer rate perceived at the sender side, that is, the data rate at which an infinite stream of messages, of fixed size each, can be pushed into the network without causing any data loss. This requires to run a different microbenchmark compared to ping-pong: rather than exchanging a single message back and forth, the sender transmits a long stream of messages to the receiver and measures the time spent for the whole transmission.

Half-power point, is the message size H at which the throughput T(H) reaches half the asymptotic bandwith B. The value of H depend on which notion of throughput one has in mind. We refer here to the end-to-end throughput.

Two testbeds have been used for these measurements:

Testbed 1: a cluster of dual-Xeon 2.8 GHz nodes (32-bit architecture) with an Intel PRO/1000 adapter (hardwired onto the mainboard). This cluster is networked by a 28+4-way Extreme Network Summit 7i Gigabit Ethernet switch.
Testbed 2, courtesy of Tony Ladd (Chemical Engineering Univ. of Florida): a cluster of dual-Pentium D 3.0 GHz nodes (64-bit architecture) with a Broadcom NetXtreme BCM5721 adapter (hardwired onto the mainboard). This cluster is networked by an Extreme Network Gigabit Ethernet switch.

Here are the performance numbers (best among three runs of a ping-pong, each providing an average over 50 trials; GAMMA flow control enabled; performance at user level):

Testbed NIC Latency Latency Asymptotic bandwidth Half-power point Half-power point

(back to back) (switch incl.) (back to back) (switch incl.)

1 Intel PRO/1000 6.1 µs 10.8 µs 123.4 MByte/s 1820 byte 7600 byte

(MTU 4120 bytes)

2 Broadcom NetXtreme 14.6 µs ?? µs 119.4 MByte/s 3705 byte ?? byte

(MTU 1500 bytes)

Download and install

The requirements to install GAMMA are:

Cluster nodes with IA-32 processors (Intel Pentium, AMD K6, and superior models) or or their 64-bit extensions (AMD Athlon64, AMD Opteron, Intel EM64T). Multi-CPU and multi-core machines are allowed.
Gigabit Ethernet network adapters supported by GAMMA (one adapter on each PC). For a list of supported NIC families, see the Hardware requirements.
a Gigabit Ethernet switch (a ``back-to-back'' connection can be arranged for just two machines)
an additional LAN, for standard IP traffic; this is necessary, because GAMMA uses IP when launching parallel jobs, and the GAMMA driver itself is unable to support IP services.
Linux 2.6.24

If you plan to install GAMMA, please read the README_GAMMA file before beginning the installation. The GAMMA source distribution itself contains a quite complete manual for installation and troubleshooting. Should you need additional information (e.g. you find the installation instructions unclear or incomplete) contact the maintainer, Giuseppe Ciaccio, ciaccio AT disi.unige.it.

Here is the current release of GAMMA:

GAMMA for Linux 2.6.24, dated 15 April 2009.
Supported NICs:
- NICs equipped with one of the Intel 8254x or 8257x chipsets
- NICs equipped with one of the Broadcom ``Tigon3'' chipsets.

Known Bugs

GAMMA allows more process instances of the same parallel job to run on the same CPU. Thread safety is granted with point-to-point communications when running on distinct GAMMA ports. However, collective routines (barrier synchronization, and broadcast) are not thread compatible, as they use predetermined GAMMA ports.

The packet retransmission mechanism is not perfect. It will not work if the missing packet originates from a non-blocking send.

Acknowledgements

Over time, some of the GAMMA users contributed much to improve GAMMA usability and stability. I hereby acknowledge invaluable contribution from the following individuals or companies:

OpenCFD, Ltd (UK)
Financial contribution, an interesting parallel application to play with (OpenFOAM), plus some access to their hardware resources.
Tony Ladd, Professor, Chemical Engineering Univ. of Florida (USA)
A scientist, but also a great computer geek! Lots of ideas, bug reports, and pieces of code.
Mike Leuthold, Institute of Atmospheric Physics, Univ. of Arizona (USA)
Access to his fairly large cluster, a nice weather forecast testbed (WRF) for MPI/GAMMA, and a lot of patience :-)
Marco Maniezzo, Politecnico di Torino
He was author of a former GAMMA driver for the Broadcom BCM5700 chipset (``Tigon3'' family), now replaced by a new driver.
Institut fur Informatik, Universitat Potsdam:
Marco Ehlert (Master Student), under supervision of prof. Bettina Schnor.
Marco was the author of the GAMMA driver for the Netgear GA621/GA622 Gigabit Ethernet adapter (no longer maintained).

Please send suggestions and comments to:

Giuseppe Ciaccio, ciaccio AT disi.unige.it

Testbed	NIC	Latency	Latency	Asymptotic bandwidth	Half-power point	Half-power point
		(back to back)	(switch incl.)		(back to back)	(switch incl.)
1	Intel PRO/1000	6.1 µs	10.8 µs	123.4 MByte/s	1820 byte	7600 byte
				(MTU 4120 bytes)
2	Broadcom NetXtreme	14.6 µs	?? µs	119.4 MByte/s	3705 byte	?? byte
				(MTU 1500 bytes)