line

DISI Dipartimento di Informatica e Scienze dell'Informazione

line


GAMMA Project: Genoa Active Message MAchine



What's New: support for dual-CPU nodes, more process instances and more ports allowed (updated 13 August 2004)




Index




What is GAMMA?



Network Of Workstations (NOWs) and clusters of fast, well-equipped Personal Computers (PCs), interconnected by a standard LAN hardware (like Fast Ethernet, Gigabit Ethernet, Myrinet), and running the Linux operating system, are becoming more and more attractive as cheap and efficient platforms for parallel and distributed applications. The main drawback of a standard NOW/cluster architecture is the poor performance of the standard support to inter-process communication over any LAN hardware. Current implementations of industry-standard communication primitives (RPC), APIs (sockets), and protocols (TCP, UDP) usually show high communication latencies and low communication throughput.

We have developed an experimental system for inter-process communication, called the Genoa Active Message MAchine (GAMMA). GAMMA runs on clusters of PCs, based on Intel as well as AMD CPUs (Intel Pentium, AMD K6, and superior models), connected by Fast Ethernet or Gigabit Ethernet. Such a cluster can be used either as a set of autonomous workstations (in a student lab) or as a parallel computer for Single Program Multiple Data (SPMD) as well as MIMD applications.

The core of GAMMA is a custom device driver under Linux, which operates the Network Interface Card (NIC). The GAMMA driver delivers very low latency, high bandwidth communications using Active Ports, a mechanism derived from Active Messages. Both point-to-point and broadcast communications are provided. Broadcast communication exploits the Ethernet broadcast directly.

The GAMMA driver is able to manage standard IP traffic in addition to GAMMA fast communications. This means that using GAMMA on your LAN will not stop the standard UNIX network services.

The communication mechanisms implemented in the GAMMA driver are made available to application writers through the GAMMA user library. The GAMMA library provides support to application launch, process grouping, point-to-point/broadcast communications based on the Active Ports mechanisms, and some collective routines (barrier synchronization, gather, all-gather; more collective routines are being developed).

GAMMA provides two levels of QoS. The lower QoS level, corresponding to the fastest communications, is a best-effort service. With this service, network congestion and ``hot spots'' may cause the receiver NIC or even the LAN switch to loose packets by overrun. The other QoS level provides flow-controlled communication, therefore ensuring reliability up to hardware faults, at a negligible performance penalty.

Installing the GAMMA driver requires only very marginal changes to the original Linux kernel. The Linux kernel extended with the GAMMA driver has to run on each PC in the cluster.

The current version of GAMMA has been tested with a number of NICs, both Fast Ethernet (DEC 21143 chipset, 3COM 3c905) and Gigabit Ethernet (Alteon TIGON-II chipset, Netgear GA621, 3COM 3c996).

A porting of MPI atop GAMMA is now available.




People involved in the GAMMA project


External collaborations




Hardware requirements


A pool of PCs (Intel Pentium, AMD K6, or superior models), single-CPU as well as dual-CPU.

Each PC should have a NIC chosen in the following list:

The PCs must be networked together by a Gigabit Ethernet switch or a Fast Ethernet switch or hub.

In our own installation, the PCs are also networked by an additional LAN for standard IP traffic and services. Such additional LAN is not strictly required by GAMMA, as the GAMMA network device driver can manage both GAMMA and IP communications on the same LAN; however, separating GAMMA communications from IP will significantly boost performance.




Are GAMMA communications reliable?


GAMMA provides two levels of Quality of Service (QoS). The basic one is a ``best-effort'' transmission, with no flow control, implemented by functions gamma_send() , gamma_send_2p() , gamma_isend(), gamma_isend_2p() . The other one is a ``flow controlled'' transmission, implemented by functions gamma_send_flowctl() , gamma_send_2p_flowctl() , gamma_isend_flowctl(), gamma_isend_2p_flowctl() .

In all cases, however, the GAMMA protocol implements policies for detecting packet losses and is able to retransmit missing packets. GAMMA is not particularly efficient in recovering from packet losses; indeed, we argue that the probability of a packet loss due to a hardware error is negligible, and the possibility of loosing packets due to LAN congestion is better managed by preventing congestion from occurring (for instance, by using smart communication patterns in parallel jobs). In other words, GAMMA does provide packet retransmission because it is of great help indeed, but does not take care of efficiency when retransmission occurs because this is supposed to be rare.




Performance


We report here the measurement of the communication performance achieved at the application level. The following definitions hold:

Message delay, D(S), is half the round-trip time with a message of size S bytes, as measured by running a Ping-Pong GAMMA microbenchmark.

Latency is the message delay D(Smin) where Smin is the smallest message size allowed by the communication system. Smin = 0 with GAMMA, whereas Smin = 1 with TCP. With GAMMA, a zero-byte message is written to the NIC as a GAMMA frame header with no frame body; the NIC transmits it as a GAMMA frame header padded with a ``garbage'' body, to match the 72 byte min. frame size required by the IEEE 802.3 standard.

End-to-end throughput, T(S), is the transfer rate of the whole communication path, from the sender to the receiver:
T(S) = S/D(S).
Note: This definition implies that the throughput is measured using the ping-pong microbenchmark. Other techniques for throughput measurement are based on the transmission of long streams of messages from sender to receiver without any data flowing back; such techniques will measure a different thing, that we call transmission throughput (see below).

Asymptotic bandwidth, B, is the value of the following ``bandwidth'' function B(S) as the message size S > 0 approaches infinity: B(S) = (S - Smin)/(D(S) - D(Smin)). Since B(S) - T(S) approaches zero as S approaches infinity, B is commonly evaluated by measuring the end-to-end throughput T(S) with S very large.

Transmission throughput is the transfer rate perceived at the sender side, that is, the data rate at which an infinite stream of messages of fixed size can be pushed into the network without causing any data loss. This requires to run a different microbenchmark compared to ping-pong: Rather than exchanging a single message back and forth, the sender transmits a long stream of messages to the receiver and measures the time spent for the whole transmission. Clearly, the transmission throughput can be greater than the end-to-end throughput as it not necessarily takes into account the overhead and possible bottlenecks at the receiver side of communication.





Fast Ethernet (100 Mbps)

We report the performance numbers and graphs of the gamma_send_flowctl() send routine, which provides flow-controlled data transfer. Best-effort routines like gamma_send() offer marginally better performance with less reliability.

In a cluster of two single-CPU PCs, each with
AMD Athlon K7 500 MHz CPU,
100 MHz memory bus,
Linux 2.2.13 with the GAMMA device driver
connected by a crossover UTP cable, and running a GAMMA ``ping pong'' program, we get the following average communication performance at user level:

NIC Latency Asymptotic bandwidth Throughput graphs
DEC DE-500 BA (DEC 21143 chipset) 14.2 µs 12.1 MByte/s end-to-end throughput,
transmission throughput (unidirectional stream)
3COM 3c905C 17.7 µs 12.1 MByte/s end-to-end throughput,
transmission throughput (unidirectional stream)
Intel EtherExpress Pro/100 24.5 µs 12.1 MByte/s

It is clear from the above numbers that GAMMA exploits about 96% of the nominal link speed of Fast Ethernet (12.5 MByte/s) with long messages.

This performance remarkably improves over the communication performance of Linux TCP/IP sockets. Indeed, GAMMA is currently the most performing messaging system running on low-cost NOW configurations, yields the best price/performance ratio in the whole range of NOW architectures, and rivals many communication layers running on commercial multiprocessors, according to the table below:

Platform
Interconnection
network
Latency
(µs)
Asymptotic bandwidth
(MByte/s)
Fast Messages Myrinet (1.28 Gbit/s)

9.6

100.5

GAMMA 100base-T (DEC 21143 chipset), K7 500

14.3

12.1

Fast Messages 1000base-F (Packet Engines GNIC-II?)

14.7

81.5

M-VIA 100base-T (DEC 21143 chipset), Pentium II 400

23

11.9

U-Net 100base-T (DEC 21140 chipset), Pentium

30.0

12.1

U-Net 140 Mbit/s ATM

35.5

14.8

Linux 2.0.36 TCP/IP sockets 100base-T (DEC 21143 chipset), Pentium II 350

58

10.5

Thinking Machines CM-5 CMAML ports custom (Fat Tree)

15.0

8.5

IBM SP-2 MPL custom

44.8

34.9

Cray T3D PVMFAST custom

30.0

25.1




Gigabit Ethernet (1000 Mbps)

We report the performance numbers and pictures of the gamma_send_flowctl() send routine, which provides flow-controlled data transfer. Best-effort routines like gamma_send() offer marginally better performance with less reliability.

In a cluster of two single-CPU PCs, each with
Pentium III CPU @1GHz,
133 MHz FSB,
66 MHz 64 bit PCI bus,
SuperMicro 370DE6 motherboard, ServerSet III HE-SL chipset
Linux 2.4.16 with the GAMMA device driver
connected back-to-back by an fiber-optic cable, and running a GAMMA ``ping pong'' program, we get the following average communication performance at user level:

NIC Latency Asymptotic bandwidth Throughput pictures
NetGear GA620 (Alteon TIGON-II) 32 µs 82 MByte/s (normal frames),
123.6 MByte/s (Jumbo frames, MTU 5140 bytes)
end-to-end throughput,
NetGear GA621 8 µs 118.5 MByte/s (normal frames),
122 MByte/s (Jumbo frames, MTU 4116 bytes)
curves of end-to-end throughput and transmission throughput coming soon
3COM 3c996 12 µs 102 MByte/s (Jumbo frames) curves of end-to-end throughput and transmission throughput coming soon

With the Netgear GA621, GAMMA achieves a 95% exploitation of the nominal link speed of Gigabit Ethernet (125 MByte/s). This is a very good result, given the very small size of the standard MTU (1500 bytes). Increasing the MTU size up to 4116 bytes (``Jumbo Frames'') allows to achieve almost 98% of the nominal link speed.

GAMMA improves quite a lot over the communication performance of Linux TCP/IP sockets on the same interconnect. Indeed, GAMMA is currently the most performing messaging system for Gigabit Ethernet, according to the table below:

Platform
Interconnection
network
Latency
(µs)
Asymptotic bandwidth
(MByte/s)
BIP Myrinet (1.28 Gbit/s), Pentium Pro

4.3

125.6

PM Myrinet (1.28 Gbit/s), Pentium 166 MHz

7.5

113.5

GAMMA 1000base-F (Negear GA621), Pentium III 1 GHz

8.5

122

Fast Messages Myrinet (1.28 Gbit/s)

9.6

100.5

Fast Messages 1000base-F (Packet Engines GNIC-II?)

14.7

81.5

M-VIA 1000base-F (Packet Engines GNIC-II), Pentium II 400

19

60

GigaE-PM 1000base-F (Essential EC-440-SF), Pentium 150 MHz

24

58.3

GAMMA 1000base-F (Netgear GA620, Alteon TIGON-II), Pentium III 1 GHz

32

123.6

Linux 2.1.131 TCP/IP sockets (quoted from M-VIA FAQs) 1000base-F (Packet Engines GNIC-II), Pentium II 400

59

31

Note: The above performance table reports performance numbers from messaging systems running on two very different Gigabit technologies, namely Gigabit Ethernet, and Myrinet. Comparing a Gigabit Ethernet-based communication layer to a Myrinet-based one is not necessarily fair: Gigabit Ethernet measurements are usually taken by connecting two machines back-to-back; this is not the case with Myrinet. A fair comparison should take the hardware latency of a Gigabit Ethernet switch into account (3 - 4 usec).







Papers





How to install


The requirements to install GAMMA are:

If you plan to install GAMMA, please read the README_GAMMA file before starting the installation. The GAMMA source distribution itself contains a quite complete manual for installation and troubleshooting. Should you need additional information (e.g. you find the installation instructions unclear or incomplete), you can contact Giuseppe Ciaccio, ciaccio@disi.unige.it.

The current release of GAMMA for Linux 2.4.21 is dated 13 August 2004.


NOTE: This is the last release of GAMMA for Linux kernels of the 2.4 series. Next release will be for Linux 2.6.7.

Supported NICs:




Known Bugs


GAMMA allows more process instances of the same parallel job to run on the same CPU. Thread safety is granted with point-to-point communications when running on distinct GAMMA ports. However, collective routines (barrier synchronization, and broadcast) are not thread compatible, as they use predetermined GAMMA ports.

The GAMMA driver for 3COM 3c905C may show a transient failure the first time a GAMMA (or MPI/GAMMA) job is launched after a boot-up of the cluster.

The packet retransmission mechanism is not yet perfect. It will not work if the missing packet originates from a non-blocking send.




Please send suggestions and comments to:

Giuseppe Ciaccio, ciaccio@disi.unige.it


Last Updated: 13 August 2004