PC-NOW98 Workshop on Personal Computers based Networks Of Workstations

Abstracts of Papers

PUPA: A Low-Latency Communication System for Fast Ethernet

by M. Verma and T.-c. Chiueh, Computer Science Department, SUNY at Stony Brook, USA.


Pupa is a low-latency communication system that provides the same quality of message delivery as TCP but is designed specifically for a parallel computing cluster connected by 100 Mbit/s Fast Ethernet. The implementation has been fully operational for over a year, and several systems have been built on top of Pupa, including a compiler-directed distributed shared virtual memory system, Locust, and a parallel multimedia index server, PAMIS. To minimize buffer management overhead, Pupa uses per-sender/per-receiver fixed-size FIFO buffers to optimize for the common case, rather than a shared variable-length linked-list buffer pool. In addition, Pupa features a sender-controlled acknowledgement scheme and an optimistic flow control scheme to reduce the overhead of providing reliable in-order message delivery. Our performance results show that Pupa is more than twice as fast as the fast path of TCP in terms of latency, and about 1.5 times better in terms of throughput. This paper presents the design decisions made during the development of Pupa, the results of a detailed performance study of the Pupa prototype, and implementation experiences from developing applications on top of Pupa.
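The buffering idea described above can be sketched as follows. This is a hypothetical miniature, not Pupa's actual code: each (sender, receiver) pair gets a fixed-size FIFO of fixed-size slots, so enqueue and dequeue are O(1) index updates with no free-list management, and a full FIFO is the signal that feeds the flow control scheme.

```python
class FixedFifo:
    """Illustrative per-sender/per-receiver FIFO of fixed-size slots."""

    def __init__(self, slots=8, slot_size=1500):
        self.slot_size = slot_size
        self.buf = [None] * slots
        self.head = 0          # next slot to dequeue
        self.tail = 0          # next slot to enqueue
        self.count = 0

    def enqueue(self, msg):
        if self.count == len(self.buf):
            return False       # FIFO full: sender must back off (flow control)
        if len(msg) > self.slot_size:
            raise ValueError("message exceeds fixed slot size")
        self.buf[self.tail] = msg
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1
        return True

    def dequeue(self):
        if self.count == 0:
            return None        # nothing pending from this sender
        msg = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return msg
```

Because each pair has its own FIFO, messages between a given pair are trivially delivered in order, which is part of what makes the common case cheap compared with a shared linked-list pool.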

Optimal Communication Performance on Fast Ethernet with GAMMA

by G. Ciaccio, DISI, University of Genoa, Italy.


The current prototype of the Genoa Active Message MAchine (GAMMA) is a low-overhead, Active Messages-based inter-process communication layer implemented mainly at kernel level in the Linux operating system. It runs on a pool of low-cost Pentium-based Personal Computers (PCs) networked by a low-cost 100base-TX Ethernet hub to form a low-cost message-passing parallel platform. In this paper we describe in detail how GAMMA achieves unprecedented communication performance (less than 13 microsec one-way user-to-user latency and up to 98% of the communication throughput of the raw interconnection hardware) on such a low-cost parallel architecture.
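The Active Messages idea GAMMA builds on can be sketched in a few lines (all names here are hypothetical, not GAMMA's API): each message names a receiver-side handler that is invoked directly on arrival, so data is consumed in the receive path itself rather than being buffered by the kernel for a later receive call.

```python
# Minimal sketch of active-message dispatch, assuming handler ids are
# agreed on by sender and receiver ahead of time.
handlers = {}

def register_handler(hid, fn):
    """Receiver registers a function for handler id 'hid'."""
    handlers[hid] = fn

def am_deliver(hid, payload, state):
    """On arrival, run the named handler immediately on the payload."""
    handlers[hid](payload, state)
```

Running the handler in the receive path is what removes the intermediate copy and most of the per-message bookkeeping, at the cost of requiring handlers to be short and non-blocking.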

BIP: a new protocol designed for high performance networking on Myrinet

by L. Prylli and B. Tourancheau, LIIPC & INRIA ReMaP, LIP, ENS-Lyon, France.


High speed networks now provide impressive performance, but software evolves slowly, and the old protocol stacks are no longer adequate for these communication speeds. When bandwidth increases, latency should decrease correspondingly to keep the system balanced. With current network technology, the main bottleneck is most often the software that interfaces the hardware to the user. We designed and implemented new transmission protocols, targeted primarily at parallel computing, that squeeze the most out of the high speed Myrinet network without wasting time on system calls or memory copies, giving all the speed to the applications. This design is presented here, together with experimental results showing real Gigabit/s throughput and less than 5 microsec latency on a cluster of PC workstations with this "cheap" network hardware. Moreover, our networking results compare favorably with those of expensive parallel computers and ATM LANs.
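The balance argument above can be made concrete with a standard back-of-the-envelope model (the model, not BIP's analysis): transfer time is roughly latency plus size over bandwidth, and the message size at which half the peak bandwidth is reached equals latency times bandwidth.

```python
def transfer_time(n_bytes, latency_s, bandwidth_Bps):
    """Simple linear cost model: startup latency plus serialization time."""
    return latency_s + n_bytes / bandwidth_Bps

# Plugging in the figures quoted in the abstract:
bandwidth = 1e9 / 8        # 1 Gbit/s expressed in bytes/s
latency = 5e-6             # 5 microseconds
n_half = latency * bandwidth   # message size reaching half of peak bandwidth
```

With these numbers `n_half` is 625 bytes, i.e. even sub-kilobyte messages run at half the link's peak rate; with a millisecond-class TCP stack on the same link, the half-power point would sit near 125 KB, which is why the software overhead, not the wire, is the bottleneck.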

COMPaS: A High Performance Pentium Pro PC-based SMP Cluster

by Y. Tanaka, M. Matsuda, M. Ando, K. Kubota, and M. Sato, Massively Parallel Perf. Lab., Real World Computing Partnership, Japan.


We have built COMPaS, a cluster of SMPs consisting of eight quad-processor Pentium Pro nodes. We designed and implemented a remote-memory-based user-level communication layer that provides low overhead and high bandwidth over Myrinet. To take advantage of the locality within each SMP node, we integrated multi-threaded programming with Solaris threads within nodes and message passing/remote memory operation based programming across nodes. In this paper, we report the basic performance of COMPaS; the design, implementation, and performance of our communication primitives; a hybrid shared memory/distributed memory programming style on COMPaS and its preliminary evaluation; and the performance characteristics of COMPaS.
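The hybrid model can be illustrated in miniature (this sketch is purely illustrative; a queue stands in for Myrinet remote memory operations, and `node_work` is an invented name): threads inside a node share memory directly, while nodes exchange results only through explicit messages.

```python
import threading
import queue

def node_work(node_id, data, out_channel, nthreads=4):
    """Each 'node' sums its data with shared-memory threads, then sends
    one message with the node-local result over the inter-node channel."""
    partial = [0] * nthreads

    def worker(t):
        # Threads read 'data' and write 'partial' directly: intra-node
        # shared memory, no message passing.
        partial[t] = sum(data[t::nthreads])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Inter-node communication: one explicit message per node.
    out_channel.put((node_id, sum(partial)))
```

The point of the hybrid style is exactly this asymmetry: cheap fine-grained sharing within the quad-processor node, and coarser explicit transfers between nodes.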

PULC: ParaStation User-Level Communication. Design and Overview

by J.M. Blum, T.M. Warschko, and W.F. Tichy, Dept. of Informatics, University of Karlsruhe, Germany.


PULC is a user-level communication library for workstation clusters. PULC provides a multi-user, multi-programming library for user-level communication on top of high-speed communication hardware. In this paper, we describe the design of the communication subsystem, a first implementation on top of the ParaStation communication card, and benchmark results for this implementation. PULC removes the operating system from the communication path and offers a multi-process environment with user-space communication. Additionally, we have moved some operating system functionality to user level to provide higher efficiency and flexibility. Message demultiplexing, protocol processing, hardware interfacing, and mutual exclusion of critical sections are all implemented at user level. PULC offers the programmer multiple interfaces, including TCP user-level sockets, MPI, PVM, and Active Messages. Throughput and latency are close to the hardware performance (e.g., the TCP socket protocol has a latency of less than 9 us).
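User-level demultiplexing, one of the responsibilities the abstract moves out of the kernel, can be sketched as follows (the names `open_port`, `demux`, and `recv` are hypothetical, not PULC's interface): the library routes each incoming packet to the right endpoint's queue based on a port field in the header, entirely in user space.

```python
from collections import deque

ports = {}

def open_port(port):
    """Create a per-endpoint receive queue, held in user space."""
    ports[port] = deque()

def demux(packet):
    """Route an incoming packet by its header's port field: this is the
    demultiplexing step the kernel would normally perform."""
    port, payload = packet
    ports[port].append(payload)

def recv(port):
    """Endpoint pulls from its own queue; no system call involved."""
    q = ports[port]
    return q.popleft() if q else None
```

Keeping this routing step in the library is what lets several processes and protocol personalities (sockets, MPI, PVM, Active Messages) share one network card without a kernel crossing per message.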

Eliminating the Protocol Stack for Socket based Communication in Shared Memory Interconnects

by S.J. Ryan and H. Bryhni, Dept. of Informatics, University of Oslo, Norway.


We show how the traditional protocol stack, such as TCP/IP, can be eliminated for socket based high speed communication within a cluster. The SCI shared memory interconnect is used as an example, and we demonstrate how existing applications can utilize the new technology without relinking. This is done by dynamically remapping the TCP/IP socket implementation to our high performance SCILAN sockets. We describe a novel mechanism for synchronization of communication through shared memory, aimed at minimizing the interrupt load on the receiving system. We discuss the implementation and present an evaluation with comparison to alternative technologies, such as 100baseT and ATM. Significant improvements over current solutions are shown in terms of both throughput and latency.
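One common way to synchronize through shared memory while keeping interrupts off the receive path can be sketched like this (details here are assumptions for illustration, not the paper's SCILAN mechanism): the sender writes the data and then sets a flag word in the shared segment, and the receiver polls that flag briefly before it would fall back to blocking, so in the common case no interrupt is raised on the receiving system.

```python
class SharedSegment:
    """Stand-in for a memory region mapped by both sender and receiver."""
    def __init__(self):
        self.data = None
        self.ready = 0   # flag word the receiver polls

def shm_send(seg, msg):
    seg.data = msg       # write payload first...
    seg.ready = 1        # ...then a single flag write signals availability

def shm_recv(seg, spin=1000):
    for _ in range(spin):        # poll: no interrupt in the common case
        if seg.ready:
            seg.ready = 0
            return seg.data
    return None                  # here a real system would arm an interrupt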

Porting of a Molecular Dynamics application on GAMMA

by G. Ciaccio and V. Di Martino, CASPUR c/o Universita' "La Sapienza", Roma, Italy.


GAMMA stands for Genoa Active Message MAchine, a low-cost cluster of Personal Computers (PCs) with commodity (100base-T Ethernet) network hardware and an enhanced Linux operating system that achieves low-latency message exchange based on the Active Message paradigm. The goal of our work is the porting to GAMMA of a communication-intensive Molecular Dynamics code used to study polarizable fluids. First, the GAMMA library routines were extended to permit calls from Fortran code. The second step was the modification of the communication policy in the original PVM code to fit the hardware and software communication layer. The third step, still in progress, is benchmarking and performance comparison among GAMMA, a 100base-T Ethernet LAN of PCs running PVM, and a cluster of SMP computers with proprietary network solutions. In this paper we discuss the difficulties encountered and the solutions adopted to port the application to GAMMA while preserving the original speedups and overall performance obtained on higher-cost vendor NOWs.

MPI on NT: A preliminary evaluation of the available systems

by M. Baker and G. Fox, CSM, University of Portsmouth, UK.


The aim of this paper is to discuss the functionality and performance of the current generation of MPI environments available for NT. The three environments investigated are WinMPICH from the Engineering Research Center at Mississippi State University, WMPI from the Instituto Superior de Engenharia de Coimbra, Portugal, and FM-MPI from the Dept. of Computer Science at the University of Illinois at Urbana-Champaign. In the first part of the paper we briefly discuss the advantages of using clusters of workstations and then describe NT and the MPI environments being investigated. In the second part we report on our experiences of assessing the functionality of these environments. In the third part we make a preliminary evaluation of their performance characteristics using the ParkBench benchmark suite. Finally, we summarise our findings and suggest a number of improvements that could be made to the environments assessed.

chiola@disi.unige.it, Jan. 19, 1998