The GAMMA API

Dipartimento di Informatica e Scienze dell'Informazione

GAMMA: The Genoa Active Message MAchine

The Application Programming Interface (API) of GAMMA

The GAMMA communication library provides functions for process grouping, point-to-point communication, and collective communications at the application level. Both C and FORTRAN calls are provided. Here we describe only the C interface.

This is a list of the GAMMA library functions and variables.

Initiate/terminate parallel section of a job:
`gamma_init()`	`gamma_exit()`
Set up communication ports:
`gamma_set_active_port()`	`gamma_set_passive_port()`
`gamma_post_recv()`
Send routines, blocking:
`gamma_send()`	`gamma_send_flowctl()`
`gamma_send_2p()`	`gamma_send_2p_flowctl()`
Send routines, non-blocking:
`gamma_isend()`	`gamma_isend_flowctl()`
`gamma_isend_2p()`	`gamma_isend_2p_flowctl()`
`gamma_wsend()`	`gamma_tsend()`
Synchronize on message arrivals:
`gamma_signal()`	`gamma_sigerr()`
`gamma_wait()`	`gamma_test()`
Miscellaneous:
`gamma_atomic()`	`gamma_sync()`
`gamma_my_par_pid()`	`gamma_my_node()`
`gamma_how_many_nodes()`	`gamma_mlock()`
`gamma_munlock()`	`gamma_munlockall()`
`gamma_time()`	`gamma_time_diff()`
`gamma_active_port`	`gamma_msglen`

GAMMA functions are built on top of a small set of custom system calls, activated through trap address 0x81, which enters the kernel in the GAMMA device driver through a short and fast code path.

Each library function, with the exception of gamma_time() and gamma_time_diff(), returns a negative integer value in case of error, and a non-negative integer value in case of successful completion.

The programming interface is currently defined as follows:

int gamma_init ( unsigned char num_nodes, int argc, char **argv );

A parallel computation is started.

When a sequential user process P invokes it, a process group called virtual GAMMA is activated. The group is composed of process P, running on the local workstation, plus num_nodes-1 additional processes identical to P, launched on num_nodes-1 distinct remote workstations (chosen from those listed in the file /etc/gamma.conf) via the ``rsh'' command.

Hence after having invoked gamma_init() the invoking user process P is replicated on num_nodes workstations in the cluster, thus forming a running SPMD parallel application.

The process replicas themselves eventually invoke gamma_init(), but this time the effect is that of registering themselves with the created group, without creating new ones.

A positive number called ``parallel pid'' uniquely identifies the newly created process group in the cluster.

Note that nothing prevents two independent user processes P and Q from invoking gamma_init() separately from one another. This will result in the creation of two distinct GAMMA process groups in the same cluster, each with a distinct ``parallel pid''. The two groups may share some or even all the available workstations in the cluster, but cannot share processes.

Currently, invoking gamma_init() with num_nodes less than or equal to zero, or greater than the total number of workstations connected to the cluster, has the same effect as if num_nodes were equal to the total number of workstations connected to the cluster.
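The rule above can be restated as a small helper. This is only a sketch: effective_num_nodes and total_nodes are hypothetical names that do not belong to GAMMA, and the library is described as applying this adjustment internally.

```c
#include <assert.h>

/* Hypothetical helper, not part of GAMMA: models how gamma_init() is
 * described as treating an out-of-range num_nodes request. */
int effective_num_nodes(int requested, int total_nodes)
{
    assert(total_nodes > 0);
    if (requested <= 0 || requested > total_nodes)
        return total_nodes;   /* out of range: use every workstation */
    return requested;         /* in range: honoured as given */
}
```

For instance, on a hypothetical 8-workstation cluster a request for 0 or 12 nodes would yield 8, while a request for 3 would yield 3.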

int gamma_exit (void);

The invoking process terminates the parallel computation, exiting from its process group. The process that created the group (and got instance number 0) will destroy the group as soon as every other process instance has left the group.

int gamma_set_active_port ( unsigned short port, unsigned short dest_node, unsigned char dest_par_pid, unsigned short dest_port, void (*receiver_handler)(void), unsigned short semaphore, unsigned char buffer_kind, void *destination_buffer, unsigned long buffer_len );

int gamma_set_passive_port ( unsigned short port, unsigned short dest_node, unsigned char dest_par_pid, unsigned short dest_port, unsigned short semaphore, unsigned char buffer_kind, void *destination_buffer, unsigned long buffer_len );

Activation of one out of the 1025 bidirectional communication ports of the calling process, numbered from 0 to 1024. Ports 1023 and 1024 are currently reserved to GAMMA collective routines (broadcast and barrier synchronization respectively). The calling process must have previously invoked gamma_init().
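The port numbering just described can be captured in a short check. The sketch below is illustrative only: the MODEL_* constants and model_port_is_usable are hypothetical names, and GAMMA itself may report an invalid port differently (for example through a negative return value).

```c
/* Hypothetical names for the port numbering described above. */
enum {
    MODEL_NUM_PORTS      = 1025,  /* ports are numbered 0..1024 */
    MODEL_BROADCAST_PORT = 1023,  /* reserved: broadcast */
    MODEL_BARRIER_PORT   = 1024   /* reserved: barrier synchronization */
};

/* Returns 1 if an application may use this port number, 0 otherwise. */
int model_port_is_usable(long port)
{
    if (port < 0 || port >= MODEL_NUM_PORTS)
        return 0;   /* outside the range 0..1024 */
    if (port == MODEL_BROADCAST_PORT || port == MODEL_BARRIER_PORT)
        return 0;   /* reserved to the collective routines */
    return 1;
}
```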

The communication port may be programmed for output, input, or both.

An output port must be bound to an input port of a remote receiver process, to which outgoing messages are to be delivered. This remote port is fully specified by the triple dest_node (instance number of the receiver process), dest_par_pid (``parallel pid'' of the process group to which the receiver process belongs), and dest_port (a specific input port of the receiver process). Note that inter-group communication is allowed. A process is not allowed to connect a port to itself for output.

Parameter dest_node may be set to the constant BROADCAST. In this case, each message transmitted through the port will be broadcast to each process in the group specified by dest_par_pid (excluding the sender itself). Each receiver process will get the message through its local port specified by dest_port.

An input port must be bound to a destination buffer, a notification semaphore, and a receiver handler (active ports only).

The destination buffer is a contiguous virtual memory region in application space; its size in bytes is specified by buffer_len. Any non-empty message arriving at the port will be stored in this buffer. Specifying a destination buffer is mandatory only if non-empty messages are to be received.

Many common data structures (for instance, arrays) span contiguous regions in virtual memory space, so in most cases there is no need to provide separate buffers for incoming messages.

If the current message fits the destination buffer exactly, the next message hitting the same port will be stored at the beginning of the same buffer, thus overwriting the current one (unless the port has been bound to a different destination buffer meanwhile).

If a message arrives which is larger than the destination buffer, then the message is truncated to fit the buffer.

If the current message is shorter than the destination buffer, and the port has not been bound to a different destination buffer before a new message hits the port, then the next message will be stored in the same destination buffer: either contiguously after the previous message (if buffer_kind is set to GO_AHEAD), or at the beginning of the buffer itself (if buffer_kind is set to GO_BACK). The former mode helps build gather-like communication patterns; in that case, however, if the new message is larger than the remaining room in the destination buffer, it is truncated to fit the buffer.
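The placement and truncation rules above can be modelled in plain C, without the GAMMA library. The following is a sketch under stated assumptions: struct model_port, model_deliver and the MODEL_* constants are hypothetical names standing in for the real port state and for GO_AHEAD / GO_BACK, and the behaviour once a GO_AHEAD buffer has been filled completely (restarting from the beginning) is an assumption extrapolated from the exact-fit case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the GO_AHEAD / GO_BACK buffer kinds. */
enum model_buffer_kind { MODEL_GO_BACK, MODEL_GO_AHEAD };

struct model_port {
    unsigned char *buf;            /* destination buffer */
    size_t len;                    /* buffer_len, in bytes */
    size_t next;                   /* offset where the next message lands */
    enum model_buffer_kind kind;
};

/* Store one message; returns the number of bytes actually kept. */
size_t model_deliver(struct model_port *p, const void *msg, size_t msg_len)
{
    assert(p != NULL && p->next <= p->len);
    /* Assumption: once the buffer is full, the next message starts over. */
    if (p->next == p->len)
        p->next = 0;
    size_t room = p->len - p->next;
    size_t kept = msg_len < room ? msg_len : room;   /* truncate to fit */
    if (kept > 0)
        memcpy(p->buf + p->next, msg, kept);
    /* GO_AHEAD appends after this message; GO_BACK rewinds to the start. */
    p->next = (p->kind == MODEL_GO_AHEAD) ? p->next + kept : 0;
    return kept;
}
```

With a 4-byte buffer in the GO_AHEAD model, delivering a 3-byte message and then a 4-byte one keeps only the first byte of the second message, which is the truncation case mentioned in the text.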

The receiver handler is an application-defined function, which will be executed each time a new message hits a port, provided the port has been set up by invoking the gamma_set_active_port() routine. Empty messages hitting the port will trigger the receiver handler as well.

The receiver handler will run after the message body (if any) has been copied to the destination buffer (if any). New messages hitting the port will not be stored into the destination buffer before the receiver handler has run to completion.

A receiver handler should not loop forever. It may in turn invoke any GAMMA call; however, invoking the GAMMA flow-controlled send routines from a handler may lead to a deadlock.

In order to allow a receiver process to synchronize with input events (message arrivals, handler activity) in a safe way, GAMMA provides 1025 per-process notification semaphores numbered from 0 to 1024. Semaphores 1023 and 1024 are reserved to GAMMA collective routines (broadcast and barrier synchronization respectively). Each port being used for input must be associated with one such semaphore. Each time a message hits the port, its semaphore is incremented by one. Additionally, receiver handlers may increment other semaphores, if programmed to do so, by invoking gamma_signal(). A receiver process can wait for message arrivals or handler activity by invoking gamma_wait() or gamma_test(). Semaphores are initialized to zero by gamma_init().

Recall that any GAMMA port can be programmed for output and input simultaneously, provided the correct parameters are passed to the gamma_set_active_port() or gamma_set_passive_port() routine. Whether a port actually acts as an input or an output port depends on how the application uses it.

int gamma_post_recv ( unsigned short input_port, void *destination_buffer, unsigned long buffer_len );

The port specified by input_port is bound to the specified destination buffer. The next message hitting the port will be stored into this buffer. The buffer is required to span a contiguous region of virtual memory. Many commonly used data structures (arrays, for instance) fulfill this requirement.

This is a low-overhead alternative to the gamma_set_active_port() and gamma_set_passive_port() functions. It does not require invoking any system call, as the buffer address and size are actually kept in the user data segment. Its intended use is within receiver handlers, in order to prepare a fresh application-space buffer for incoming messages after having consumed the previous one.

int gamma_send ( unsigned short output_port, void *data, unsigned long len );

int gamma_send_flowctl ( unsigned short output_port, void *data, unsigned long len );

int gamma_send_2p ( unsigned short output_port, void *data1, unsigned long len1, void *data2, unsigned long len2 );

int gamma_send_2p_flowctl ( unsigned short output_port, void *data1, unsigned long len1, void *data2, unsigned long len2 );

A message is sent through the port specified by output_port with blocking semantics. The output port must previously have been bound to a remote destination by the gamma_set_active_port() or gamma_set_passive_port() function. With the _2p variants, the message is composed of two ``pieces'', possibly stored in two distinct memory regions in user space and possibly of different sizes (2-way gather). The first ``piece'' (specified by data1 and len1) is not allowed to be larger than 20 bytes. These routines are intended as support for MPI/GAMMA.

int gamma_isend ( unsigned short output_port, void *data, unsigned long len );

int gamma_isend_flowctl ( unsigned short output_port, void *data, unsigned long len );

int gamma_isend_2p ( unsigned short output_port, void *data1, unsigned long len1, void *data2, unsigned long len2 );

int gamma_isend_2p_flowctl ( unsigned short output_port, void *data1, unsigned long len1, void *data2, unsigned long len2 );

These are similar to gamma_send(), gamma_send_flowctl(), gamma_send_2p(), and gamma_send_2p_flowctl() respectively, but with non-blocking semantics. A ``handle'' is returned, which unambiguously identifies the initiated send operation and can be used to wait or test for its completion (see gamma_wsend(), gamma_tsend()).

The memory region(s) referred to by these non-blocking send routines should have previously been locked and prefetched in physical RAM (see gamma_mlock()).

int gamma_wsend ( unsigned long handle );

int gamma_tsend ( unsigned long handle );

Wait or test for completion of non-blocking send operations initiated by the gamma_isend(), gamma_isend_flowctl(), gamma_isend_2p(), and gamma_isend_2p_flowctl() routines. Function gamma_wsend() blocks until the send operation specified by handle completes. Function gamma_tsend() returns 1 if the send operation specified by handle has completed, otherwise 0.

int gamma_signal ( unsigned short sem );

In order to allow a receiver process to cooperate and synchronize with receiver handlers in a safe way, GAMMA provides 1025 per-process semaphores numbered from 0 to 1024. Semaphores 1023 and 1024 are reserved to GAMMA collective routines (broadcast and barrier synchronization respectively).

Semaphores are initialized to zero by gamma_init().

gamma_signal(sem) causes semaphore sem to be atomically incremented by one.

Typically this function is called by a receiver handler in order to notify the main thread of the receiver process that a message has arrived.

int gamma_sigerr ( unsigned short sem );

GAMMA also provides 1025 per-process error semaphores numbered from 0 to 1024. Semaphores 1023 and 1024 are reserved to GAMMA collective routines (broadcast and barrier synchronization respectively).

Error semaphores are initialized to zero by gamma_init().

gamma_sigerr(sem) causes error semaphore sem to be atomically incremented by one.

Typically this function is called by a receiver handler in order to notify the main thread of a process of a receive anomaly.

int gamma_wait ( unsigned short sem, unsigned long n );

The invoking process busy-waits until semaphore sem reaches the value n. Semaphore sem is atomically decremented by n upon return.

Typically this function is invoked by a process waiting for message arrivals. Semaphore sem is typically incremented by some receiver handler calling gamma_signal(). During the busy-waiting the NIC is polled for incoming frames, so as to speed up message arrivals by avoiding IRQ overheads. However, this is only an optimization, which does not change the semantics.

On return, gamma_wait() yields zero if no receive errors were encountered; otherwise it yields a negative number whose absolute value is the number of times gamma_sigerr() has been called on error semaphore sem since the last call to gamma_wait().

int gamma_test ( unsigned short sem );

Returns the current value of semaphore sem. The value of sem is left unchanged.
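The interplay of gamma_signal(), gamma_sigerr(), gamma_wait() and gamma_test() can be sketched as a single-threaded model. Everything named model_* below is hypothetical and is not GAMMA code; in particular, the poll callback merely stands in for NIC polling and handler execution, and resetting the error count after each wait is an assumption drawn from the phrase ``since last run of gamma_wait''.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one notification semaphore and its error twin. */
struct model_sem {
    unsigned long count;    /* incremented by arrivals or gamma_signal() */
    unsigned long errors;   /* incremented by gamma_sigerr() */
};

void model_signal(struct model_sem *s) { s->count++; }
void model_sigerr(struct model_sem *s) { s->errors++; }

/* Like gamma_test(): read the value without changing it. */
unsigned long model_test(const struct model_sem *s) { return s->count; }

/* Example poll callback: each poll delivers exactly one message. */
void model_poll_one_arrival(struct model_sem *s) { model_signal(s); }

/* Busy-wait until the count reaches n, consume n, and report errors.
 * `poll` must eventually signal the semaphore, or this never returns. */
int model_wait(struct model_sem *s, unsigned long n,
               void (*poll)(struct model_sem *))
{
    assert(s != NULL && poll != NULL);
    while (s->count < n)
        poll(s);
    s->count -= n;
    int rc = -(int)s->errors;   /* zero when no errors were recorded */
    s->errors = 0;              /* assumption: restart the error count */
    return rc;
}
```

In this model, a semaphore holding 2 with two recorded errors makes model_wait(&s, 1, ...) return -2 and leaves the value at 1; a subsequent wait for 1 returns 0.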

int gamma_atomic ( void (*funct)(void) );

Function funct is executed atomically, that is, its execution will not be interleaved with that of any receiver handler. This allows any function of the user program to be called safely even if it shares data structures with receiver handlers.

int gamma_sync (void);

Barrier synchronization among all processes within a process group. After calling gamma_sync(), the caller process resumes execution successfully (that is, without error code) only when all other processes in the same group of the caller have reached the gamma_sync() function.

By exploiting a two-token synchronization mechanism, the GAMMA implementation of this collective communication primitive achieves best performance over shared Fast Ethernet channels.

int gamma_my_par_pid (void);

Returns the ``parallel pid'' of the GAMMA process group of the caller, as assigned by the previous call to function gamma_init().

int gamma_my_node (void);

Returns the instance number of the caller process, relative to the GAMMA process group of the caller itself. If the group counts num_nodes processes, the returned value will be in the range from 0 to num_nodes-1. The process which created the process group always has instance number zero.

The programming paradigm supported by GAMMA is Single Program Multiple Data (SPMD). In this paradigm, each process may differentiate its behaviour by testing its own instance number.
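A common way to differentiate behaviour by instance number is to give each process a contiguous block of the data. The sketch below shows one such block partitioning; model_block_range is a hypothetical helper, not a GAMMA function, and in a real program its node and num_nodes arguments would presumably come from gamma_my_node() and gamma_how_many_nodes().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: half-open range [*begin, *end) of the items owned
 * by instance `node` out of `num_nodes`, for `n_items` items in total. */
void model_block_range(size_t n_items, int node, int num_nodes,
                       size_t *begin, size_t *end)
{
    assert(num_nodes > 0 && node >= 0 && node < num_nodes);
    size_t base  = n_items / (size_t)num_nodes;
    size_t extra = n_items % (size_t)num_nodes;  /* leftover items */
    size_t i = (size_t)node;
    /* The first `extra` instances each take one leftover item. */
    *begin = i * base + (i < extra ? i : extra);
    *end   = *begin + base + (i < extra ? 1 : 0);
}
```

With 10 items and 4 instances, instances 0 and 1 own three items each ([0,3) and [3,6)), and instances 2 and 3 own two each ([6,8) and [8,10)).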

int gamma_how_many_nodes (void);

Returns the number of process instances belonging to the GAMMA process group of the caller process.

int gamma_mlock ( void *buffer, unsigned long len );

This function pre-fetches and locks into physical RAM a contiguous region in the virtual memory of the calling process starting from address buffer and counting len bytes.

Usually such a contiguous memory region holds outgoing messages to be sent by a non-blocking, zero-copy send routine. It must be pre-fetched and locked into physical RAM so that the DMA engine of the network adapter does not upload nonexistent pages on transmission.

gamma_mlock() adds the pre-fetch functionality to the standard UNIX mlock() function.

int gamma_munlock ( void *buffer, unsigned long len );

int gamma_munlockall (void);

These functions unlock previously locked memory regions. They are very similar to the standard UNIX munlock() and munlockall() calls.

void gamma_time(time_586 t);

The content of Pentium's register TSC is copied to variable t. Type time_586 is defined as struct { unsigned long hi; unsigned long lo; }

Register TSC is incremented by one at each CPU clock tick, so this function is useful for time measurements involved in performance evaluations.

double gamma_time_diff(time_586 b, time_586 a);

The time interval between instants b and a (possibly recorded by means of the gamma_time function) is computed in microseconds and returned as result.

Currently the conversion from CPU clock ticks to microseconds requires a constant named CLOCK to be set to the CPU clock frequency in MHz before compiling the GAMMA library. More information in the README file enclosed with the GAMMA source code.
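The conversion from clock ticks to microseconds can be illustrated as follows. This is a sketch of the arithmetic only, not the library's code: struct model_time mirrors time_586 with fixed 32-bit halves (an assumption about the width of unsigned long on the target), clock_mhz is passed as a parameter instead of the compile-time CLOCK constant, and b is assumed to be the later instant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of time_586, with fixed 32-bit halves. */
struct model_time { uint32_t hi; uint32_t lo; };

/* Combine the two halves into one 64-bit tick count. */
static uint64_t model_ticks(struct model_time t)
{
    return ((uint64_t)t.hi << 32) | t.lo;
}

/* Microseconds elapsed from a to b, for a clock of `clock_mhz` MHz. */
double model_time_diff(struct model_time b, struct model_time a,
                       double clock_mhz)
{
    assert(clock_mhz > 0.0);
    uint64_t ticks = model_ticks(b) - model_ticks(a);
    return (double)ticks / clock_mhz;   /* MHz = ticks per microsecond */
}
```

For example, 2000 ticks on a 200 MHz clock correspond to 10 microseconds.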

int gamma_active_port;

During the execution of a receiver handler, this variable holds the number of the port that triggered the execution of the handler.

int gamma_msglen;

During the execution of a receiver handler, this variable holds the size of the message whose arrival triggered the execution of the handler.

Giuseppe Ciaccio, ciaccio@disi.unige.it