LQCD Message Passing API (Chip)
(Text version of http://www.jlab.org/~watson/lqcd/MessageAPI.htm)
Version 0.3
26 March 2001

This short note summarizes some of the requirements for message passing and gives a straw-man API. It still needs to be compared further with existing messaging APIs, but it does incorporate feedback from the group. Several people felt that transmitting a hypersurface would probably not be worth the effort, so for now that is dropped. In the next version I will revisit this, to take better advantage of the strided-access hardware of the QCDxx machines.

Capability Requirements

1. Barrier call (synchronize all nodes).
2. Send a contiguous message to a given node (identified by a single number; the application manages lexicographic ordering).
3. Send a message to a neighboring node (identified by a direction given as a signed number: the magnitude indicates the dimension and the sign indicates left/right; the library manages the mapping onto physical nodes).
4. Broadcast a message to all nodes.
5. Global sum for 32-bit and 64-bit floats, ints, and arrays of the same. Calls to do the same for complex numbers may also be provided, perhaps as convenience routines layered above these.

API Design: Performance Requirements

1. Design must allow for overlapping of computation and communications. Hence, initiating the send (or receive) of a message must be decoupled from testing for or waiting for its completion.
2. Design must allow for issuing multiple sends without waiting for the first to complete. Example: send in all 4 positive directions. This is important for Myrinet so that the overhead of “filling the pipe” is not incurred for each message.
3. Design must avoid forcing the use of barrier calls across the whole machine when all that is really needed is to wait for a single neighboring node. Therefore one must be able to poll or wait for receipt of a particular message, instead of using a global barrier.
4. Design must allow expensive operations involved in defining a communication to be done once, ahead of invoking the communication multiple times. Example: locking virtual pages in memory for use by a PCI DMA engine.
5. Attempts should be made to minimize bookkeeping overhead on the host and on any intelligent interfaces.

Hardware Issues

A design (probably not the only one possible) that addresses these performance constraints is something along the lines of a remote memory write library. Ignoring the scatter-gather issue, i.e. restricting the design to contiguous messages, consider the following behavior:

1. Node A declares an intent to (repetitively) receive into buffer Q. At this point pages are locked in memory, and physical addresses for Q are determined.
2. Node B declares an intent to (repetitively) send a buffer R to Node A’s buffer Q. At this point R is locked in memory, and physical addresses are determined. Also, whatever is necessary to compute the target destination (network address and perhaps also remote memory physical address) is done.
3. Node A initiates a receive operation for Q. If intelligent interconnects are used, this is a no-op, or for strong synchronization it may clear/set a flag (semaphore).
4. Node B initiates the transfer.
5. At some later time, Node A tests to see if Q has received new data.

Effectively, this defines a channel from B’s R to A’s Q, or allows B to write remotely into A’s memory, especially if step 3 is a no-op.

Hardware issues: On a Myrinet system, the receiving network interface card (NIC) can autonomously write into the receiving host’s memory, at an address determined by the NIC with no host involvement. Also, sends are queued in a FIFO, which satisfies performance requirement 2. On the QCDxx, each wire has hardware for both send and receive.

API Example

To see how this might look, below are a few representative calls, written as a C API.

    Host A:
        opaqueQ = declareReceive(bufferQ, nbytes, bufName);
        …
        startReceive(opaqueQ);

    Host B:
        opaqueR = declareSendTo(bufferR, nbytes, remoteHost, remoteBufName);
        …
        startSend(opaqueR);

As can be seen from this example, buffers are identified by an integer, which one can think of as the logical name of the buffer. This name is used by the remote node as the target of a write or send operation. Send and receive are asymmetric in that the receiver does not specify where the data is coming from.

For Myrinet, this send operation could be implemented as a single move instruction into a control FIFO of the Myrinet NIC, where the value moved (opaqueR.myri) could be a pointer to a structure previously created in the NIC’s memory. That structure could hold a pre-digested set of values to be moved into the PCI DMA engine (among other things). For the QCDOC, opaqueR could be a pointer to a structure containing all the values that need to be moved into the corresponding link’s transfer engine.

Message API

Machine Initialization

    int initMachine(void);
    int getMachineType(void);
        Returns -1 [butterfly], 0 [switch], >0 [grid].
    int shutdownMachine(void);
    void declareTopology(int dimensions[], int ndimensions, int localHost);
        This tells the library the grid problem size, where dimensions[0] is the x dimension and localHost is the number of this local host, numbered with x varying most rapidly.

Issue: should the application determine localHost (e.g. from startup args), or should the library figure that out?

I/O Declarations

    void * declareReceive(void * buffer, int nbytes, int bufName);
    void * declareSendTo(void * buffer, int nbytes, int remoteHost, int remoteBuffer);
        remoteHost is an integer in [0, #machines-1], where machines are presumed to know their machine number.
    void * declareSendRelative(void * buffer, int nbytes, int relativeHost, int remoteBuffer);
        relativeHost is a signed integer: +1 is +x, +2 is +y, -2 is -y, etc.
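
The relativeHost encoding and the x-fastest host numbering can be made concrete with a small sketch of how a direction might be turned into a host number on a switched machine. The function name neighborHost is hypothetical and is not part of the API in this note, and the periodic wrap-around at the grid edges is an assumption; the note does not say how edges are handled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the API in this note: map a signed
 * relativeHost code (+1 = +x, -1 = -x, +2 = +y, ...) to the host number
 * of that neighbor, with hosts numbered lexicographically, x fastest. */
int neighborHost(const int dimensions[], int ndimensions,
                 int localHost, int relativeHost)
{
    assert(relativeHost != 0 && abs(relativeHost) <= ndimensions);

    int dim  = abs(relativeHost) - 1;       /* 0 = x, 1 = y, ... */
    int step = relativeHost > 0 ? 1 : -1;   /* sign selects the side */

    /* Stride of this dimension in the lexicographic numbering. */
    int stride = 1;
    for (int d = 0; d < dim; d++)
        stride *= dimensions[d];

    int extent = dimensions[dim];
    int coord  = (localHost / stride) % extent;     /* coordinate along dim */
    int moved  = (coord + step + extent) % extent;  /* periodic wrap (assumed) */

    return localHost + (moved - coord) * stride;
}
```

For a 4 x 2 x 3 grid, host 0 would see host 1 in the +x direction and, with the assumed wrap-around, host 3 in the -x direction. A library for a grid machine would presumably select a physical link instead of computing a host number.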

Any error in these declaration calls returns a null pointer, and error information is retrieved via separate calls:

    char * getErrorString();
    int getErrorNumber();

I/O Operations

    void startSend(void * opaque);
    void startReceive(void * opaque);
    int testCompletion(void * opaque);
    int waitFor(void * opaque, int milliseconds);
    int waitForMultiple(void * opaque[], int count, int milliseconds);

A possible implementation idea for Myrinet is to have startSend/startReceive set a memory location, which is cleared on operation completion; testCompletion then just tests this memory location. For QCDOC this could operate on control registers.

    void sendBroadcast(void * buffer, int nbytes, int remoteBuffer);
    bool globalOr(bool b);
    int sumInt(int i);
    float sumFloat(float x);
    double sumDouble(double x);
    void sumFloatArray(float x[]);    // operation is done “in place”
    void sumDoubleArray(double x[]);  // ditto
    void waitForBarrier();
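
The five-step channel behavior described under Hardware Issues can be tried out with a single-process loopback sketch of the declare/start/test calls. Only the function signatures come from this note. The Channel structure, the MAX_BUFFERS limit, the memcpy standing in for the remote write, and the convention that testCompletion returns nonzero on completion are all assumptions made for illustration; a real implementation would lock pages and drive the NIC or link hardware instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BUFFERS 16   /* arbitrary limit for this sketch */

/* Hypothetical layout of the opaque handle; the note leaves it unspecified. */
typedef struct {
    void *buffer;
    int   nbytes;
    int   bufName;   /* receive: own name; send: name of the target buffer */
    int   pending;   /* 1 while a receive is outstanding */
} Channel;

static Channel  pool[2 * MAX_BUFFERS];
static int      poolUsed;
static Channel *receivers[MAX_BUFFERS];   /* receive channels by bufName */

static Channel *newChannel(void *buffer, int nbytes, int bufName)
{
    if (buffer == NULL || nbytes <= 0 || bufName < 0 ||
        bufName >= MAX_BUFFERS || poolUsed >= 2 * MAX_BUFFERS)
        return NULL;                      /* errors return a null pointer */
    Channel *c = &pool[poolUsed++];
    c->buffer  = buffer;
    c->nbytes  = nbytes;
    c->bufName = bufName;
    c->pending = 0;
    return c;
}

void *declareReceive(void *buffer, int nbytes, int bufName)
{
    Channel *c = newChannel(buffer, nbytes, bufName);
    if (c != NULL)
        receivers[bufName] = c;   /* a real library would lock pages here */
    return c;
}

void *declareSendTo(void *buffer, int nbytes, int remoteHost, int remoteBuffer)
{
    (void)remoteHost;   /* loopback sketch: every host is this process */
    return newChannel(buffer, nbytes, remoteBuffer);
}

void startReceive(void *opaque)
{
    ((Channel *)opaque)->pending = 1;   /* flag is cleared on completion */
}

void startSend(void *opaque)
{
    Channel *s = opaque;
    Channel *r = receivers[s->bufName];
    assert(r != NULL && r->nbytes >= s->nbytes);
    memcpy(r->buffer, s->buffer, (size_t)s->nbytes);  /* stands in for the remote write */
    r->pending = 0;                                   /* completes the receive */
}

int testCompletion(void *opaque)
{
    return ((Channel *)opaque)->pending == 0;   /* nonzero when complete (assumed) */
}
```

In use, the receiver would call declareReceive once and then startReceive before each exchange, the sender would call declareSendTo once and startSend per exchange, and the receiver would poll testCompletion between pieces of computation rather than entering a barrier.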