Message Passing Activity
-------
Chip Watson
(see http://www.jlab.org/~watson/lqcd/MessagePassingActivity.html)

This short memo sketches the scope of the message passing work in our SciDAC proposal. It is meant primarily as a starting point for discussion.

Overview

An important class of LQCD applications will spend most of their time computing on a large lattice distributed across many nodes of a large computer. These codes would be most easily written for a shared memory architecture, but such architectures don’t scale well and are too expensive. Instead, we have a message passing architecture. In fact we have two message passing architectures, which may lean towards different software implementations and styles of programming interface.

For the QCDOC, message passing is cheap to initiate and has very low latency. As a consequence, it is possible to adopt a programming style that uses very many short messages.

For clusters, message passing can be relatively expensive if not well implemented, and in any case has much higher latency (a detailed performance discussion is given in the section below). On the other hand, non-nearest-neighbor communication on a cluster is as easy as nearest-neighbor communication because (currently) we are using switched interconnects. The higher cost of initiating messages on a cluster leads to a strategy of sending as few messages as possible. And because of the non-negligible latency (particularly with larger messages), it is necessary on clusters to craft the calculations carefully so that computation and communication overlap.

Any standardized API we adopt for message passing should take these architectural constraints into account.

Performance Observations

As a reference calculation, consider the Dirac operator on a 16^3 x 32 lattice.
Assume each node in a 256 node cluster (our FY01-02 prototypes at Jlab and FNAL) delivers 1 Gflops, that the nodes are arranged as a 2x4x4x8 grid, and that red-black pre-conditioning is used. Each node then processes 4^4 sites, and communicates 8 messages of 64 sites (twice). For our reference problem this means eight 3 KByte messages every 0.7 milliseconds. Myrinet, at 125 MB/s full duplex and 6 usec short message latency (and a few other relevant values), should allow sustaining about 85% of the nominal 1 Gflops across the cluster. A change of 1 usec in either the processor overhead to start a message or the node-to-node message latency results in about a 1%-2% change in cluster efficiency, so it is important to keep the processor overhead as low as possible.

Myrinet Implications & Strategy

When using Myrinet for communications, the pages in virtual memory containing the message to be sent must be locked in physical memory, and the physical memory address and length must be given to the I/O processor, which sets up a DMA from host memory to its own memory, then another DMA into the switching fabric. The receiving I/O processor DMAs from the switch fabric into its local memory, then DMAs the message into its host memory (where the destination pages must meanwhile have been locked). Effectively this results in 3 copies of the message: host to I/O, I/O to I/O, I/O to host. The 3 KB messages can be efficiently cut in half to pipeline these copies, but “filling the pipe” is still a non-negligible latency term.

For a single message, this also means a considerable amount of overhead (mapping, locking, DMA control structure initialization). Fortunately, LQCD has a highly repetitious message pattern, so these expensive operations can be done once and reused thereafter. To make this work, we must design an API that facilitates re-use of the setup costs of a transfer.
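The reference numbers quoted under Performance Observations can be checked with a little arithmetic. The sketch below is only a back-of-envelope aid, not the calculation behind the 85% estimate. The lattice, node grid, message count, 64-site message size and bandwidth come from the memo; the 48 bytes per site (a single-precision half spinor) is an assumption introduced here, and the function name reference_numbers is hypothetical.

```python
from math import prod

# Values quoted in the memo.
LATTICE = (16, 16, 16, 32)        # global lattice, 16^3 x 32
NODE_GRID = (2, 4, 4, 8)          # node arrangement, 256 nodes in total
SITES_PER_MESSAGE = 64            # message size in sites, as quoted
BANDWIDTH_BYTES_PER_S = 125e6     # Myrinet, 125 MB/s

# Assumption (not stated in the memo): each boundary site carries a
# single-precision half spinor, 2 spin x 3 colour x 2 (re, im) x 4 bytes.
BYTES_PER_SITE = 2 * 3 * 2 * 4


def reference_numbers():
    """Return back-of-envelope figures for the reference problem."""
    local_dims = tuple(extent // count
                       for extent, count in zip(LATTICE, NODE_GRID))
    local_sites = prod(local_dims)
    message_bytes = SITES_PER_MESSAGE * BYTES_PER_SITE
    return {
        "nodes": prod(NODE_GRID),
        "local_dims": local_dims,
        # Red-black ordering: each pass touches half of the local sites.
        "sites_per_parity": local_sites // 2,
        "message_bytes": message_bytes,
        # Pure bandwidth term for one message; latency and per-message
        # processor overhead are deliberately left out.
        "wire_time_us": message_bytes / BANDWIDTH_BYTES_PER_S * 1e6,
    }


if __name__ == "__main__":
    for name, value in reference_numbers().items():
        print(f"{name}: {value}")
```

Under the 48-byte assumption a 64-site message is 3072 bytes, which agrees with the 3 KByte figure, and the 256 sites per parity agree with 4^4. The bandwidth term alone comes to roughly 25 usec per message, which is why a few microseconds of latency or start-up overhead per message is a visible fraction of the total.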
The current Myrinet software (Linux driver and I/O processor firmware) is not optimized for such repeating messaging patterns, so Jlab will write new and/or modify existing software to avoid these overheads (already assumed in the above calculations) and shave a couple of microseconds off the existing GM short message latency (also already assumed above). We will also undertake to re-optimize global sums for both single and double precision numbers, attempting to carry out as much of the work as possible within the I/O processors.

The probable end result is one call which defines (half of) a communication path, along the lines of “copy so many bytes from this starting address to a named (numbered) buffer on a remote node”, plus a corresponding call on the other end to declare a named (numbered) buffer. These calls return opaque handles which can be used to drive the transfer through calls like doTransfer(opaque), checkTransfer(opaque) and pend(opaque).

SciDAC Organization

The message API activity spans multiple tasks (as enumerated in our proposal), and the following attempts to create a more detailed list of sub-tasks related to messaging. Sub-task numbers are relative to the proposal task numbers:

Task 2.1  Implement efficient message passing for Myrinet, aiming at minimal host and I/O processor overhead and message latency. The resulting low level calls may NOT have any relationship to MPI (i.e. NOT a subset).

Task 5.1  Analysis / design of a lattice QCD message API which is able to meet the needs of SZIN, MILC, CPS and expected future applications, and which is composable from the message primitives of sub-task 2.1.

Task 2.2  Implement this API on Myrinet.

Task 2.3  Implement this API on QCDOC.

Task 2.4  Implement this API on MPI, to allow portability to other architectures already supporting MPI; this implementation need not guarantee high performance, only portability.

Task 1.2  Test the performance of SZIN with optimal algorithms built on this API.
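As an illustration of the handle-based calls described under Myrinet Implications & Strategy, here is a minimal in-process sketch of the "set up once, trigger many times" pattern. It is a toy model, not a proposed implementation: only the names doTransfer, checkTransfer and pend come from the memo (written here in Python style as do_transfer, check_transfer and pend). The Node class, declare_buffer, define_transfer, the argument lists, and the assumed meanings of check_transfer (non-blocking test) and pend (blocking wait) are all hypothetical.

```python
class Node:
    """Hypothetical stand-in for a remote node holding numbered buffers."""

    def __init__(self):
        self.buffers = {}

    def declare_buffer(self, number, nbytes):
        """Receiving side: declare a numbered buffer of nbytes bytes."""
        self.buffers[number] = bytearray(nbytes)


class TransferHandle:
    """Opaque handle for one pre-defined communication path."""

    def __init__(self, source, start, nbytes, remote, number):
        target = remote.buffers[number]      # KeyError if never declared
        if nbytes > len(target) or start + nbytes > len(source):
            raise ValueError("transfer does not fit the declared buffers")
        # On real hardware the expensive steps (mapping, locking, DMA
        # control structure initialization) would be paid here, once.
        self._source, self._start, self._nbytes = source, start, nbytes
        self._target = target
        self._complete = True
        self.transfers_started = 0


def define_transfer(source, start, nbytes, remote, number):
    """Sending side: define a copy of nbytes from source[start:]."""
    return TransferHandle(source, start, nbytes, remote, number)


def do_transfer(handle):
    """Trigger the transfer; this toy version copies synchronously."""
    handle._complete = False
    stop = handle._start + handle._nbytes
    handle._target[:handle._nbytes] = handle._source[handle._start:stop]
    handle.transfers_started += 1
    handle._complete = True


def check_transfer(handle):
    """Assumed semantics: non-blocking completion test."""
    return handle._complete


def pend(handle):
    """Assumed semantics: block until the transfer has completed."""
    while not check_transfer(handle):
        pass


if __name__ == "__main__":
    neighbour = Node()
    neighbour.declare_buffer(0, 3072)
    face = bytearray(3072)
    handle = define_transfer(face, 0, 3072, neighbour, 0)  # set up once
    for iteration in range(1, 4):                           # reuse many times
        face[0] = iteration
        do_transfer(handle)
        # Interior computation would overlap with the transfer here.
        pend(handle)
    print(neighbour.buffers[0][0], handle.transfers_started)
```

The point of the design is that all validation and setup sit in define_transfer, so the per-iteration calls carry no arguments beyond the handle. A real implementation would start a DMA in do_transfer and return immediately, leaving check_transfer and pend to observe completion.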