Subject: Re: Conference call this Thursday. Hi Rich, You suggested a concrete example of an implementation of Dslash. This requires establishing standards for describing the lattice geometry and for message passing maps. So, here is another straw man proposal for discussion in our conference call tomorrow. It is just an outline so far - no details fleshed out, yet. Regards, Carleton ---------------------------------------------------------------------- MILC-Inspired Dslash Scheme for SciDAC August 8, 2001 The following is inspired by MILC code practice, but attempts extensions for grid-based networks. Let's see whether it has the functionality we need. For Dslash we need to describe the lattice geometry and give information about the message passing map. First the lattice geometry. I haven't tried to integrate this scheme with Robert's SMP scheme, yet, but it should be easy to do so. Lattice geometry I wanted a scheme that does not enforce a particular layout, since some remapping may be required for optimization. We lay out the lattice fields on each node in the form TYPE field[NSITES_PER_NODE]; for the 5th dimension we would have, for example, TYPE field[L_S][NSITES_PER_NODE]; We provide data for ease in identifying the coordinates: typedef struct { short x,y,z,t; byte parity; checkerboard label } coords; so we have coords lat[NSITES_PER_NODE]; The complete lattice layout is defined by the following: int nx, ny, nz, nt, L_s; /* lattice dimensions */ int sites_on_node; /* number of sites on the node */ int even_sites_on_node, odd_sites_on_node; /* used a lot */ int mx, my, mz, mt; /* for grid-based machine: physical grid dimensions */ For quick reference, we may also want to store the result of a call to the message API getMachineType(), so int machinetype = getMachineType(); and we probably need an identifier for the lattice geometry in case we do remapping. geometry_index The order of sites and their arrangement across the nodes is not forced, so we also need the three functions int node_number(x,y,z,t) gives the CPU or node number (rank) to which site x,y,z,t is assigned int *node_coords(x,y,z,t) for a grid-based machine, gives the physical grid coordinate to which site x,y,z,t is assigned. Returns a grid coordinate vector of integers. int node_index(x,y,z,t) The lattice field index for the site x,y,z,t on the local node. So if the site is local, then the value of a field at that site is found from field[node_index(x,y,z,t)] These three functions and the coordinate values in lat[] define the "geometry" as I understand it from Robert's virtual geometry API proposal, so would be initialized at the start of the code and could be modified or replaced whenever remapping is needed. So they would presumably be placed in the proposed "geometry" structure. While the layout is not forced, some layouts are better for grid-based machines, so the geometry index says whether that has been done. If so, Dslash, message passing, etc can take advantage of optimizations. ---------------------------------------------------------------------- Next we need to work out the message passing pattern. The MILC procedure involves five main steps: make_gather A procedure that sets up the communications pattern for an arbitrary permutation map of sites onto sites. shifts are an important special case. This is called at any time, but generally only once at the beginning of execution. If the lattice geometry is changed through remapping, this procedure may have to be rerun. start_gather Sets up message passing for one of the gather maps and sends the message asynchronously. restart_gather Resends a message that was previously initialized by a start_gather. wait_gather Waits for messages to arrive. cleanup_gather Housekeeping: Frees storage. All of these procedures lie on top of the message API. We have found it very useful to standardize around them. The make_gather returns a value that identifies the map. This is passed to start_gather. In turn start_gather returns a pointer to a structure that defines the communication buffers, etc. This pointer is passed to restart_gather, wait_gather, and cleanup_gather. ---------------------------------------------------------------------- Dslash We come, finally to Dslash. The MILC procedure involves interleaving computations with calls to the above asynchronous gather routines for whatever shifts are required to define the Dirac operator. One waits for arriving data before using it. While this strategy hides the details of the layout and message passing, there is a weakness: in a typical gather, some data is local and some is remote, but with the MILC procedure, one does no calculation on any of the gathered data, local or remote, until the remote data has arrived. One could optimize by computing immediately with the local data involved in the gather, while waiting for off-node data. Still, one can organize Dslash so that a great deal of the message passing latency is hidden. So a combined shift, compute operation would be better where optimization is essential. However, when one multiplies the number of possible shifts by the number of possible computations and various numbers of operands in the computation, gathered or not, there are too many combinations. So, except for the specialized computations involved in Dslash and a few other rate-limiting parts of the code, I think we are better off segregating shifts and computation. ----------------------------------------------------------------------