From: Carleton DeTar
To: brower@ctpup.mit.edu, edwards@jlab.org
CC: detar@physics.utah.edu, osborn@physics.utah.edu
Subject: Another trial balloon

Hi Rich and Robert,

Here is a revised suggestion for dealing with layouts at Level 2. In response to comments in our last call, I tried to remove some of the overly general features, at least for now. In this revision I wanted to start thinking about communication as well as lattice-wide linear algebra.

My intention is to provide enough flexibility in the layout choice that the possibilities for optimization are not limited here. And I wanted to adopt an approach that can be easily implemented, either with the current MILC scheme of sites sending to sites, or with the QCDOC scheme of processors living on grids sending to neighbors on the physical grid.

The layout issue has three components: (1) We have to specify how lattice values are distributed among the processors, (2) for each processor we have to specify how its assigned values are arranged in its memory, and (3) for each data type (matrix, vector, etc.) we must specify how the member elements are arranged. I will not deal with (3) for the moment.

How much has to be specified for (1) and (2) depends on the operation. For parallel transport over a lattice link, this layout must be known precisely. For site-local lattice-wide linear algebra operations, we can be less precise - e.g. if we want to do the same operation on all even sites, we only need to know which ones are even, and don't care beyond that how the even sites are ordered.

(1) Layout specification: For defining the layout, I would like to propose adopting the method used in the MILC code. At startup, we call one of a variety of layout routines, one of which is compiled into the code. This gives us the flexibility of arranging data by time slices, by division on the longest remaining dimension, or, should we desire, by various cache optimization schemes.
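
To make the idea of a compiled-in layout routine concrete, here is a minimal sketch in C of one possible choice, division by time slices. It is only an illustration under stated assumptions, not the MILC implementation: the lattice dimensions, the processor count NPE, the helper site_parity, and the rule of storing even sites first are all made up for the example. The names pe_number and pe_index follow the interface proposed later in this message.

```c
#include <assert.h>

/* Hypothetical lattice and machine sizes, chosen only for illustration. */
enum {
    NX = 4, NY = 4, NZ = 4, NT = 8,           /* global lattice dimensions   */
    NPE = 4,                                  /* number of processors        */
    T_PER_PE = NT / NPE,                      /* time slices per processor   */
    SITES_PER_PE = NX * NY * NZ * T_PER_PE    /* local volume per processor  */
};

/* Parity of a site: 0 for even, 1 for odd (hypothetical helper). */
int site_parity(int x, int y, int z, int t)
{
    return (x + y + z + t) % 2;
}

/* Rank of the processor that owns site (x,y,z,t).
 * Each processor holds a contiguous block of T_PER_PE time slices. */
int pe_number(int x, int y, int z, int t)
{
    (void)x; (void)y; (void)z;                /* only t matters in this layout */
    assert(t >= 0 && t < NT);
    return t / T_PER_PE;
}

/* Linearized index of site (x,y,z,t) in the memory of its owner.
 * Even sites occupy [0, SITES_PER_PE/2), odd sites the remainder.
 * Assumes NX is even, so lexicographic neighbors 2m and 2m+1 differ
 * only in x and therefore always have opposite parity. */
int pe_index(int x, int y, int z, int t)
{
    int tl, lex;
    assert(NX % 2 == 0);
    tl  = t % T_PER_PE;                       /* time coordinate within the block  */
    lex = x + NX * (y + NY * (z + NZ * tl));  /* lexicographic local index         */
    return lex / 2 + site_parity(x, y, z, t) * (SITES_PER_PE / 2);
}
```

With this layout, a loop over the even sites on one processor is just a loop over indices 0 to SITES_PER_PE/2 - 1. A different layout routine with the same two function signatures could be swapped in at compile time.
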
In all of our current layout choices, MILC adopts the policy that even sites appear together, so iterating over even sites can be done by merely specifying a range of site indices.

The layout routines could be architecture-dependent in the sense that only some of them would work on the QCDOC and others would work on other machines. But the user would merely plug in the one appropriate to the architecture when compiling. No other changes should be necessary.

What we need from the layout routine are the following functions and arrays:

  function int pe_number(x,y,z,t)
    gives the rank of the processor element to which the lattice site x,y,z,t is assigned

  function int pe_index(x,y,z,t)
    gives the linearized index of the lattice site x,y,z,t in memory

  arrays int x[k], y[k], z[k], t[k]
    invert the pe_index map

(2) Creating a lattice field: We provide a routine that creates a lattice field according to the built-in layout. This routine allocates the needed space and (perhaps) sets a flag that defines the layout. One can imagine a simple class or structure that has at least three members: a pointer to the base allocated address, a stride appropriate to the data type, and the layout flag.

(3) Defining subsets of sites: We need a way to specify subsets of the lattice on which we do Level 2 operations. Clearly, we will have predefined subsets even, odd, and global. But users may wish to define other subsets, such as one color of a 32-color checkerboard, or one of the various time slices. So we provide a function call that defines the subset. Input to the subset-creation function is a Boolean function of (x,y,z,t) plus parameter(s) that returns true if that site is in the subset. Output would be a specification of how the subset is to be traversed in a local linear algebra operation: either a range of indices (kmin, kmax, kstride), or a list of indices (k1, k2,...)
and their number, and (for convenience) a Boolean array subset[k] that specifies whether the site x[k], y[k], z[k], t[k] is in the subset.

If this is too general, we could fall back on providing only the even and odd types with the traversal specification defined in the layout call (1). Then anyone who wants a different subset would have to revert to Level 1.5 and create their own traversal specification.

(4) Site-local lattice-wide linear algebra: The Level 2 site-local linear algebra routines operate only on the objects created in (2) and subsets defined in (3). The Level 2 routine would then call Level 1.5 LA routines of the type proposed by Robert to complete the task.

(5) Nonlocal operations: We limit Level 2 nonlocal operations to two types: simple nearest-neighbor shift and parallel transport. The latter would overlap communication and computation. These operations also take only operands defined in (2) with subsets defined in (3). The result of a simple shift or a parallel transport would also be a lattice operand as defined in (2). In implementation, they require the functions and arrays produced by the layout call in (1).

Comments, criticism?

Regards,
Carleton
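
The subset-creation call described in (3) could be sketched in C roughly as follows. This is a minimal illustration under assumptions, not a proposed final interface: the type names subset_t and site_predicate, the functions make_layout and make_subset, the predicates is_even and on_time_slice, the array names site_x and so on (standing in for x[k], y[k], z[k], t[k]), and the single-processor even-sites-first layout are all hypothetical.

```c
#include <assert.h>

/* Hypothetical single-processor lattice, for illustration only. */
enum { NX = 4, NY = 4, NZ = 4, NT = 4, VOL = NX * NY * NZ * NT };

/* Inverse of the pe_index map: coordinates of the site stored at index k. */
static int site_x[VOL], site_y[VOL], site_z[VOL], site_t[VOL];

/* Assumed layout: even sites first, then odd sites (requires NX even). */
int pe_index(int x, int y, int z, int t)
{
    int lex = x + NX * (y + NY * (z + NZ * t));
    return lex / 2 + ((x + y + z + t) % 2) * (VOL / 2);
}

/* Layout call made once at startup: fills the inverse arrays. */
void make_layout(void)
{
    int x, y, z, t, k;
    for (t = 0; t < NT; t++)
        for (z = 0; z < NZ; z++)
            for (y = 0; y < NY; y++)
                for (x = 0; x < NX; x++) {
                    k = pe_index(x, y, z, t);
                    site_x[k] = x; site_y[k] = y; site_z[k] = z; site_t[k] = t;
                }
}

/* Boolean function of (x,y,z,t) plus one parameter. */
typedef int (*site_predicate)(int x, int y, int z, int t, int param);

typedef struct {
    int is_range;               /* 1: traverse kmin..kmax by kstride; 0: use list */
    int kmin, kmax, kstride;    /* meaningful only when is_range is 1             */
    int count;                  /* number of sites in the subset                  */
    int list[VOL];              /* indices k1, k2, ... in increasing order        */
    unsigned char member[VOL];  /* member[k] is 1 if site k is in the subset      */
} subset_t;

/* Build the traversal specification for the sites selected by in_subset. */
void make_subset(subset_t *s, site_predicate in_subset, int param)
{
    int k, i;
    assert(s != 0 && in_subset != 0);
    s->count = 0;
    for (k = 0; k < VOL; k++) {
        s->member[k] = in_subset(site_x[k], site_y[k], site_z[k], site_t[k], param) ? 1 : 0;
        if (s->member[k])
            s->list[s->count++] = k;
    }
    /* The subset is a range if its indices are equally spaced. */
    s->is_range = (s->count > 0);
    s->kstride = (s->count > 1) ? s->list[1] - s->list[0] : 1;
    for (i = 1; i < s->count; i++)
        if (s->list[i] - s->list[i - 1] != s->kstride)
            s->is_range = 0;
    s->kmin = (s->count > 0) ? s->list[0] : 0;
    s->kmax = (s->count > 0) ? s->list[s->count - 1] : -1;
}

/* Example predicates. */
int is_even(int x, int y, int z, int t, int param)
{
    (void)param;
    return (x + y + z + t) % 2 == 0;
}

int on_time_slice(int x, int y, int z, int t, int param)
{
    (void)x; (void)y; (void)z;
    return t == param;
}
```

A site-local routine would then loop from kmin to kmax in steps of kstride when is_range is set, and over list[0..count-1] otherwise. With the layout assumed here, the even subset comes out as a contiguous range, while a single time slice does not and falls back to the index list.
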