Software Coordinating Committee Workshop
Jefferson Lab
February 2, 2002 11:30 AM EST
Recorder: C. DeTar

Present: Rich Brower, Jie Chen (JLab), Michael Creutz, Carleton DeTar, Robert Edwards, Chulwoo Jun (Columbia), Steven Gottlieb (Indiana), Eric Gregory (Arizona), Don Holmgren, Bob Mawhinney, Carlos Mendes, David Richards (Old Dominion U), James Osborn (Utah), Andrew Pochinski, James Simone, Chip Watson

** Action items

1. Level 2 QDP with MILC implementation

James Osborn presented his version of Pochinski's "vertical slice" with a MILC implementation of the QDP calls, illustrating lazy shifts and lattice field declarations. There was discussion of the use of pointers in the implementation to express the result of a gather.

** Edwards requested a follow-up illustration of Pochinski's example, but with even-odd preconditioning.

** Mawhinney recommended that, in view of the state information carried by the fields, further testing be done to make sure the implementation behaves correctly in all cases.

Break for lunch 12:30 - 2:00

2. Level 1 MP-API

Chip Watson presented final revisions of the Level 1 MP-API specification, which was posted immediately prior to the meeting:

http://www.jlab.org/~watson/lqcd/MessageAPI.htm

The following items provoked more extended discussion and led to agreements:

a. Should we require all lattice initialization implementations to call "declare_logical_topology", with its inherent "logical" grid view of the machine? Doing so would make it slightly easier to set up communications on a grid-based machine. And in the present implementation, it is the only way a node can finally be told its rank. The worry was whether in so doing we would be imposing an unnecessarily rigid view on a switch-based machine. Once it was understood that the view would have implications only for relative sends and would impose no restrictions on send-to operations, we decided to enforce the requirement.

b.
Should we drop the "declare_send_map" feature? As long as it was not needed for QCDOC to do efficient Manhattan-style routing for non-adjacent communications, we agreed to drop it.

c. In "send_relative" communication, should we label the directions explicitly as x, y, z, t, with the meaning of lattice directions? The present design allows for a remapping of the convention-neutral directions 0, 1, 2, 3. The Columbia group wanted to make it explicit, since the machine allocation process already provides enough flexibility to set the convention. We agreed that the protocol would be revised so that the interface inquires about the grid, attempts to lay out the lattice without further permutation of the lattice coordinates, and exits if it fails.

d. Should we include a "sum_double_extended" to allow checking for round-off errors in global sums? We agreed to do so, but for now to require implementations merely to guarantee that the precision would not be worse than "sum_double", to allow for cases in which the locale did not support a native precision extended beyond "double".

3. Level 2 QDP Design

DeTar presented his view of the QDP Design, which follows previous proposals by Edwards. This design was the basis for Osborn's concrete example. There seemed to be general agreement about the outline, but disagreement about some of the specifics. Here are the issues that remain to be resolved and the action needed to help resolve them:

a. In place of the next-neighbor shifts, should we allow a MILC-style general permutation map of sites onto sites? The main objection was that we might be tempting QCDOC users to write terribly inefficient code, since only operations derived from next-neighbor shifts would work efficiently. Since not allowing such maps would rule out the possibility of writing an FFT with QDP calls, the suggestion was that all sites should then be required to provide a native hand-optimized FFT.
In support of generality, it was argued that we should not impose unnecessary restrictions on our ability to invent new actions and observables and implement them with QDP calls. FFTs may not be the only place where we will need such a capability.

** For now, DeTar and Osborn will flesh out the idea further. A possible resolution would be to include general user-defined maps, with next-neighbor shifts automatically included in the repertory and fixed displacement maps created by special calls, so that they could be set up efficiently on the QCDOC, while still leaving the user the option of general permutations, with suitable caveats.

b. Do we expect to be able to write almost all of our code (i.e. except for the most rate-limiting steps) with QDP calls? There was strong agreement about this.

** Columbia will think about the implications for CPS.

c. Should we require a canonical layout as opposed to arbitrary user-defined layouts in the MILC style? There seems to be no reason not to have all layouts carve up the lattice on 1000 surfaces, to use crystallographic notation. So the question is how sites in the sublattices are to be ordered. The choice here is between flexibility and our ability to write optimized code to implement Level 2.

If we enforce layouts that put sites in array subscript order, e.g. [x][y][z][t], then an on-node displacement map is simply an offset and requires no reindexing. Thus a parallel transport operation could be coded efficiently. However, operations on subsets, such as only even sites, would almost always require a mask or reindexing or both to traverse the sites in this layout.

If we enforce layouts that put even sites first, then subsets are easily traversed, but the result of a displacement map must be described by pointers or an index array.
On the other hand, if we allow an arbitrary ordering of sites, even though some orderings would be strongly recommended, we would be tempting the user to write inefficient code, and the implementation would always have to check the layout scheme to see whether an operation could be performed efficiently or had to be done by a less efficient alternative scheme.

There are a number of ways we can go here:

(1) Enforce a canonical layout across all platforms and publish it to the user, so a user could make assumptions about where x,y,z,t is.

(2) Enforce a canonical layout, but hide it from the user, so the user would always have to go through an index function index(x,y,z,t) to locate data for a site.

(3) Allow arbitrary site ordering. In this case the implementations would also have to go through an index function index(x,y,z,t) to locate data for a site, unless there was a way to detect that the user had chosen one of the canonical orderings.

(2') As a variant on (2) we could say that each implementation could declare a canonical layout, but we would not require it to be the same across all implementations. Thus portability would be achieved if the user wrote code that operated only via QDP calls or at Level 1 through the index(x,y,z,t) function.

We did not resolve this issue in the meeting for want of more data, as follows:

** (a) DeTar will consult with MILC and FNAL to see if there is agreement about how much flexibility is really needed.

** (b) DeTar and Osborn will consider carrying out performance testing with a mock parallel transport operation to see how much masks and reindexing degrade performance.

** (c) Edwards and Mawhinney will think about the implications of (1), (2), (3) and, I would propose, also (2') for CPS and SZIN.

d. Should there be special map (shift) calls that make clear what the user intends to do with the result?
The issue here is that some implementations may choose to make the result of a shift a list of indices or pointers to values in the source field, rather than a copy of these values. Thus a shift from field A with destination field B would result in B pointing into A for on-node data. The syntax of the shift would suggest to the user that B contained the resulting values. The problem then arises: what happens when the source field A is subsequently changed? B would then no longer hold the result of the previous shift.

The simple solution would require that the QDP interface convert pointers to values in B before changing A, so the user would not be misled. However, the user may not care, in which case an unnecessary copy would have been done. We could avoid this if we provided a separate function call, say "shift_volatile", which would mean that if the user ever modified the source later on, the values in B would no longer be needed. When A was changed, the implementation would simply mark the field B as "dirty". If a user later tried to use B as an rvalue without first refreshing B, a run-time fault would occur.

Conference concluded at 8:30 PM EST.
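
Illustration for item 3d. The "shift_volatile" proposal can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the QDP interface: the struct layout, the one-dimensional periodic ring of NSITES sites, and the names field_init, field_set, field_get and shift_copy are all hypothetical, and only the name shift_volatile comes from the discussion. The run-time fault is represented here by an error return of -1 so that the behavior can be checked.

```c
#include <assert.h>
#include <stddef.h>

#define NSITES 8   /* hypothetical on-node volume: a 1-D periodic ring */

typedef struct field {
    double value[NSITES];        /* storage owned by this field */
    const double *ref[NSITES];   /* per-site pointers into a source field */
    struct field *source;        /* field this one points into, or NULL */
    struct field *dependent;     /* one field pointing into this one, or NULL */
    int dirty;                   /* set when the source changed after the shift */
} field;

void field_init(field *f)
{
    for (int i = 0; i < NSITES; i++) {
        f->value[i] = 0.0;
        f->ref[i] = NULL;
    }
    f->source = NULL;
    f->dependent = NULL;
    f->dirty = 0;
}

/* Map any integer site number onto the ring 0 .. NSITES-1. */
static int wrap(int site) { return ((site % NSITES) + NSITES) % NSITES; }

/* The values owned by f are about to change: mark its dependent dirty. */
static void touch(field *f)
{
    if (f->dependent != NULL)
        f->dependent->dirty = 1;
}

/* Break the link between a destination field and its former source. */
static void detach(field *dst)
{
    if (dst->source != NULL && dst->source->dependent == dst)
        dst->source->dependent = NULL;
    dst->source = NULL;
    dst->dirty = 0;
}

/* Write one site. Sketch restriction: f must own its values. */
void field_set(field *f, int site, double v)
{
    assert(site >= 0 && site < NSITES);
    assert(f->source == NULL);
    f->value[site] = v;
    touch(f);
}

/* Ordinary shift: dst[i] = src[i + disp], stored as copied values. */
void shift_copy(field *dst, const field *src, int disp)
{
    assert(dst != src && src->source == NULL);
    touch(dst);
    detach(dst);
    for (int i = 0; i < NSITES; i++)
        dst->value[i] = src->value[wrap(i + disp)];
}

/* Volatile shift: dst only points into src; no values are copied.
   Sketch restriction: a source tracks at most one dependent field. */
void shift_volatile(field *dst, field *src, int disp)
{
    assert(dst != src && src->source == NULL);
    assert(src->dependent == NULL || src->dependent == dst);
    touch(dst);
    detach(dst);
    for (int i = 0; i < NSITES; i++)
        dst->ref[i] = &src->value[wrap(i + disp)];
    dst->source = src;
    src->dependent = dst;
}

/* Read one site as an rvalue. Returns 0 on success, or -1 (standing in
   for the run-time fault) if the field is dirty. */
int field_get(const field *f, int site, double *out)
{
    assert(site >= 0 && site < NSITES);
    if (f->dirty)
        return -1;
    *out = (f->source != NULL) ? *f->ref[site] : f->value[site];
    return 0;
}
```

In this sketch, after shift_volatile(&B, &A, 1) a call to field_get on B reads through the pointers into A. Once field_set modifies A, field_get on B returns -1 until B is refreshed by another shift, whereas a field filled by shift_copy keeps its copied values.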