Subject: Yet Another Next balloon

Hi Robert and Rich,

I tried to flesh out a bit more the ideas we have been discussing.

Regards,
Carleton

----------------------------------------------------------------------

Sketch of Level 2 API Function Calls and Data Types

(0) Preliminary comments

We have said we need a better-sounding prefix, but I use QAP
provisionally. The names I have chosen are likewise subject to
improvement.

I haven't worked out thoroughly the details for the asynchronous
communication scheme, but tried to get started on it.

It is convenient in the MILC implementation to have the result of
message passing be a list of pointers to on-node data and data in a
communication buffer. Subsequent lattice-wide arithmetic operations
may then be done through pointers. I have tried to incorporate this
flexibility in the definition of the lattice field data type.

(1) Layout routine

This is a plug-in at compilation time. The user chooses from an
assortment of source codes. The choice may depend on architecture and
local implementation of message passing. All choices have the same
names for entry points, so no changes should be required for the rest
of the code when switching plug-ins.

  void QAP_define_layout()

  Input information needed - either via arguments or external function call

    lattice nx,ny,nz,nt
    SMP flag
    number of nodes and processors per node
    rank of the current node or processor

  Output: via additional public global entry points:

    int QAP_pe_rank(x,y,z,t)

      gives a rank for the processor element to which the lattice
      site x,y,z,t is assigned. In SMP operation, the SMP node
      number is returned.

    int QAP_pe_index(x,y,z,t)

      gives the linearized index for the lattice site x,y,z,t in
      memory assigned to a processor element. In SMP operation, the
      index runs over shared memory locations.
    global variables:

      int sites_per_pe, even_sites_per_pe, odd_sites_per_pe

  We could also require a definition (see below) of

    QAP_lattice_subset even, odd, even_and_odd

  for specifying even, odd, and global subsets of the lattice.

(2) Constructors and destructors for a lattice field

see Pochinsky's QCDD_init_XX and QCDD_fini_XX

  QAP_lattice_XX * QAP_alloc_lattice_val_XX()
  QAP_lattice_XX * QAP_alloc_lattice_ptr_XX()

  Where XX is the data type: SU(3) vector, SU(3) matrix, Wilson
  vector, etc.

  Input: global sites_per_pe

  Members are

    int size               (sizeof data type - needed only if we decide
                           to pass void pointers for these structures.)
    QAP_lattice_XX * val   (base address of the data)
    QAP_lattice_XX ** ptr  (base address of list of pointers to data).

  Exactly one of val and ptr is non-null; the other must be null.
  Which one is allocated depends on the call.

  Other members are needed to implement asynchronous communication.
  They specify how many messages involving this data are pending and
  give a list of message tags for those pending messages.

  void QAP_free_lattice_XX(QAP_lattice_XX *field)

  Frees memory. Forces all pending messages involving "field" to be
  completed first.

(3) Defining subsets of lattice sites

  QAP_lattice_subset * QAP_make_lattice_subset(
      bool (*func)(int x, int y, int z, int t) )

  Input: bool func(x,y,z,t) returns true if x,y,z,t is in the subset.

  Output: Structure QAP_lattice_subset has members

    bool *mask             NULL if using min, max, stride.
                           Otherwise a pointer to a mask array.
    int min, max, stride   Defined only if use_mask = false,
                           i.e. when mask is NULL.

  void QAP_free_lattice_subset(QAP_lattice_subset *)

  Frees memory.

(4) Site-local lattice-wide linear algebra

Names follow the pattern of the corresponding Level 1 routines, e.g.

  void QAP_lattice_thr_T1T2T3op3( QAP_lattice_subset *,
                                  QAP_lattice_T1 *,
                                  QAP_lattice_T2 *,
                                  QAP_lattice_T3 * )

I have in mind that these functions would either operate on explicit
data or through pointers to data, at the user's choice. This
flexibility would require a test to see which convention applies to
each argument.
These routines would also have to complete all pending communication
before using an operand. Actually, we can allow sends to remain
pending for all rhs values.

(5) Communications

  void QAP_shift_XX( QAP_lattice_XX *dest, QAP_lattice_XX *src,
                     QAP_lattice_subset *subset, int direction )

I would prefer generalizing to a shift over any permutation map of
sites onto sites. However, we have discussed only lattice
next-neighbor directions.

Both dest and src must be allocated prior to the call. We require
dest != src.

A number of other consistency checks are needed. For example, if src
is already involved in a pending message and is a dest for that
message, the previous message must complete. However, a field can be
a source for multiple messages. If dest has messages pending, those
messages must complete first.

The ultimate result of the shift is either a list of pointers or the
data itself, depending on whether QAP_lattice_XX is pointer-type or
value-type.

What actually happens is that QAP_shift_XX starts communication, but
does not complete it. Instead the communication information is stored
in the dest and src structures so later linear algebra calls can wait
for completion.

  void QAP_complete_XX( QAP_lattice_XX *field )

Force completion of pending messages for "field". Needed only if the
user is mixing Level 2 and Level 1 calls, or otherwise breaking the
rules of Level 2. Results in checking and updating completion status
for all source-destination pairs in which field is involved.

(6) Parallel transport

We decided to implement this with a pair of primitives, namely, a
next-neighbor shift followed by a multiply, or vice versa, depending
on direction.

(7) Message tags

The message tags stored in the field structures must encode
information about the source and destination fields, so that when we
force completion of messages for a field, the pending message status
for both source and destination fields can be updated.
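
Illustration for (2)-(4): the test "to see which convention applies to
each argument" may be easier to judge with something concrete in hand.
The following is only a minimal sketch under several assumptions, not
a proposed implementation. QAP_real (a plain double standing in for
the XX site type), the n_pending counter, the helpers alloc_field,
site and in_subset, and the routine name QAP_lattice_add_real are all
hypothetical. It reads val and ptr as pointing at the site data type,
takes max as exclusive, and does not model messages at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef double QAP_real;   /* hypothetical site datum standing in for XX */

int sites_per_pe = 4;      /* global that the layout routine would set */

/* Lattice field: exactly one of val and ptr is non-NULL. */
typedef struct {
    size_t size;           /* sizeof the site datum */
    QAP_real *val;         /* value-type: base address of the data */
    QAP_real **ptr;        /* pointer-type: list of pointers to data */
    int n_pending;         /* assumed: count of pending messages */
} QAP_lattice_real;

/* Subset: either a mask array or a (min, max, stride) range. */
typedef struct {
    bool *mask;            /* NULL if using min, max, stride */
    int min, max, stride;  /* used only when mask is NULL; max exclusive */
} QAP_lattice_subset;

/* Shared allocator; which member is allocated depends on the call. */
static QAP_lattice_real *alloc_field(bool by_value)
{
    QAP_lattice_real *f = calloc(1, sizeof *f);
    if (f == NULL) return NULL;
    f->size = sizeof(QAP_real);
    if (by_value)
        f->val = calloc((size_t)sites_per_pe, sizeof *f->val);
    else
        f->ptr = calloc((size_t)sites_per_pe, sizeof *f->ptr);
    if (f->val == NULL && f->ptr == NULL) { free(f); return NULL; }
    return f;
}

QAP_lattice_real *QAP_alloc_lattice_val_real(void) { return alloc_field(true); }
QAP_lattice_real *QAP_alloc_lattice_ptr_real(void) { return alloc_field(false); }

void QAP_free_lattice_real(QAP_lattice_real *f)
{
    if (f == NULL) return;
    /* A full version would first complete pending messages for f. */
    free(f->val);
    free(f->ptr);          /* frees the pointer list, not the targets */
    free(f);
}

/* The per-argument test: fetch site i by value or through a pointer. */
static QAP_real *site(const QAP_lattice_real *f, int i)
{
    assert((f->val == NULL) != (f->ptr == NULL));
    return f->val ? &f->val[i] : f->ptr[i];
}

static bool in_subset(const QAP_lattice_subset *s, int i)
{
    if (s->mask != NULL) return s->mask[i];
    return i >= s->min && i < s->max && (i - s->min) % s->stride == 0;
}

/* dest = a + b on every site of the subset, for any mix of field kinds. */
void QAP_lattice_add_real(const QAP_lattice_subset *s, QAP_lattice_real *dest,
                          const QAP_lattice_real *a, const QAP_lattice_real *b)
{
    for (int i = 0; i < sites_per_pe; i++)
        if (in_subset(s, i))
            *site(dest, i) = *site(a, i) + *site(b, i);
}
```

A caller would allocate dest and a with QAP_alloc_lattice_val_real(),
allocate b with QAP_alloc_lattice_ptr_real() and point its entries at
on-node data or a communication buffer, then call QAP_lattice_add_real
with a subset such as { NULL, 0, sites_per_pe, 2 }. The cost of the
flexibility is one branch per operand per site in site(); hoisting that
test out of the loop would mean one loop variant per combination of
argument kinds.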