Software Coordinating Committee Workshop
Jefferson Lab
November 8-9, 2001
9:00 AM EST
Recorder: C. DeTar

Present Nov 8 and 9: Rich Brower, Jie Chen (JLab), Robert Edwards, Chulwoo Jun (Columbia), Steven Gottlieb (Indiana), Don Holmgren, Carlos Mendez, Massimo Di Pierro (FNAL), Bob Mawhinney, Chris Miller (Columbia), James Osborn (Utah), Andrew Pochinski, Chip Watson, Carleton DeTar

Present Nov 9: David Richards (Old Dominion U)

Nov 8

1. Brower: Outline of Agenda

2. Watson: Toward Finalizing the MP-API

   Points of discussion and agreement:

   (1) SMP flag in initMachine
       Agree it is necessary.

   (2) Polymorphism (send relative or send to flag) in message handle
       Agree that this does not add significant latency and is needed for portability.

   (3) Need for send relative
       Columbia finds this scheme important. Edwards finds it useful for portability between grid and switch-based machines. MILC would not use it. Agree that if enough of the community wants it, it should be implemented.

   (4) Need for declare topology
       While Columbia will take this to be a null operation and MILC will not use it, it is necessary for send relative.

   (5) Start send and start receive
       Columbia prefers a single start transfer call -- see below.

   (6) Naming conventions
       Prefixes:
         QMP for message passing, Level 1
         QLA for linear algebra, Level 1
         QAP ?? for Level 2, Level 3
       C binding:
         functions: all lower case, separated by underscores, as in QMP_start_transfer
         macros (constant values): all upper case
         macros that expand as functions: same as functions
         structures: same as functions
       C++:
         same prefixes as C, but use namespace
         mixed case for rest of name with no underscores
         leading lower case for functions, leading upper for class names

   (7) System parameters
       Access with QMP_get_XYZ(). Set with bool QMP_set_XYZ(), which returns false if it fails.

   (8) Buffer names and message IDs
       In QCDOC and QCDSP communication starts by opening a channel between a definite pair of nodes. All messages are sent and received in sequence.
       This scheme appears to be satisfactory for QCD. Thus it is not necessary to use buffer names and message IDs in the API. For an MPI implementation, one could attach a message ID that is incremented in sequence. May need to add MPI-style calls for cases where wild cards are needed. Add them later according to demand.

2. DeTar, Mawhinney: MILC on QCDOC simulator

   Toussaint is still trying to get conjugate gradient to run to completion. New version (2.3) of chip design may solve some problems.

3. Holmgren, Edwards, Gottlieb, Di Pierro: P4 Optimization

   Holmgren described performance of 10 critical MILC subroutines with SSE enhancements. Coded in NASM. Single site calls.

   Edwards described a different SSE strategy by high school student Chris McClendon. Showed benchmarks with parallel-style single site Dslash full lattice calls. Edwards to post these figures on web page.

   Gottlieb discussed MILC Dslash communications limitations. Showed Platinum cluster numbers.

   Di Pierro (see below)

4. Chulwoo Jun: Changes in CPS toward MP-API

   Nearest neighbor is highest priority for QCDOC CPS development.

5. DeTar, Mawhinney: Immediate goals of CPS implementation

   Most pressing need for MILC on QCDOC is to get MILC code data on the simulator with a Level 1 port. Assuming all goes well, after processors become available next summer, the next crucial need is completing the API so point-to-point communications are supported. Would like a timeline for this.

6. Di Pierro: MDP project

   Started prior to SCIDAC. Continues to develop this powerful data parallel scheme. Showed benchmarks for SSE-implementing code with 50% performance hit over Luescher values. Could take advantage of SCIDAC Level 1 and linear algebra package as it develops.

7. Mendes: Performance tools

   SvPABLO running over PAPI provides a very user-friendly GUI for doing performance analysis. No "perfctr" libraries for PAPI for P4 yet, however. PABLO SDDF (self-defining data format) could be used to process SvPABLO output.

Nov 9

8. Edwards: Level 2 API v 0.1

   Combines shifts and multiplies.

   Discussion: Can we agree on a standard layout?
   Mawhinney: This includes layout of SU(3) data types as well as site order.

9. Pochinski: Conjugate gradient code with simulated Level 2 calls

   Discussion:
   DeTar: Can we push this to see if it breaks? What about Schroedinger functional inversions that fix data on the end time slices? What about Symanzik gauge heatbath updating that requires 32-level checkerboarding? What if we do LU preconditioning?
   Brower: What if we put a minimal set of operations in Level 2 and adopt a Linux/GNU model, accepting user-contributed extensions as long as they are portable?

10. Holmgren: Performance counter tools

    "trace" runs on Linux 2.0, 2.2, 2.4 and gives processor statistics. Needs a kernel patch to access P4 performance meters. Multinode version is running. Graph of times for internode message passing vs times for computation in Asqtad dslash shows that computation is much faster than communication - not much hiding possible.

    Mawhinney: This looks like a great way to get information about latency in message passing: software, hardware.

11. Richards: Port of UKQCD Peter Boyle code to JLab cluster with gm/MPI

Goals for 2nd Quarter: (Oct 15 - Jan 15)

1. MP API
   a. QMP API C, C++ defined, documented
   b. QMP (C) over MPI finished
   c. QMP (C) over GM finished for Myrinet clusters
   d. QMP (C++ and C) nearest neighbor on QCDOC and API sufficiently analyzed for high performance implications

2. Level 2 (and Level 1 QLA)
   Repertoire of functions and naming conventions. Level 1 & 2 kernel with one data layout supported by all platforms for maximum performance.

3. MILC (Level 1 port) on QCDOC simulator
   With sufficient information that we can forecast performance.

4. P4 Optimization

Conference concluded at 12:30 PM EST.
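
Illustrative note on the MP-API conventions: the C naming rules and the get/set pattern for system parameters agreed in the Watson discussion (points 6 and 7) might look roughly like the sketch below. This is not the agreed API. The parameter name "node_count", the constant QMP_MAX_NODES, and the structure QMP_system_params are hypothetical stand-ins for the "XYZ" placeholder in the minutes, and the in-memory storage is assumed only to make the example compile.

```c
#include <stdbool.h>

/* Constant-valued macro: prefix plus all upper case (hypothetical name). */
#define QMP_MAX_NODES 4096

/* Structure: named like a function, lower case with underscores (hypothetical). */
typedef struct QMP_system_params {
    int node_count;
} QMP_system_params;

/* Assumed in-memory store standing in for real machine state. */
static QMP_system_params qmp_params = { 1 };

/* Access a system parameter: QMP_get_XYZ() with XYZ = node_count. */
int QMP_get_node_count(void)
{
    return qmp_params.node_count;
}

/* Set a system parameter: bool QMP_set_XYZ(); returns false if it fails. */
bool QMP_set_node_count(int n)
{
    if (n < 1 || n > QMP_MAX_NODES) {
        return false;   /* rejected: value left unchanged */
    }
    qmp_params.node_count = n;
    return true;
}
```

A caller would test the return value of QMP_set_node_count() and fall back to the current QMP_get_node_count() value if the request is refused.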