FNAL PLANS (Don)

Personnel:

    Don Holmgren, ~50%
    Ron Rechenmacher, ~50%
    Simon Epsteyn, ~25%, but hoping to go to 50%
    Jim Simone, ~25%
    Paul Mackenzie, ?
    Takashi Matsumura, ~75%
    Expected new hire, July 1, 100%

Takashi just arrived from Okinawa after finishing his PhD in Computer Science. His thesis examined genetic algorithms on clusters, and he's experienced with MPI, PVM, and Linux. He's "free" (no salary cost to FNAL) for 1 year, and will be shared with CDF.

The likely new hire has to decide between the lattice QCD work and work for D0, but I know he's leaning heavily towards us.

Projects anticipated for summer:

1. Biggest project will be the procurement of new systems. Budget is still up in the air, but the Computing Division is likely to match each salary dollar in the grant with a dollar of equipment money. So, I expect to spend between $500K and $650K, translating into approximately 150 dual Pentium 4 systems with Myrinet. The work of the project:

    - evaluate dual P4 performance as soon as possible
    - evaluate dual Athlon performance as soon as possible
    - issue RFP. Anticipate same hardware agnosticism as last year, but figure of merit will include single as well as multiple node performance (need to pick a MILC benchmark for the latter; any suggestions? Will probably go with ks_3flav).
    - evaluate and let contract. Install cluster (September).

2. Optimized x86 kernels: Need to integrate SSE-based kernels into MILC code. Also, we need to address issues of precision (SSE uses 32-bit floats, non-SSE uses 64- or 80-bit intermediate floats) and floating point exceptions. Will also code kernels for Athlon using 3DNow!.

3. Measure impact of latency on MILC. Gigabit ethernet is still more expensive per node because of switch costs, but this may eventually change. However, Myrinet's 10 usec latency is still much better than the anticipated 25 - 35 usec latency on gigabit ethernet. We will add variable latency to the Myrinet driver and use this to measure the sensitivity of MILC codes to latency.
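
The precision concern in item 2 can be illustrated with a minimal C sketch. It is not MILC code, and the function names (sum_float32, sum_float64, abs_error) are hypothetical. It assumes only that a value forced back to 32 bits after every operation behaves like a single-precision SSE result, while a double accumulator stands in for the wider intermediates of non-SSE code.

```c
#include <assert.h>

/* Hypothetical helper: add x to itself n times, forcing every partial
 * sum to be rounded to 32 bits. The volatile qualifier keeps the
 * compiler from holding the sum in a wider register. */
static float sum_float32(float x, long n)
{
    volatile float acc = 0.0f;
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        acc = acc + x;
    }
    return acc;
}

/* Hypothetical helper: same sum with a 64-bit accumulator, standing in
 * for code that keeps wider intermediate values. */
static double sum_float64(float x, long n)
{
    double acc = 0.0;
    assert(n >= 0);
    for (long i = 0; i < n; i++) {
        acc += (double)x;
    }
    return acc;
}

/* Absolute difference between a computed value and a reference. */
static double abs_error(double got, double exact)
{
    double d = got - exact;
    return d < 0.0 ? -d : d;
}
```

Comparing sum_float32(0.1f, 10000000) and sum_float64(0.1f, 10000000) against 10000000 * (double)0.1f with abs_error shows the 32-bit accumulation drifting far more than the 64-bit one. A similar comparison on a real kernel would be one way to check whether an SSE version stays within an acceptable tolerance of the non-SSE version.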

4. I/O. Lots of issues here: does MILC need generalized parallel I/O routines? How about checkpointing: should we use parallel I/O to something like a PVFS file system, or explore redundant checkpointing involving a neighbor node? File formats?

5. Sustained manpower for cluster. FNAL is now in the business of running a production cluster. Manpower will be allocated to this, and we will continue to develop codes and procedures scalable to a 1000 node cluster.

#2 will finish by August.
#1 will likely finish by September, but may slip depending on when the motherboards we select are actually released by the manufacturers (P4s in June or July, dual Athlons could be much later).
#3 should finish by July.
#4 is very open ended.
#5 is by definition open ended.

Don Holmgren
Fermilab
djholm@fnal.gov
630-840-2745