Hi Folks,

Here is the quick look at compiler performance that I described at the
beginning of today's call. Clearly, we are going to supply assembly-coded
versions of the most critical such routines, but there are a couple hundred
of them, and I doubt we will have the patience to assembly-code them all,
so it is still desirable to make the C/C++ version as efficient as we can.

Regards,
Carleton

----------------------------------------------------------------------
Analysis of C/C++ Coding Strategies for Level 1

Comparison of compiler performance with different coding strategies
for Level 1 routines. Example chosen was SU(3) matrix times vector.

Coding strategies considered:

(1) "overload"  overloaded complex arithmetic operators

    // su3_matrix *a;  su3_vector *b, *c;
    register int i;
    for( i=0; i<3; i++ )
      c->c[i] = a->e[i][0]*b->c[0] + a->e[i][1]*b->c[1] + a->e[i][2]*b->c[2];

(2) "macro"  macro-style C

    register int i;
    register complex x,y;
    for(i=0;i<3;i++){
      x.re=x.im=0.0;
      CMUL( a->e[i][0] , b->c[0] , x );
      CMUL( a->e[i][1] , b->c[1] , y );
      CSUM( x , y );
      CMUL( a->e[i][2] , b->c[2] , y );
      CSUM( x , y );
      c->c[i] = x;
    }

(3) "handopt"  hand pre-optimized C

    register int i;
    register float t,ar,ai,br,bi,cr,ci;
    for(i=0;i<3;i++){

      ar=a->e[i][0].re; ai=a->e[i][0].im;
      br=b->c[0].re;    bi=b->c[0].im;
      cr=ar*br; t=ai*bi; cr -= t;
      ci=ar*bi; t=ai*br; ci += t;

      ar=a->e[i][1].re; ai=a->e[i][1].im;
      br=b->c[1].re;    bi=b->c[1].im;
      t=ar*br; cr += t; t=ai*bi; cr -= t;
      t=ar*bi; ci += t; t=ai*br; ci += t;

      ar=a->e[i][2].re; ai=a->e[i][2].im;
      br=b->c[2].re;    bi=b->c[2].im;
      t=ar*br; cr += t; t=ai*bi; cr -= t;
      t=ar*bi; ci += t; t=ai*br; ci += t;

      c->c[i].re=cr;
      c->c[i].im=ci;
    }

Compilers tested:

(a) SPARC g++: gcc version 2.95.3 20010315 (release)
(b) SPARC CC (SunOS 5.7)
(c) Intel g++: gcc version 2.96 20000731

Conclusion:

The hand-optimized C version gave the most efficient code in all cases.
Results are based on an instruction count, not on machine cycles:

(a) SPARC g++ -O3 did not unroll the loop in any of the examples,
    so here are the loop instruction counts:

    within loop:   overload   macro   handopt
    add                1         1        1
    addcc              1         1        1
    bpos               1         1        1
    fadds              7         7        7
    fmuls             12        12       12
    fsubs              3         3        3
    ld                14        14       12
    st                12         6        2
    ---------------------------------------------
    total             51        45       39

    So the handopt version wins.

(b) The Sun CC compiler unrolled the loop automatically.

    CC -fast -dalign -libmil -fsimple=2 -fns

    Dropping a few initialization instructions, we get

                   overload   macro   handopt
    fadds             21        21       21
    fmovs              0         9        0
    fmuls             36        36       36
    fsubs              9         9        9
    ld                53        39       36
    st                34         9        6
    ---------------------------------------------
    total            153       123      108

(c) Intel g++ -O3 did not unroll the loop, so here is the
    loop instruction count:

                   overload   macro   handopt
    addl               2         2        2
    cmpl               1         1        0
    decl               0         0        1
    faddp              5         3        7
    fadds              2         4        0
    fld                6         0        6
    flds              16        16       12
    fmul               6         0        6
    fmulp              6         0        6
    fmuls              0        12        0
    fstps              8        10        2
    fsts               2         0        0
    fsubp              3         3        2
    fsubrp             0         0        1
    fxch              14         6       13
    incl               1         1        0
    jle                1         1        0
    jns                0         0        1
    movl              12        12        0
    popl               4         4        4
    -----------------------------------------------
    total             90        77       64

----------------------------------------------------------------------
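For anyone who wants to rerun the comparison with another compiler, here is a minimal self-contained sketch of the three strategies. The struct layouts, the overloaded operators, and the CMUL/CSUM definitions are assumptions made for illustration (only the field names re, im, e, c come from the listings above); the complex type is called cplx here to avoid clashing with standard headers, and the helper names mult_overload, mult_macro, mult_handopt, max_abs_diff, and check_agreement are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed layouts: field names follow the listings, definitions are guesses.
struct cplx { float re, im; };
struct su3_matrix { cplx e[3][3]; };
struct su3_vector { cplx c[3]; };

// Assumed overloaded operators for strategy (1).
inline cplx operator+(cplx x, cplx y) { return cplx{x.re + y.re, x.im + y.im}; }
inline cplx operator*(cplx x, cplx y) {
    return cplx{x.re * y.re - x.im * y.im, x.re * y.im + x.im * y.re};
}

// Assumed macro semantics for strategy (2): CMUL sets c = a*b, CSUM does a += b.
#define CMUL(a, b, c) { (c).re = (a).re * (b).re - (a).im * (b).im; \
                        (c).im = (a).re * (b).im + (a).im * (b).re; }
#define CSUM(a, b) { (a).re += (b).re; (a).im += (b).im; }

// (1) "overload"
void mult_overload(const su3_matrix* a, const su3_vector* b, su3_vector* c) {
    for (int i = 0; i < 3; i++)
        c->c[i] = a->e[i][0] * b->c[0] + a->e[i][1] * b->c[1] + a->e[i][2] * b->c[2];
}

// (2) "macro"
void mult_macro(const su3_matrix* a, const su3_vector* b, su3_vector* c) {
    cplx x, y;
    for (int i = 0; i < 3; i++) {
        x.re = x.im = 0.0f;
        CMUL(a->e[i][0], b->c[0], x);
        CMUL(a->e[i][1], b->c[1], y);
        CSUM(x, y);
        CMUL(a->e[i][2], b->c[2], y);
        CSUM(x, y);
        c->c[i] = x;
    }
}

// (3) "handopt": every product goes through an explicit scalar temporary.
void mult_handopt(const su3_matrix* a, const su3_vector* b, su3_vector* c) {
    float t, ar, ai, br, bi, cr, ci;
    for (int i = 0; i < 3; i++) {
        ar = a->e[i][0].re; ai = a->e[i][0].im;
        br = b->c[0].re;    bi = b->c[0].im;
        cr = ar * br; t = ai * bi; cr -= t;
        ci = ar * bi; t = ai * br; ci += t;

        ar = a->e[i][1].re; ai = a->e[i][1].im;
        br = b->c[1].re;    bi = b->c[1].im;
        t = ar * br; cr += t; t = ai * bi; cr -= t;
        t = ar * bi; ci += t; t = ai * br; ci += t;

        ar = a->e[i][2].re; ai = a->e[i][2].im;
        br = b->c[2].re;    bi = b->c[2].im;
        t = ar * br; cr += t; t = ai * bi; cr -= t;
        t = ar * bi; ci += t; t = ai * br; ci += t;

        c->c[i].re = cr;
        c->c[i].im = ci;
    }
}

// Largest componentwise difference between two vectors.
float max_abs_diff(const su3_vector& u, const su3_vector& v) {
    float d = 0.0f;
    for (int i = 0; i < 3; i++) {
        d = std::fmax(d, std::fabs(u.c[i].re - v.c[i].re));
        d = std::fmax(d, std::fabs(u.c[i].im - v.c[i].im));
    }
    return d;
}

// Asserts that all three strategies produce the same product a*b.
void check_agreement(const su3_matrix& a, const su3_vector& b) {
    su3_vector c1, c2, c3;
    mult_overload(&a, &b, &c1);
    mult_macro(&a, &b, &c2);
    mult_handopt(&a, &b, &c3);
    assert(max_abs_diff(c1, c2) < 1e-4f);
    assert(max_abs_diff(c1, c3) < 1e-4f);
}
```

Calling check_agreement from a small main with any test matrix and vector confirms the three versions agree before their generated code is compared. Compiling with g++ -O3 -S writes the assembly to a .s file, which is one way to inspect the loop bodies; the report above does not say how its counts were produced.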