Compiling for Multi-, Many-, and Anycore
Rudi Eigenmann, Purdue University
Compilation Issues in Multi-, Many-, and Anycore, and Sample Contributions
- Multicore: shared-memory issues revisited => Cetus, a parallelizing compiler for multicore
- Manycore: message-passing programming? => a shared-memory model for message-passing architectures
- Anycore: heterogeneity will increase complexity by an order of magnitude => moving compile-time decisions into runtime
R. Eigenmann, Compiling for Multi, Many, Anycore 2
Compiler Infrastructure for Multicore Compilation: Cetus
- Successor to Polaris
- Source-to-source translator for C, C++, Java
- Written in Java
- Supported by the U.S. National Science Foundation
- cetus.ecn.purdue.edu
Manycore: Shared-memory model for message-passing architectures Translating OpenMP to MPI
OpenMP for Message-Passing Architectures
[Figure: speedup on 1, 2, 4, 8, and 16 processors for our OpenMP-to-MPI translator vs. hand-coded MPI, on CG, EP, FT, LU, and IS from the NAS Parallel Benchmarks and ART and EQUAKE from SPEC OMP2001.]
OpenMP for Message-Passing Architectures: Comparison with High-Performance Fortran
[Figure: execution time (secs) and scalability on 1, 2, 4, 8, and 16 processors over Ethernet for the MPI, translated OpenMP, and HPF versions. Left: CG (Class B), a good case for HPF; right: LU (Class A), a bad case for HPF.]
OpenMP for Message-Passing Architectures: How Does it Work?
- All shared data is conceptually replicated.
- Iterations are statically scheduled onto processors.
- In a producer loop (for j=1,n: a[e1] = RHS), the RHS data are local reads; the compiler sends the produced data (a[e1]) to all future consumer processors.
- In a later consumer loop (for j=1,n: ... = a[e2]), the RHS data (a[e2]) are again local reads.
Anycore: Moving Compile-Time Decisions into Runtime
Is There Potential? You bet!
- Imagine you (the compiler) had full knowledge of the program's input data and execution platform.
[Figure: performance vs. degree of knowledge, from "you are here" toward 100% knowledge; an Amdahl's-law analogue for dynamic optimization.]
Early Results
- Tuning goal: determine the best combination of GCC options.
[Figure: relative performance improvement (%) of whole-program vs. subroutine-level tuning, on the train and ref data sets, for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and the geometric mean.]
Reduction of Tuning Time through Subroutine-Level Tuning
[Figure: normalized tuning time of whole-program vs. subroutine-level tuning for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and the geometric mean.]
Ongoing Work: Beyond Autotuning of Compiler Options
New applications of the tuning system:
- MPI parameter tuning
- Tuning library selection (ScaLAPACK, ...)
- OpenMP to MPI translator: best allocation of computation on host vs. GPGPU
Runtime Work Distribution on GPU vs. Host
[Figure: speedup over the CPU version for the CPU, GPU, and tuned versions as the matrix dimension size (WA = HA = WB) varies. Left: arbitrary data dimensions; right: data dimensions optimal for the GPU (128, 256, 384, 512, 768), where the tuned version reaches the higher GPU speedups.]
Conclusions
- Compilation issues for (today's) multicores are the same as for SMPs; but has their importance changed?
- Manycore, anycore: we need
  - high-level programming models (e.g., shared-memory models)
  - advanced optimizations (e.g., runtime decision support)