Compiling for Multi, Many, and Anycore. Rudi Eigenmann Purdue University

Similar documents
Analyzing the Processor Bottlenecks in SPEC CPU 2000

Fast Software-managed Code Decompression

Fast Floating Point Compression on the Cell BE Processor

VPC3: A Fast and Effective Trace-Compression Algorithm

6/16/2010 DAG Execu>on Model, Work and Depth 1 DAG EXECUTION MODEL, WORK AND DEPTH

Cloud, Distributed, Embedded. Erlang in the Heterogeneous Computing World. Omer

Score-P A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir

Running WIEN2K on Ranger

Using DDT. Debugging programs with DDT. Peter Towers. HPC Systems Section. ECMWF January 28, 2016

Using DDT. Debugging programs with DDT. Peter Towers. HPC Systems Section.

DDT and Totalview. HRSK Practical on Debugging,

Strategy, Developments & Outlook SESP September 2010 ESTEC, Noordwijk, The Netherlands

Profile-driven Selective Code Compression

Cougar 21 MTR Performance Analysis Page 1 of 7

Computational Challenges in Cold QCD. Bálint Joó, Jefferson Lab Computational Nuclear Physics Workshop SURA Washington, DC July 23-24, 2012

A new take at Adaptive Fast Multipole Methods: application, implementation, and hybrid CPU/GPU parallelism

Parsimonious Linear Fingerprinting for Time Series

The GungHo Dynamical Core

Out-of-Core Cholesky Factorization Algorithm on GPU and the Intel MIC Co-processors

- 2 - Companion Web Site. Back Cover. Synopsis

Investigating the Problems of Ship Propulsion on a Supercomputer

Rudolf Eigenmann Curriculum Vita November 2016

7),8) (GPU) SIMD ClearSpeed (GSIC) 53% TSUBAME. NVIDIA Tesla GPU TFlops. Tokyo Institute of Technology 2 JST, CREST

Actualtests ASF 45q. Number: ASF Passing Score: 800 Time Limit: 120 min File Version: 15.5 ASF. Agile Scrum Foundation

Functional safety. Functional safety of Programmable systems, devices & components: Requirements from global & national standards

Linear Compressor Discharge Manifold Design For High Thermal Efficiency

Axial and Centrifugal Compressor Mean Line Flow Analysis Method

Mouth of the Columbia River Jetties Three-Phase Construction Plan

Globalizing localization of data-limited approaches for sustainable resource management

Shearwater GeoServices. Increasing survey productivity and enhancing data quality February 2017 Steve Hepburn Acquisition Geophysicist

Computing s Energy Problem:

The Evolution of Transport Planning

Efficiency Improvement of Rotary Compressor by Improving the Discharge path through Simulation

cudimot: A CUDA toolbox for modelling the brain tissue microstructure from diffusion-mri

A Quadtree-Based Lightweight Data Compression Approach to Processing Large-Scale Geospatial Rasters

Designing Mechanisms for Reliable Internet-based Computing

Reducing Code Size with Run-time Decompression

ParaFEM: Microstructurally Faithful Modelling of Materials. Louise M. Lever, University of Manchester

Application of Computational Fluid Dynamics to Compressor Efficiency Improvement

Track B: Particle Methods Part 2 PRACE Spring School 2012

METRO System Design. Witt&Sohn AG Aug-11

A history-based estimation for LHCb job requirements

PARALLEL IMPLEMENTATION OF THE SOCIAL FORCES MODEL

sorting solutions osx separator series

Roundabout Design 101: Principles, Process, and Documentation

Online Companion to Using Simulation to Help Manage the Pace of Play in Golf

Author s Name Name of the Paper Session. Positioning Committee. Marine Technology Society. DYNAMIC POSITIONING CONFERENCE September 18-19, 2001

Single Phase Pressure Drop and Flow Distribution in Brazed Plate Heat Exchangers

Multi-Function Vehicle

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

An STPA Tool. Dajiang Suo, John Thomas

FIRE PROTECTION. In fact, hydraulic modeling allows for infinite what if scenarios including:

Evaluation of a High Performance Code Compression Method

An Efficient Code Compression Technique using Application-Aware Bitmask and Dictionary Selection Methods

AN AUTONOMOUS DRIVER MODEL FOR THE OVERTAKING MANEUVER FOR USE IN MICROSCOPIC TRAFFIC SIMULATION

Agenda Item I.5.b Supplemental NOAA Presentation November 2012

Alternative architectures for distributed ledgers. Sarah Meiklejohn (University College London)

REPORT DOCUMENTATION PAGE

OPTIMIZING THE LENGTH OF AIR SUPPLY DUCT IN CROSS CONNECTIONS OF GOTTHARD BASE TUNNEL. Rehan Yousaf 1, Oliver Scherer 1

MPCS: Develop and Test As You Fly for MSL

Predicting skipjack tuna dynamics and effects of climate change using SEAPODYM with fishing and tagging data

How Game Engines Can Inspire EDA Tools Development: A use case for an open-source physical design library

Spacecraft Simulation Tool. Debbie Clancy JHU/APL

High usability and simple configuration or extensive additional functions the choice between Airlock Login or Airlock IAM is yours!

Compact Binaries with Code Compression in a Software Dynamic Translator

MEMORY is one of the key driving factors in embeddedsystem

A STUDY OF THE LOSSES AND INTERACTIONS BETWEEN ONE OR MORE BOW THRUSTERS AND A CATAMARAN HULL

A Physics Based Link Model for Tree Vibrations

Investigation on Deep-Array Wake Losses Under Stable Atmospheric Conditions

: A WEB-BASED SOLVER FOR COMPRESSIBLE FLOW CALCULATIONS

Valve Losses in Reciprocating Compressors

Innovative and Robust Design. With Full Extension of Offshore Engineering and Design Experiences.

Hydronic Systems Balance

Instruction Cache Compression for Embedded Systems by Yujia Jin and Rong Chen

A computational study of on-demand accuracy level decomposition for twostage stochastic programs

Replay using Recomposition: Alignment-Based Conformance Checking in the Large

Transition from Scrum to Flow

Development of Fluid-Structure Interaction Program for the Mercury Target

NCHRP Superelevation Criteria for Sharp Horizontal Curves on Steep Grades. Research Team: MRIGlobal Pennsylvania State University

Design and Modeling of a Mobile Robot

Relative Dosimetry. Photons

Nearshore wave-flow modelling with SWASH

Performance of Fully Automated 3D Cracking Survey with Pixel Accuracy based on Deep Learning

Passing Vessel Analysis in Design and Engineering. Eric D. Smith, P.E. Vice President, Moffatt & Nichol

6. EXPERIMENTAL METHOD. A primary result of the current research effort is the design of an experimental

Roundabout Design Principles

Gas Vapor Injection on Refrigerant Cycle Using Piston Technology

Compact Binaries with Code Compression in a Software Dynamic Translator

SUPPLEMENT MATERIALS

Wind Flow Model of Area Surrounding the Case Western Reserve University Wind Turbine

Lifecycle Performance of Escape Systems

CT PET-2018 Part - B Phd COMPUTER APPLICATION Sample Question Paper

New Vessel Fuel Efficient Design and Construction Considerations Medium and Long-Term Options

Fully coupled modelling of complex sand controlled completions. Michael Byrne

OPTIMIZATION OF SINGLE STAGE AXIAL FLOW COMPRESSOR FOR DIFFERENT ROTATIONAL SPEED USING CFD

HWBOINTS REVISION 7. Revision 7 is designed in response to community feedback with the intent of awarding points

Evaluation of Unstructured WAVEWATCH III for Nearshore Application

Template Document of Comparison Tables Generated by the BBOB 2012 Post-Processing: Layout for the BBOB 2012 Noisy Testbed

AGW SYSTEMS. Blue Clock W38X

Polynomial DC decompositions

Transcription:

Compiling for Multi, Many, and Anycore Rudi Eigenmann Purdue University

Compilation Issues in Multi, Many, Anycore and sample contributions Multicore: shared-memory issues revisited => Cetus: parallelizing compiler for multicore Manycore: message-passing programming? => Shared-memory model for message-passing architectures Anycore: heterogeneity will increase complexity by an order of magnitude => Moving compile-time decisions into runtime R. Eigenmann, Compiling for Multi, Many, Anycore 2

Compiler Infrastructure for Multicore Compilation Cetus Successor to Polaris Source-to-source translator for C, C++, Java Supported by the U.S. National Science Foundation Written in Java cetus.ecn.purdue.edu R. Eigenmann, Compiling for Multi, Many, Anycore 3

Manycore: Shared-memory model for message-passing architectures Translating OpenMP to MPI

OpenMP for Message-Passing Architectures 25 2 Our OpenMP to MPI translator Hand-coded MPI Speedup 15 1 5 1 2 4 8 16 1 2 4 8 16 1 2 4 8 16 1 2 4 8 16 1 2 4 8 16 1 2 4 8 16 1 2 4 8 16 CG EP FT LU IS ART EQUAKE From NAS Translated Par. Benchmarks OpenMP Hand-coded MPI From SPEC OMP21 R. Eigenmann, Compiling for Multi, Many, Anycore 5

OpenMP for Message-Passing Architectures: Comparison with High-Peformance Fortran Execution Time (secs) 3 25 2 15 1 5 CG (CLASS B) on Ethernet 3 2.5 2 LU (CLASS A) on Ethernet 1.5 2 1.5 15 1 1 2 4 8 16 5 Number of Processors MPI Execution Time OpenMP Execution Time HPF Execution Time MPI Scalability 1 2 4 8 16 OpenMP Scalability HPF Scalability Number of Processors Execution Time (secs) Scalability A good case for HPF 8 6 4 2 Scalability A bad case for HPF MPI Execution Time OpenMP Execution Time HPF Execution Time MPI Scalability OpenMP Scalability HPF Scalability R. Eigenmann, Compiling for Multi, Many, Anycore 6

OpenMP for Message-Passing Architectures: How does it Work? All shared data is conceptually replicated. Iterations are statically scheduled onto processors. For j=1,n a[e1]= RHS End for M P M P M P RHS data are local reads M P For j=1,n =a[e2] End for Compiler sends data produced (a[e1]) to all future consumer processors. RHS data (a[e2]) are local reads, again R. Eigenmann, Compiling for Multi, Many, Anycore 7

Anycore: Moving Compile-time Decisions into Runtime R. Eigenmann, Compiling for Multi, Many, Anycore 8

Is there Potential? You bet! Imagine you (the compiler) had full knowledge of input data and execution platform of the program Performance 1 1 You are here Amdahl s law of dynamic optimization 1 1% knowledge R. Eigenmann, Compiling for Multi, Many, Anycore 9

Early Results Relative performance improvement percentage (%) 7 6 5 4 3 2 1 ammp applu apsi Whole-program tuning train data set Subroutine-level tuning train data set Whole-program tuning ref data set Subroutine-level tuning ref data set art equake mesa Tuning Goal: determine the best combination of GCC options mgrid R. Eigenmann, Compiling for Multi, Many, Anycore 1 sixtrack swim wupwise geometric mean

Reduction of Tuning Time through Subroutine-level Tuning Whole-program tuning Subroutine-level tuning 12. Normalized tuning time 1. 8. 6. 4. 2.. ammp applu apsi art equake mesa mgrid sixtrack swim wupwise GeoMean R. Eigenmann, Compiling for Multi, Many, Anycore 11

Ongoing Work Beyond autotuning of compiler options New applications of the tuning system MPI parameter tuning Tuning library selection - (ScalaPack,...) OpenMP to MPI translator Best allocation of computation on host vs. GPGPU R. Eigenmann, Compiling for Multi, Many, Anycore 12

Runtime Work Distribution on GPU vs. Host Speedups over CPU version Speedups over CPU version Speeuup 5 4.5 4 3.5 3 2.5 2 1.5 1.5 CPU GPU Tuned 1 3 5 7 Speedup 14 12 1 8 6 4 2 CPU GPU Tuned 128 256 384 512 768 Dimension size (WA = HA = WB) Dimension size (WA = HA = WB) Arbitrary data dimensions Data dimensions optimal for GPU R. Eigenmann, Compiling for Multi, Many, Anycore 13

Conclusions Compilation issues for (today s) multicores: same as for SMPs but has the importance changed? Manycore, anycore: We need high-level programming models E.g., shared-memory models advanced optimizations E.g., runtime decision support R. Eigenmann, Compiling for Multi, Many, Anycore 14