Cloud, Distributed, Embedded. Erlang in the Heterogeneous Computing World
Omer Kilic
@OmerK
omer@erlang-solutions.com

Outline
- Challenges in modern computing systems
- Heterogeneous computing
- Co-processors and accelerators
- Programming models and tools
- Alternate architectures
- Parallella Vision System
- Erlang Embedded Project
- Q&A
Slide 2 of 46

Challenges: Software
- Frequency wall
- Memory bottlenecks
- Software complexity
Slide 3 of 46

Amdahl's Law
The maximum speed-up achievable through parallel processing is limited by the fraction of the code that has to run serially.
Slide 4 of 46
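Stated as a formula, this is the standard form of Amdahl's law. The symbols $p$ (the fraction of run time that can be parallelised) and $N$ (the number of processors) are notation introduced here, not taken from the slide:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

As an illustrative calculation, with $p = 0.9$ and $N = 64$ the speed-up is $1 / (0.1 + 0.9/64) \approx 8.8$, and no number of processors can push it past $1 / 0.1 = 10$.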

Challenges: Hardware
- Yield issues
- Wiring and interconnect
- Thermal density
- Power consumption
- End of Moore's law imminent
Slide 5 of 46

Challenges
"With nearly 10 billion devices connected to the internet and predictions for exponential growth, we've reached a point where the space, power, and cost demands of traditional technology are no longer sustainable."
Meg Whitman, President and CEO, HP
Slide 6 of 46

Internet of Things Slide 7 of 46

Device Architectures (I) Slide 8 of 46

Device Architectures (II) Slide 9 of 46

Heterogeneous Computing (I)
- Special purpose, highly specialised architectures will outperform general purpose processing devices
  - Possibly by orders of magnitude
  - In terms of energy efficiency as well as raw speed
- Parallel execution is key
- Non-programmable/pseudo-programmable accelerators: ASIC, DSP, GPU
- Fully programmable accelerators: FPGAs
Slide 10 of 46

Open Compute Project Slide 11 of 46

Heterogeneous Computing (II) Slide 12 of 46

GPUs Slide 13 of 46

Anatomy of a GPU Slide 14 of 46

Co-processors: NetFPGA 10G Slide 15 of 46

Co-processors: Generic COTS devices Slide 16 of 46

Landscape of accelerator programming

Interface   | CUDA                   | OpenCL                         | DirectCompute      | RenderScript
Originator  | NVIDIA                 | Khronos (Apple)                | Microsoft          | Google
Year        | 2007                   | 2008                           | 2009               | 2011
Area        | HPC, desktop           | Desktop, mobile, embedded, HPC | Desktop            | Mobile
OS          | Windows, Linux, Mac OS | Windows, Linux, Mac OS (10.6+) | Windows (Vista+)   | Android (3.0+)
Devices     | GPUs (NVIDIA)          | CPUs, GPUs, custom             | GPUs (NVIDIA, AMD) | CPUs, GPUs, DSPs
Work unit   | Kernel                 | Kernel                         | Compute shader     | Compute script
Language    | CUDA C/C++             | OpenCL C                       | HLSL               | Script C
Distributed | Source, PTX            | Source                         | Source, bytecode   | LLVM bitcode

From: The landscape of accelerator programming: a view from ARM, Lokhmotov, A., 3rd UK GPU Computing Conference, London
Slide 17 of 46

Accelerator types
Programmable accelerators:
- CPU vector extensions: x86/SSE/AVX, PowerPC/VMX, ARM/NEON
- GPUs supporting general-purpose computing (GPGPUs)
- Sony/Toshiba/IBM Cell (Sony PlayStation 3, HPC)
- ClearSpeed CSX (HPC, embedded)
- Adapteva Epiphany (HPC, mobile)
- Intel MIC (HPC)
Slide 18 of 46

Programming accelerators
Proprietary low-level APIs, typically C-based:
- Vector intrinsics
- NVIDIA CUDA
- ATI Brook+
- ClearSpeed Cn
No software portability, obsolescence risk.
Slide 19 of 46

OpenCL (I)
OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs.
Slide 20 of 46

OpenCL (II)
- Allows you to write C-like code which executes on GPUs and many other devices
  - CPUs, FPGAs, various other architectures
- Key point is data parallelism: applying the same function to a large amount of data
- Allows us to leverage devices like GPUs from Erlang easily with a minimal wrapper
Slide 21 of 46
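To make the data-parallel idea on this slide concrete from the Erlang side, here is a minimal sketch of a parallel map: the same function is applied to every element of a list, each in its own process. This is plain Erlang, not OpenCL and not code from the talk, and the module name `pmap_sketch` is made up for illustration. On a GPU or the Epiphany, the per-element function would be an OpenCL kernel instead.

```erlang
-module(pmap_sketch).
-export([pmap/2]).

%% Apply F to every element of List, one process per element,
%% and return the results in the original order.
pmap(F, List) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- List],
    %% Selective receive on each reference keeps the results ordered,
    %% regardless of which worker finishes first.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

For example, `pmap_sketch:pmap(fun(X) -> X * X end, [1, 2, 3, 4])` returns `[1, 4, 9, 16]`.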

The Parallella Board Slide 22 of 46

Shiny prototype! Slide 23 of 46

The Parallella Board Slide 24 of 46

Epiphany Architecture Slide 25 of 46

Epiphany-IV 64-core 28nm (E64G401)
- 64 High Performance RISC CPU Cores
- 800 MHz Operating Frequency
- 100 GFLOPS Peak Performance
- 1.6 TB/s Local Memory Bandwidth
- 102 GB/s Network-On-Chip Bisection Bandwidth
- 6.4 GB/s Off-Chip Bandwidth
- 2 MB On-Chip Distributed Shared Memory
- 2 Watt Maximum Chip Power Consumption
- IEEE Floating Point Instruction Set
- Fully-featured ANSI-C/C++ programmable
- GNU/Eclipse based tool chain
- Source synchronous LVDS off-chip links for host or direct chip-to-chip interfacing
- Chip-to-chip links for integrating up to 64 chips on a single board
Slide 26 of 46

Parallella Vision Demo - Overview Slide 27 of 46

Parallella Vision Demo - Cameras Slide 28 of 46

Parallella Vision Demo - Architecture Slide 29 of 46

OpenCL and Erlang
- Erlang is not that great for crunching image data. This is where OpenCL fits in.
- Erlang provides an environment around OpenCL. Our server implementation collects frames, offloads processing to the Epiphany and sends the results back.
  - Low latency distributed communications and message passing between processes and nodes
  - Monitoring and supervision facilities
  - Glue between heterogeneous nodes
Slide 30 of 46
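The collect, offload and reply pattern described on this slide could be sketched as a small `gen_server`. This is an assumption-laden illustration rather than the demo's actual code: the module name `frame_server`, the message shapes and the `Offload` fun (a stand-in for whatever wrapper hands a frame to the OpenCL device) are all hypothetical.

```erlang
-module(frame_server).
-behaviour(gen_server).

-export([start_link/1, submit/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Offload is a fun(Frame) -> Result. It stands in for the call that
%% would pass the frame to the accelerator (hypothetical placeholder).
start_link(Offload) when is_function(Offload, 1) ->
    gen_server:start_link(?MODULE, Offload, []).

%% Submit a frame without blocking. The caller later receives
%% {frame_result, Id, Result} as an ordinary message.
submit(Server, {Id, Frame}) ->
    gen_server:cast(Server, {frame, self(), Id, Frame}).

init(Offload) ->
    {ok, Offload}.

handle_cast({frame, From, Id, Frame}, Offload) ->
    %% Process the frame and send the result back to the submitter.
    From ! {frame_result, Id, Offload(Frame)},
    {noreply, Offload};
handle_cast(_Other, Offload) ->
    {noreply, Offload}.

handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_info(_Msg, State) ->
    {noreply, State}.
```

Because the server is an ordinary OTP process, it can be placed under a supervisor and reached from other nodes, which is the "glue" role the slide gives to Erlang.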

OpenCL on the Parallella
- Parallella is a little different from standard GPUs
- Work sizes are different (fewer cores than a GPU)
- Requires some forethought when structuring your kernels
Slide 31 of 46

Parallella and Erlang
- Ubuntu armhf packages up and running
- Will be included in the standard distro image
- Vision Demo code available now: https://github.com/esl/parcv
Slide 32 of 46

Embedded Landscape Slide 34 of 46

#include <stats.h> Source: http://embedded.com/electronics-blogs/programming-pointers/4372180/unexpected-trends Slide 35 of 46

External Interfaces in Erlang Slide 36 of 46

Accessing hardware
- Peripherals are memory mapped
- Access via /dev/mem
  - Faster, needs root, potentially dangerous!
- or by kernel modules/sysfs
  - Slower, doesn't need root, easier, relatively safer
- Generally very messy
Slide 37 of 46
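As a rough illustration of the sysfs route, the sketch below drives a GPIO pin from Erlang by writing to the files of the legacy Linux sysfs GPIO interface (`export`, `gpioN/direction`, `gpioN/value`). The module name `sysfs_gpio` and its function names are invented for this example and are not part of Erlang/ALE. The base directory is passed in as an argument (normally `"/sys/class/gpio"`) so that the code can also be exercised against an ordinary directory.

```erlang
-module(sysfs_gpio).
-export([export/2, set_direction/3, write/3, read/2]).

%% Base is the sysfs GPIO directory, normally "/sys/class/gpio".
%% Ask the kernel to expose pin Pin as Base/gpioPin.
export(Base, Pin) ->
    file:write_file(filename:join(Base, "export"), integer_to_list(Pin)).

%% Configure the pin as an input (in) or an output (out).
set_direction(Base, Pin, Dir) when Dir =:= in; Dir =:= out ->
    file:write_file(pin_file(Base, Pin, "direction"), atom_to_list(Dir)).

%% Drive an output pin low (0) or high (1).
write(Base, Pin, Value) when Value =:= 0; Value =:= 1 ->
    file:write_file(pin_file(Base, Pin, "value"), integer_to_list(Value)).

%% Read the current level of the pin.
read(Base, Pin) ->
    case file:read_file(pin_file(Base, Pin, "value")) of
        {ok, <<"1", _/binary>>} -> {ok, 1};
        {ok, <<"0", _/binary>>} -> {ok, 0};
        {ok, Other}             -> {error, {unexpected, Other}};
        {error, Reason}         -> {error, Reason}
    end.

pin_file(Base, Pin, Name) ->
    filename:join([Base, "gpio" ++ integer_to_list(Pin), Name]).
```

Every call here opens, writes and closes a file, which is one reason this path is slower than mapping the peripheral registers through /dev/mem.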

Introducing Erlang/ALE
Actor Library for Embedded
http://github.com/esl/erlang-ale
Slide 38 of 46

Erlang/ALE
- Brings embedded peripheral interfaces into the Erlang domain
- Provides easy-to-use, familiar abstractions for Erlang programmers
- Uses the Raspberry Pi as the reference platform; easy to port to other embedded platforms
- Open source (Apache version 2)
Slide 39 of 46

Beta release
- Based on pihwm: http://omerk.github.io/pihwm
- GPIO and GPIO interrupts, SPI, I2C and PWM peripherals supported
- Documentation, supporting material and educational package under development
Slide 40 of 46

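ALE Example: Blink!

    %% Setup fragment: configure the LED pin as an output.
    {ok, _} = gpio:start_link(?led_pin, output),

    %% Switch the LED on for one second, then off for one second.
    blink() ->
        gpio:write(?led_pin, 1),
        timer:sleep(1000),
        gpio:write(?led_pin, 0),
        timer:sleep(1000).

Slide 41 of 46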

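ALE Example: Interrupts

    %% Setup fragment: configure the pin as an input and
    %% request an interrupt on rising edges.
    {ok, _} = gpio:start_link(?in_pin, input),
    ok = gpio:set_int(?in_pin, rising),

    %% The interrupt arrives as an ordinary message. Assuming this is a
    %% gen_server callback, it has to return {noreply, State}.
    handle_info({gpio_interrupt, _Pin, _Condition}, State) ->
        blink(),
        {noreply, State}.

Slide 42 of 46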
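For a fuller picture of how such a handler sits inside a process, here is a self-contained sketch of a `gen_server` that counts interrupt messages of the shape shown on the slide, `{gpio_interrupt, Pin, Condition}`. The module name `edge_counter` is hypothetical, and the sketch deliberately does not call Erlang/ALE, so it can be tried on any machine by sending the message by hand. On real hardware the pin setup from the slide would go in `init/1`, on the assumption that the interrupt message is delivered to the process that requested it.

```erlang
-module(edge_counter).
-behaviour(gen_server).

-export([start_link/0, count/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Number of interrupt messages seen so far.
count(Server) ->
    gen_server:call(Server, count).

init([]) ->
    %% With Erlang/ALE, the pin would be opened and its interrupt
    %% enabled here, as in the slide's setup fragment.
    {ok, 0}.

%% Each interrupt message increments the counter.
handle_info({gpio_interrupt, _Pin, _Condition}, Count) ->
    {noreply, Count + 1};
handle_info(_Other, Count) ->
    {noreply, Count}.

handle_call(count, _From, Count) ->
    {reply, Count, Count}.

handle_cast(_Msg, Count) ->
    {noreply, Count}.
```

Simulating two rising edges with `Pid ! {gpio_interrupt, 17, rising}` twice and then calling `edge_counter:count(Pid)` returns `2`. Keeping the handler short matters, because a long-running call inside `handle_info/2` delays every message queued behind it.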

Hardware Projects Demo Board Slide 43 of 46

Packages for Embedded Architectures https://www.erlang-solutions.com/downloads/download-erlang-otp Slide 44 of 46

Erlang Slide 45 of 46

Thank you
http://erlang-embedded.com
embedded@erlang-solutions.com
@ErlangEmbedded
"The world is concurrent. Things in the world don't share data. Things communicate with messages. Things fail."
Joe Armstrong, Father of Erlang
Slide 46 of 46