Cloud, Distributed, Embedded. Erlang in the Heterogeneous Computing World
Omer Kilic
@OmerK
omer@erlang-solutions.com
Outline
- Challenges in modern computing systems
- Heterogeneous computing
- Co-processors and accelerators
- Programming models and tools
- Alternate architectures
- Parallella Vision System
- Erlang Embedded Project
- Q&A
Slide 2 of 46
Challenges: Software
- Frequency wall
- Memory bottlenecks
- Software complexity
Amdahl's Law
The maximum speed-up achievable through parallel processing is limited by the fraction of the code that has to run serially.
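The statement above is usually written as a formula. The symbols are assumptions introduced here for illustration, not taken from the slide: p is the fraction of the run time that can be parallelised and N is the number of processing elements.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% Worked example with assumed values p = 0.9 and N = 64:
S(64) = \frac{1}{0.1 + \dfrac{0.9}{64}} = \frac{1}{0.1140625} \approx 8.77,
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{0.1} = 10
```

Even with 90% of the work parallelised, 64 cores give less than a 9x speed-up, and no number of cores can exceed 10x.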
Challenges: Hardware
- Yield issues
- Wiring and interconnect
- Thermal density
- Power consumption
- End of Moore's law imminent
Challenges
"With nearly 10 billion devices connected to the internet and predictions for exponential growth, we've reached a point where the space, power, and cost demands of traditional technology are no longer sustainable."
Meg Whitman, President and CEO, HP
Internet of Things
Device Architectures (I)
Device Architectures (II)
Heterogeneous Computing (I)
- Special purpose, highly specialised architectures will outperform general purpose processing devices
  - Possibly by orders of magnitude
  - In terms of energy efficiency as well as raw speed
- Parallel execution is key
- Non-programmable/pseudo-programmable accelerators: ASIC, DSP, GPU
- Fully programmable accelerators: FPGAs
Open Compute Project
Heterogeneous Computing (II)
GPUs
Anatomy of a GPU
Co-processors: NetFPGA 10G
Co-processors: Generic COTS devices
Landscape of accelerator programming

Interface   | CUDA                   | OpenCL                         | DirectCompute      | RenderScript
Originator  | NVIDIA                 | Khronos (Apple)                | Microsoft          | Google
Year        | 2007                   | 2008                           | 2009               | 2011
Area        | HPC, desktop           | Desktop, mobile, embedded, HPC | Desktop            | Mobile
OS          | Windows, Linux, Mac OS | Windows, Linux, Mac OS (10.6+) | Windows (Vista+)   | Android (3.0+)
Devices     | GPUs (NVIDIA)          | CPUs, GPUs, custom             | GPUs (NVIDIA, AMD) | CPUs, GPUs, DSPs
Work unit   | Kernel                 | Kernel                         | Compute shader     | Compute script
Language    | CUDA C/C++             | OpenCL C                       | HLSL               | Script C
Distributed | Source, PTX            | Source                         | Source, bytecode   | LLVM bitcode

From: "The landscape of accelerator programming: a view from ARM", Lokhmotov, A., 3rd UK GPU Computing Conference, London
Accelerator types
Programmable accelerators:
- CPU vector extensions: x86/SSE/AVX, PowerPC/VMX, ARM/NEON
- GPUs supporting general-purpose computing (GPGPUs)
- Sony/Toshiba/IBM Cell (Sony PlayStation 3, HPC)
- ClearSpeed CSX (HPC, embedded)
- Adapteva Epiphany (HPC, mobile)
- Intel MIC (HPC)
Programming accelerators
- Proprietary low-level APIs, typically C-based:
  - Vector intrinsics
  - NVIDIA CUDA
  - ATI Brook+
  - ClearSpeed Cn
- No software portability, obsolescence risk.
OpenCL (I)
OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs.
OpenCL (II)
- Lets you write C-like code that executes on GPUs and many other devices
  - CPUs, FPGAs, various other architectures
- The key point is data parallelism: applying the same function to a large amount of data
- Lets us leverage devices like GPUs from Erlang easily, with a minimal wrapper
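To make "applying the same function to a large amount of data" concrete, here is a minimal sketch in plain Erlang. It does not use OpenCL: one Erlang process per element stands in for one work-item on an accelerator. The module name pmap_sketch and the function pmap/2 are hypothetical names chosen for this illustration.

```erlang
-module(pmap_sketch).
-export([pmap/2]).

%% pmap(F, List) -> [F(X) || X <- List], computed concurrently.
%% Every element is handled by its own process running the same
%% function F, which is the data-parallel pattern in miniature.
pmap(F, List) ->
    Parent = self(),
    %% Tag each worker with a unique reference so that results can be
    %% collected in the original order, whatever order they arrive in.
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- List],
    %% Selective receive: wait for each reference in list order.
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

For example, pmap_sketch:pmap(fun(X) -> X * X end, [1, 2, 3, 4]) returns [1, 4, 9, 16]. On a real accelerator the per-element function would be an OpenCL kernel rather than an Erlang fun, and the runtime would schedule the work-items.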
The Parallella Board
Shiny prototype!
The Parallella Board
Epiphany Architecture
Epiphany-IV 64-core 28nm (E64G401)
- 64 High Performance RISC CPU Cores
- 800 MHz Operating Frequency
- 100 GFLOPS Peak Performance
- 1.6 TB/s Local Memory Bandwidth
- 102 GB/s Network-On-Chip Bisection Bandwidth
- 6.4 GB/s Off-Chip Bandwidth
- 2 MB On-Chip Distributed Shared Memory
- 2 Watt Maximum Chip Power Consumption
- IEEE Floating Point Instruction Set
- Fully-featured ANSI-C/C++ programmable
- GNU/Eclipse based tool chain
- Source synchronous LVDS off chip links for host or direct chip-to-chip interfacing
- Chip to chip links for integrating up to 64 chips on a single board
Parallella Vision Demo - Overview
Parallella Vision Demo - Cameras
Parallella Vision Demo - Architecture
OpenCL and Erlang
- Erlang is not that great for crunching image data. This is where OpenCL fits in.
- Erlang provides an environment around OpenCL. Our server implementation collects frames, offloads processing to the Epiphany and sends the results back.
  - Low latency distributed communications and message passing between processes and nodes
  - Monitoring and supervision facilities
  - Glue between heterogeneous nodes
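The collect, offload and reply loop described above could be sketched as a gen_server. This is an illustration under stated assumptions, not the parcv implementation: the module name frame_server_sketch, the submit/2 API and the offload/1 helper are all hypothetical, and offload/1 simply inverts each byte on the CPU where a real system would hand the frame to an OpenCL kernel.

```erlang
-module(frame_server_sketch).
-behaviour(gen_server).

-export([start_link/0, submit/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Submit one frame (a binary). The caller blocks until the processed
%% frame comes back.
submit(Server, Frame) when is_binary(Frame) ->
    gen_server:call(Server, {frame, Frame}).

init([]) ->
    %% Keep a simple counter of processed frames as server state.
    {ok, #{processed => 0}}.

handle_call({frame, Frame}, _From, State = #{processed := N}) ->
    Result = offload(Frame),
    {reply, {ok, Result}, State#{processed := N + 1}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

%% Hypothetical stand-in for the accelerator call: inverts every byte
%% of the frame. Replace with a call into an OpenCL binding.
offload(Frame) ->
    << <<(255 - Px)>> || <<Px>> <= Frame >>.
```

Because it is an ordinary OTP process, a server like this can be placed under a supervisor and called from other nodes, which is where the monitoring and glue roles listed above come in.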
OpenCL on the Parallella
- The Parallella is a little different from standard GPUs
- Work sizes are different (far fewer cores than a GPU)
- Structuring your kernels requires some forethought
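One piece of that forethought is deciding how many work items each core gets. The sketch below divides a number of work items into contiguous per-core ranges. The module name work_split_sketch and the function split/2 are hypothetical, and this is not how any OpenCL runtime partitions work internally. It only illustrates the arithmetic.

```erlang
-module(work_split_sketch).
-export([split/2]).

%% split(Total, Cores) -> [{Offset, Length}]
%% Divide Total work items into at most Cores contiguous ranges whose
%% lengths differ by at most one. Ranges of length zero are omitted.
split(Total, Cores) when is_integer(Total), Total >= 0,
                         is_integer(Cores), Cores > 0 ->
    Base = Total div Cores,   % items every core gets
    Extra = Total rem Cores,  % first Extra cores get one more
    split(0, Cores, Base, Extra, []).

split(_Offset, 0, _Base, _Extra, Acc) ->
    lists:reverse(Acc);
split(Offset, Left, Base, Extra, Acc) ->
    Len = case Extra > 0 of
              true -> Base + 1;
              false -> Base
          end,
    case Len of
        0 ->
            %% Fewer items than cores: the remaining cores stay idle.
            lists:reverse(Acc);
        _ ->
            split(Offset + Len, Left - 1, Base, max(Extra - 1, 0),
                  [{Offset, Len} | Acc])
    end.
```

For example, work_split_sketch:split(10, 4) returns [{0,3},{3,3},{6,2},{8,2}]. With an assumed 640x480 frame (307200 pixels) and the 64 cores of the Epiphany-IV, each core would receive 4800 pixels.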
Parallella and Erlang
- Ubuntu armhf packages up and running
  - Will be included in the standard distro image
- Vision Demo code available now
  - https://github.com/esl/parcv
Embedded Landscape
#include <stats.h> Source: http://embedded.com/electronics-blogs/programming-pointers/4372180/unexpected-trends
External Interfaces in Erlang
Accessing hardware
- Peripherals are memory mapped
- Access via /dev/mem
  - Faster, needs root, potentially dangerous!
- or by kernel modules/sysfs
  - Slower, doesn't need root, easier, relatively safer
- Generally very messy
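The sysfs route can be sketched in Erlang with nothing but file writes. This is a minimal sketch and not taken from any library: it assumes the legacy Linux sysfs GPIO layout (an export file plus gpioN/direction and gpioN/value files, normally under /sys/class/gpio), and the module and function names are hypothetical. The base directory is a parameter so the code can be tried against a scratch directory first. Whether a given kernel and board accept these writes is platform dependent.

```erlang
-module(sysfs_gpio_sketch).
-export([export/2, set_direction/3, write/3]).

%% Ask the kernel to expose pin Pin under Base/gpio<Pin>/.
%% On a real board Base would be "/sys/class/gpio" (assumption).
export(Base, Pin) when is_integer(Pin), Pin >= 0 ->
    file:write_file(filename:join(Base, "export"),
                    integer_to_list(Pin)).

%% Configure the pin as an input (in) or an output (out).
set_direction(Base, Pin, Dir) when Dir =:= in; Dir =:= out ->
    file:write_file(pin_file(Base, Pin, "direction"),
                    atom_to_list(Dir)).

%% Drive an output pin low (0) or high (1).
write(Base, Pin, Level) when Level =:= 0; Level =:= 1 ->
    file:write_file(pin_file(Base, Pin, "value"),
                    integer_to_list(Level)).

%% Build the path Base/gpio<Pin>/<Name>.
pin_file(Base, Pin, Name) ->
    filename:join([Base, "gpio" ++ integer_to_list(Pin), Name]).
```

Each call returns ok or {error, Reason} straight from file:write_file/2, so a missing export or a permissions problem shows up as an ordinary Erlang error tuple. Spreading such file handling through application code is one source of the messiness noted above.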
Introducing Erlang/ALE
Actor Library for Embedded
http://github.com/esl/erlang-ale
Erlang/ALE
- Brings embedded peripheral interfaces into the Erlang domain
- Provides easy to use, familiar abstractions for Erlang programmers
- Uses the Raspberry Pi as the reference platform; easy to port to other embedded platforms
- Open source (Apache License, version 2)
Beta release
- Based on pihwm
  - http://omerk.github.io/pihwm
- GPIO and GPIO interrupts, SPI, I2C and PWM peripherals supported
- Documentation, supporting material and educational package under development
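ALE Example: Blink!

    %% Setup: start the GPIO process for the LED pin as an output.
    {ok, _} = gpio:start_link(?led_pin, output),

    %% Switch the LED on for one second, then off for one second.
    blink() ->
        gpio:write(?led_pin, 1),
        timer:sleep(1000),
        gpio:write(?led_pin, 0),
        timer:sleep(1000).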
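ALE Example: Interrupts

    %% Setup: configure the pin as an input and ask for rising-edge interrupts.
    {ok, _} = gpio:start_link(?in_pin, input),
    ok = gpio:set_int(?in_pin, rising),

    %% Interrupts arrive as ordinary messages. Blink once, then return the
    %% tuple a gen_server expects from handle_info/2.
    handle_info({gpio_interrupt, _Pin, _Condition}, State) ->
        blink(),
        {noreply, State}.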
Hardware Projects Demo Board
Packages for Embedded Architectures https://www.erlang-solutions.com/downloads/download-erlang-otp
Erlang
Thank you
http://erlang-embedded.com
embedded@erlang-solutions.com
@ErlangEmbedded

"The world is concurrent. Things in the world don't share data. Things communicate with messages. Things fail."
- Joe Armstrong, Father of Erlang