Introduction to Parallelism in CASTEP

Introduction to Parallelism in CASTEP, Phil Hasnip, August 2016

Recap: what does CASTEP do?
CASTEP solves the Kohn-Sham equations for electrons in a periodic array of nuclei:
$\hat{H}_k[\rho]\,\psi_{bk} = E_{bk}\,\psi_{bk}$
where particle $b$ has the $b$th solution ("band") at the Brillouin zone sampling point $k$, and
$\hat{H}_k[\rho] = -\frac{\hbar^2}{2m}\nabla^2 + \hat{V}_{HXC}[\rho] + \hat{V}_{ext}$.

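Bloch's theorem and plane-waves
Recall that Bloch's theorem lets us write:
$\psi_{bk}(\mathbf{r}) = e^{i\mathbf{k}\cdot\mathbf{r}}\,u_{bk}(\mathbf{r})$,
where $u_{bk}(\mathbf{r})$ is periodic and $e^{i\mathbf{k}\cdot\mathbf{r}}$ is an arbitrary phase factor. We express $u_{bk}(\mathbf{r})$ as a Fourier series:
$u_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\,e^{i\mathbf{G}\cdot\mathbf{r}}$
$\psi_{bk}(\mathbf{r}) = e^{i\mathbf{k}\cdot\mathbf{r}} \sum_{\mathbf{G}} c_{Gbk}\,e^{i\mathbf{G}\cdot\mathbf{r}} = \sum_{\mathbf{G}} c_{Gbk}\,e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}$
where $c_{Gbk}$ are complex Fourier coefficients, and the sum is over all the reciprocal lattice vectors, or G-vectors.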

The wavefunction
The wavefunction is one of the main data objects in CASTEP:
$\psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\,e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}$
The complex coefficients $c_{Gbk}$ are what CASTEP is trying to compute, and take up a lot of the computer's memory.
G: a reciprocal lattice vector ("G-vector")
b: band index
k: a Brillouin zone sampling point ("k-point")

k-point sampling
The bands for different k-points are independent of each other, so we get a different set of Kohn-Sham equations at each:
$\hat{H}_k[\rho]\,\psi_{bk} = E_{bk}\,\psi_{bk}$
where $\rho(\mathbf{r}) = \sum_{bk} |\psi_{bk}(\mathbf{r})|^2$.

Where does CASTEP spend its time?
Applying $\hat{H}_k$ to $\psi_{bk}$: the kinetic energy is applied in reciprocal-space and the local potential is applied in real-space, so we need to Fourier transform between the two spaces.
Orthogonalisation of $\psi_{bk}$: we need to ensure our trial bands are orthogonal to each other. We compute the overlap matrix between all pairs of bands, and invert it.

Fourier transforms
A 3D Fourier transform can be performed as 3 separate 1D transformations, one in each direction (x, y and z). The time to transform $\psi_{bk}(\mathbf{G}) \to \psi_{bk}(\mathbf{r})$ scales as $N_G \log N_G$. Every band at every k-point has to be transformed, so the total time scales as $N_G N_b N_k \log N_G$.
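The statement that a 3D transform splits into three 1D transforms can be checked numerically. The following is a minimal NumPy sketch on a small, made-up grid (the grid size and random data are assumptions for illustration, not CASTEP data structures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8x6x4 grid of complex plane-wave coefficients
coeffs = rng.standard_normal((8, 6, 4)) + 1j * rng.standard_normal((8, 6, 4))

# Transform one direction at a time: z, then y, then x
step_z = np.fft.ifft(coeffs, axis=2)
step_y = np.fft.ifft(step_z, axis=1)
step_x = np.fft.ifft(step_y, axis=0)

# Reference: the full 3D transform in a single call
full_3d = np.fft.ifftn(coeffs)

# True if the three 1D passes reproduce the 3D transform
separable_ok = np.allclose(step_x, full_3d)
```

The order of the three passes does not change the result; z, y, x is used here only to mirror the order used for the distributed transform later in the talk.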

Orthogonalisation
We construct the band-overlap matrix for each k-point, $S_{nmk} = \langle\psi_{nk}|\psi_{mk}\rangle$. Total time scales as $N_G N_b^2 N_k$.
Invert $S_k$ to find an orthogonalising transformation at each k-point. Total time scales as $N_b^3 N_k$.
Apply the transformation to get orthogonal bands. Total time scales as $N_G N_b^2 N_k$.
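The three steps can be illustrated for a single k-point with NumPy. This sketch assumes a symmetric (Löwdin-style) transformation $S^{-1/2}$ as the "orthogonalising transformation"; the slides do not say which transformation CASTEP actually uses, and the array sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n_g, n_b = 200, 6  # hypothetical numbers of G-vectors and bands

# Columns of c are trial bands expressed as plane-wave coefficients
c = rng.standard_normal((n_g, n_b)) + 1j * rng.standard_normal((n_g, n_b))

# Step 1: overlap matrix S_nm = sum_G conj(c_Gn) c_Gm, cost ~ N_G * N_b^2
s = c.conj().T @ c

# Step 2: S^(-1/2) from the eigen-decomposition of Hermitian S, cost ~ N_b^3
# (assumed choice of transformation, not confirmed as CASTEP's)
evals, evecs = np.linalg.eigh(s)
s_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.conj().T

# Step 3: apply the transformation to the bands, cost ~ N_G * N_b^2
c_orth = c @ s_inv_sqrt

# The transformed bands should now have the identity as overlap matrix
s_new = c_orth.conj().T @ c_orth
```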

Large calculations
As we simulate larger and larger systems, $N_G$ and $N_b$ increase and $N_k$ decreases. Time for Fourier transforms scales as $N_G N_b N_k \log N_G$. Time for orthogonalisation scales as $N_G N_b^2 N_k$. Orthogonalisation dominates in large calculations.
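Dividing the two scaling laws gives an orthogonalisation-to-FFT ratio of roughly $N_b / \log N_G$, which grows with system size. A rough sketch, using the sizes quoted for the TiN and Al2O3 benchmarks elsewhere in this talk and ignoring all prefactors (so only the trend is meaningful, not the absolute values):

```python
import math

def fft_cost(n_g, n_b, n_k):
    """FFT scaling from the slides: N_G log(N_G) N_b N_k (prefactor ignored)."""
    return n_g * math.log(n_g) * n_b * n_k

def orth_cost(n_g, n_b, n_k):
    """Orthogonalisation scaling from the slides: N_G N_b^2 N_k (prefactor ignored)."""
    return n_g * n_b**2 * n_k

def orth_to_fft_ratio(n_g, n_b, n_k):
    """Ratio of the two scaling estimates; reduces to N_b / log(N_G)."""
    return orth_cost(n_g, n_b, n_k) / fft_cost(n_g, n_b, n_k)

# Benchmark sizes as quoted in the talk
tin_ratio = orth_to_fft_ratio(n_g=10_972, n_b=164, n_k=8)
al2o3_ratio = orth_to_fft_ratio(n_g=88_184, n_b=778, n_k=2)
```

The larger benchmark gives the larger ratio, in line with the claim that orthogonalisation dominates as calculations grow.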

As the simulation system gets bigger and bigger, the orthogonalisation time dominates. We want to be able to use more computer cores in our calculation to speed it up. How can we do this?

k-point parallelism
Bands at different k-points are almost entirely independent of each other, so give each core a subset of the k-points. Each core then solves a subset of the Kohn-Sham equations. Cores only communicate when constructing the density, $\rho(\mathbf{r}) = \sum_{bk} |\psi_{bk}(\mathbf{r})|^2$.

TiN Benchmark
The TiN simulation is a small standard benchmark: 33 atoms, 8 k-points, 164 bands and 10,972 G-vectors.

k-point parallelism in action

k-points and large systems
k-point parallelism is almost perfect. However, as simulations get bigger, $N_k$ gets smaller, so the bigger the simulation, the fewer cores we can use!

G-vector parallelism
$S_{nmk} = \langle\psi_{nk}|\psi_{mk}\rangle = \sum_{\mathbf{G}} c^{*}_{Gnk}\,c_{Gmk}$
Orthogonalisation is a sum over G-vectors, so give each core a subset of the G-vectors. Contributions to $S$ are summed over cores. $N_G$ is large, so we can use lots of cores. As the simulation size increases, $N_G$ also increases.
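Because the overlap is a plain sum over G-vectors, each core can compute a partial overlap from its own G-vectors and the partial results can simply be added. A minimal serial NumPy sketch of that idea, with the "cores" simulated by array chunks (sizes are made up; a real code would use a parallel reduction rather than a Python `sum`):

```python
import numpy as np

rng = np.random.default_rng(2)
n_g, n_b, n_core = 120, 5, 4  # hypothetical sizes

c = rng.standard_normal((n_g, n_b)) + 1j * rng.standard_normal((n_g, n_b))

# Reference: overlap matrix computed from all G-vectors at once
s_full = c.conj().T @ c

# Give each simulated core a contiguous subset of the G-vector indices
chunks = np.array_split(np.arange(n_g), n_core)

# Each core computes the contribution from its own G-vectors only
partial = [c[idx].conj().T @ c[idx] for idx in chunks]

# Summing the contributions over cores recovers the full overlap matrix
s_summed = sum(partial)
```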

G-vector parallelism in action

G-vector parallelism
Each core only has some of the G-vectors. Each core only has some of the real-space r-vectors. Fourier transform: all G-vectors contribute at all points in real-space.

G-vector parallelism
The 3D transform can be performed as three 1D transforms:
1. Give each core all G-vectors in a column in z.
2. Each core does the transform in z.
3. All cores swap data so they have y-column data.
4. Each core does the transform in y.
5. All cores swap data so they have x-column data.
6. Each core does the transform in x.
7. Each core ends up with real-space data in x.

G-vector parallelism
Start: G-vectors inside the cut-off sphere are put on the grid.
Now perform FFT in z-direction...
Transpose (swap) data into y-columns.
Now perform FFT in y-direction...
Transpose data into x-columns.
Now perform FFT in x-direction...
Now have real-space data in x-columns.

G-vector parallelism
The actual transforms distribute well, but the transpositions are a problem: every core has to communicate with every other core, so the communication time scales as $N_{core}^2$. As $N_{core}$ increases, the Fourier transform will dominate. When the communication time is comparable to the compute time, there's no point using more cores (it might even make CASTEP slower). There are ways to optimise the FFT, but the basic problem remains the same.
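The trade-off can be pictured with a toy cost model: compute time that shrinks as $1/N_{core}$ plus communication time that grows as $N_{core}^2$. The constants below are arbitrary assumptions chosen only to show the shape of the curve; they are not measured CASTEP timings:

```python
def toy_time(n_core, work=1.0e6, comm=1.0):
    """Toy model: perfectly divided compute plus all-to-all communication.

    `work` and `comm` are arbitrary illustrative constants.
    """
    return work / n_core + comm * n_core**2

# Evaluate the model on powers of two from 1 to 1024 cores
core_counts = [2**p for p in range(11)]
times = {n: toy_time(n) for n in core_counts}

# Core count with the smallest modelled time
best = min(times, key=times.get)
```

In this toy model the time falls at first, reaches a minimum, and then rises again as the quadratic communication term takes over, which is the qualitative behaviour described above.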

G-vector parallelism in action

k and G parallelism
k-point and G-vector parallelism are independent, so we can combine both to improve scaling. E.g. if $N_k = 2$, $N_G = 9,000$ and $N_{core} = 6$:

Data                     k-point 1   k-point 2
G-vectors 1-3,000        core 1      core 4
G-vectors 3,001-6,000    core 2      core 5
G-vectors 6,001-9,000    core 3      core 6

For any k-point, the G-vector data is split across 3 cores: this is 3-way G-vector parallelism. For any subset of G-vectors, the k-point data is split between 2 cores: this is 2-way k-point parallelism.
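The layout in this example can be generated programmatically. The sketch below assumes cores are numbered so that consecutive cores share a k-point and the G-vectors are split into equal contiguous blocks; the function name `kg_layout` and the numbering convention are illustrative assumptions, not CASTEP's internal scheme:

```python
def kg_layout(n_k, n_g, n_core):
    """Map core number -> (k-point, first G-vector, last G-vector), all 1-based.

    Assumes n_core is a multiple of n_k (hypothetical simplification).
    """
    if n_core % n_k != 0:
        raise ValueError("this sketch needs n_core to be a multiple of n_k")
    gv_ways = n_core // n_k  # cores sharing each k-point
    layout = {}
    for core in range(n_core):
        k_index = core // gv_ways   # which k-point this core works on
        g_block = core % gv_ways    # which block of G-vectors it holds
        g_first = g_block * n_g // gv_ways + 1
        g_last = (g_block + 1) * n_g // gv_ways
        layout[core + 1] = (k_index + 1, g_first, g_last)
    return layout

# The example from the slide: 2 k-points, 9,000 G-vectors, 6 cores
layout = kg_layout(n_k=2, n_g=9000, n_core=6)
```

Here `layout[5]` is `(2, 3001, 6000)`: core 5 holds the middle block of G-vectors for the second k-point.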

k+G parallelism in action

Al2O3 Benchmark
The Al2O3 surface simulation is a larger standard benchmark: 270 atoms, 2 k-points, 778 bands and 88,184 G-vectors.

k+G parallelism in action

Multi-core nodes
It is common to have multi-core processors, and there may be several processors per node. Each core on the same node can access shared memory (RAM), so communications can use this shared RAM instead of the network. Access this via the parameter:
num_proc_in_smp : <integer>

k+G+SMP parallelism in action

Optimal performance
It is always worth exploiting k-point parallelism when you can, but not all computers let you run on any number of cores. If you can't use $N_{core} = N_k$ then try to have a high common factor between them. E.g. if $N_k = 35$, $N_{core} = 35$ will give an excellent speed-up, but $N_{core} = 5$ or 7 will also be very efficient. $N_{core} = 20$ or 21 would use some G-vector parallelism, but also give good efficiency. (Note that 2-way G-vector parallelism is not very quick, so $N_{core} = 10$ or 14 might not be the best choices.)

Optimal performance
Remember that as you increase the G-vector parallelism, the communication time increases. Eventually your calculation will scale poorly, and if you keep increasing $N_{core}$ it will even start to run slower. CASTEP always defaults to using as much k-point parallelism as it can, and then uses G-vector parallelism across any other cores.
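One way to reason about a candidate core count is to work out how it would split between k-point and G-vector parallelism. The sketch below assumes that "as much k-point parallelism as it can" means the greatest common factor of $N_k$ and $N_{core}$; that reading fits the advice about common factors, but it is an assumption rather than a documented CASTEP rule, and `default_split` is a hypothetical helper:

```python
import math

def default_split(n_k, n_core):
    """Return (k-point ways, G-vector ways) for a given core count.

    Assumes the k-point ways equal gcd(n_k, n_core); the remaining
    factor of n_core is used for G-vector parallelism.
    """
    kp_ways = math.gcd(n_k, n_core)
    gv_ways = n_core // kp_ways
    return kp_ways, gv_ways

# Candidate core counts for a 35 k-point calculation
splits = {n: default_split(35, n) for n in (5, 7, 10, 14, 20, 21, 35)}
```

Under this assumption, 10 and 14 cores both come out as 2-way G-vector parallelism, while 20 and 21 cores give 4-way and 3-way G-vector parallelism respectively.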

Very large calculations and the Γ-point
For isolated or very large simulation systems you only need 1 k-point. For well-isolated or extremely large simulations this k-point can be the special point k=(0,0,0), called the Γ-point. Why do we care?

Γ-point calculations
Bands at Γ are real in real-space, not complex, so the Fourier coefficients for -G are the complex conjugate of those at G. We don't bother with -G; we only need half the G-vectors. Inner products are real, not complex, so we don't need to bother computing the imaginary parts. Bands take up only half as much memory. The FFT is about 2x faster and orthogonalisation 8x faster. CASTEP detects if you're only using the Γ-point and uses these optimisations automatically.
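The conjugate symmetry of a real function's Fourier coefficients is easy to verify with NumPy, whose real-input transforms exploit it in the same spirit by storing only about half of the coefficients. The grid and data below are made up for illustration and say nothing about how CASTEP stores its bands:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

# A made-up band that is purely real in real-space
band_real_space = rng.standard_normal((n, n, n))
coeffs = np.fft.fftn(band_real_space)

# The coefficient for -G sits at index (-i mod n) along each axis
idx = (1, 2, 3)
neg_idx = tuple(-i % n for i in idx)
one_pair_ok = np.isclose(coeffs[neg_idx], np.conj(coeffs[idx]))

# Same check for every G at once: build the array of c(-G) and compare
coeffs_at_minus_g = np.roll(coeffs[::-1, ::-1, ::-1], 1, axis=(0, 1, 2))
all_pairs_ok = np.allclose(coeffs_at_minus_g, np.conj(coeffs))

# Real-input FFT keeps only the non-redundant half along the last axis
half = np.fft.rfftn(band_real_space)
restored = np.fft.irfftn(half, s=band_real_space.shape)
```

Here `half` has shape `(6, 6, 4)` instead of `(6, 6, 6)`, yet it is enough to restore the original real-space data.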

Γ-point calculations
No k-point parallelism ($N_k = 1$). The orthogonalisation speed-up is better than the FFT one, so Γ-point calculations can show poorer scaling. Worth using if this k-point sampling is sufficiently accurate. Occasionally worth using a bigger simulation cell if it allows accurate Γ-point sampling.

Wavefunction revision
Recall that the wavefunction is:
$\psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\,e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}$
G: a reciprocal lattice vector ("G-vector"), distributed in G-vector parallelism
k: a Brillouin zone sampling point ("k-point"), distributed in k-point parallelism
b: band index. Can we distribute bands? Yes!

The Hamiltonian is the same for all bands at the same k-point, and Fourier transforms of different bands are independent, so we get perfect scaling with band-parallelism when applying the Hamiltonian.

Orthogonalisation
We need to construct the overlap matrix $S$ at each k-point, $S_{nm} = \langle\psi_n|\psi_m\rangle$. The inner product is between all pairs of bands, so we need all-to-all communications. As band-parallelism increases, communication dominates.

Band distribution
We distribute the bands in a round-robin fashion, e.g. if $N_b = 11$ and $N_{core} = 3$:

Core   Bands
1      1, 4, 7, 10
2      2, 5, 8, 11
3      3, 6, 9

For the FFTs each core just transforms its own bands.
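The round-robin assignment in this example can be written in a few lines. This is a minimal sketch with 1-based core and band numbers to match the slide; the helper name `round_robin_bands` is hypothetical:

```python
def round_robin_bands(n_b, n_core):
    """Deal bands 1..n_b out to cores 1..n_core one at a time, in turn."""
    return {
        core + 1: list(range(core + 1, n_b + 1, n_core))
        for core in range(n_core)
    }

# The example from the slide: 11 bands over 3 cores
bands_per_core = round_robin_bands(n_b=11, n_core=3)
```

Dealing bands out in turn keeps the load balanced to within one band per core even when $N_b$ is not a multiple of $N_{core}$.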

Band parallelism performance
FFTs scale perfectly. Orthogonalisation requires all-to-all communication amongst the cores, so the communication time scales as $N_{core}^2$ and communications dominate as $N_{core}$ increases.

k, G and B parallelism
k-point, G-vector and band-parallelism are independent, so we can combine all three to improve scaling. Define:
kp-group: group of cores with same G-vectors and bands, but different k-points
gv-group: group of cores with same k-points and bands, but different G-vectors
bnd-group: group of cores with same G-vectors and k-points, but different bands
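These three group definitions can be made concrete by giving every core a (k-point, G-vector, band) coordinate and collecting the cores that differ in only one coordinate. The ordering of cores below (band index varying fastest) is an arbitrary assumption for illustration; the slides do not say how CASTEP numbers its cores:

```python
from itertools import product

def build_coords(n_kp, n_gv, n_bnd):
    """Assign each core rank a (k, g, b) coordinate; ordering is hypothetical."""
    grid = product(range(n_kp), range(n_gv), range(n_bnd))
    return dict(enumerate(grid))

def group_of(coords, rank, vary):
    """Cores matching `rank` on every coordinate except position `vary`.

    vary=0 gives the kp-group, vary=1 the gv-group, vary=2 the bnd-group.
    """
    fixed = [i for i in range(3) if i != vary]
    mine = coords[rank]
    return sorted(
        r for r, c in coords.items() if all(c[i] == mine[i] for i in fixed)
    )

# 2-way k-point x 3-way G-vector x 2-way band parallelism on 12 cores
coords = build_coords(n_kp=2, n_gv=3, n_bnd=2)
kp_group = group_of(coords, rank=0, vary=0)
gv_group = group_of(coords, rank=0, vary=1)
bnd_group = group_of(coords, rank=0, vary=2)
```

Each core belongs to exactly one group of each kind, and the group sizes multiply to give the total number of cores.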

k+G+SMP+band parallelism in action

CASTEP performance
Everything scales well with k-point parallelism. As the number of cores in the gv-group increases, the communication time in the FFT dominates. As the number of cores in the bnd-group increases, the communication time in the orthogonalisation dominates. We need to find the right balance between gv- and bnd-parallelism.

Using band-parallelism
Relatively new functionality. Will compute ground-state energies, forces and stresses. Accessed via a devel_code string, e.g.:
devel_code : bandpar=2
in your .param file would use 2-way band-parallelism.

Why a devel_code?
Band-parallelism and G-vector parallelism use the network differently: low-latency networks give good G-vector parallelism, while high-bandwidth networks give good band-parallelism. It is difficult for CASTEP to decide the best parallelisation strategy. There are also some limitations to the current implementation.

Limitations
Not all tasks are supported; the currently supported tasks are energy, geometryoptimisation and moleculardynamics. DFT+U is not supported. These limitations are temporary!

k-point parallelism is very efficient, but you eventually run out of k-points. G-vector parallelism is good, but becomes worse as you use more and more cores. Band-parallelism is available for many calculations; it is fairly good, but becomes worse with more and more cores.
Combining these parallelisms allows CASTEP to scale well to many cores, but careful choice of the number of cores can improve performance considerably. Know how many k-points you're using (-dryrun). For phonons, use the phonon_kpoints tool.
NB Path-integral MD has an extra level of parallelism, task-farming, which is very efficient.