Introduction to Parallelism in CASTEP Phil Hasnip, August 2016
Recap: what does CASTEP do? CASTEP solves the Kohn-Sham equations for electrons in a periodic array of nuclei: Ĥ_k[ρ] ψ_bk = E_bk ψ_bk where ψ_bk is the bth solution (band) at the Brillouin zone sampling point k, and Ĥ_k[ρ] = -ℏ²/2m ∇² + V̂_HXC[ρ] + V̂_ext.
Bloch's theorem and plane-waves Recall that Bloch's theorem lets us write: ψ_bk(r) = e^{ik·r} u_bk(r), where u_bk(r) is periodic and e^{ik·r} is an arbitrary phase factor. We express u_bk(r) as a Fourier series: u_bk(r) = Σ_G c_Gbk e^{iG·r} ⟹ ψ_bk(r) = e^{ik·r} Σ_G c_Gbk e^{iG·r} = Σ_G c_Gbk e^{i(G+k)·r} where the c_Gbk are complex Fourier coefficients, and the sum is over all the reciprocal lattice vectors, or G-vectors.
The wavefunction The wavefunction is one of the main data objects in CASTEP: ψ_bk(r) = Σ_G c_Gbk e^{i(G+k)·r} The complex coefficients c_Gbk are what CASTEP is trying to compute, and take up a lot of the computer's memory. G: a reciprocal lattice vector (G-vector). b: band index. k: a Brillouin zone sampling point (k-point).
k-point sampling The bands at different k-points are independent of each other, so we get a different set of Kohn-Sham equations at each: Ĥ_k[ρ] ψ_bk = E_bk ψ_bk where ρ(r) = Σ_bk |ψ_bk(r)|²
Where does CASTEP spend its time? Applying Ĥ_k to ψ_bk: the kinetic energy is applied in reciprocal space, the local potential is applied in real space, and we need to Fourier transform between the two spaces. Orthogonalisation of the ψ_bk: we need to ensure our trial bands are orthogonal to each other, so we compute the overlap matrix between all pairs of bands and invert it.
Fourier transforms A 3D Fourier transform can be performed as 3 separate 1D transforms, one in each direction (x, y and z). The time to transform ψ_bk(G) → ψ_bk(r) scales as N_G log(N_G). Every band at every k-point has to be transformed, so the total time scales as N_G log(N_G) N_b N_k.
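The decomposition into 1D transforms can be sketched with NumPy (an illustration of the separability, not CASTEP's actual FFT code):

```python
import numpy as np

# A 3D FFT performed as three passes of 1D FFTs, one along each axis.
# The result matches the direct 3D transform np.fft.fftn.
rng = np.random.default_rng(0)
psi = rng.standard_normal((4, 4, 4)) + 1j * rng.standard_normal((4, 4, 4))

step = np.fft.fft(psi, axis=2)   # 1D transforms along all z-columns
step = np.fft.fft(step, axis=1)  # 1D transforms along all y-columns
step = np.fft.fft(step, axis=0)  # 1D transforms along all x-columns

assert np.allclose(step, np.fft.fftn(psi))
```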
Orthogonalisation We construct the band-overlap matrix for each k-point, S_nmk = ⟨ψ_nk|ψ_mk⟩; the total time scales as N_G N_b² N_k. We invert S_k to find an orthogonalising transformation at each k-point; the total time scales as N_b³ N_k. We apply the transformation to get orthogonal bands; the total time scales as N_G N_b² N_k.
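A minimal sketch of these three steps, using a Löwdin-style S^(-1/2) as the orthogonalising transformation (an illustrative choice, not necessarily CASTEP's actual routine):

```python
import numpy as np

# Bands stored as columns of plane-wave coefficients: C[G, b].
rng = np.random.default_rng(1)
n_G, n_b = 50, 4
C = rng.standard_normal((n_G, n_b)) + 1j * rng.standard_normal((n_G, n_b))

S = C.conj().T @ C                 # overlap S_nm = <psi_n|psi_m>   (N_G N_b^2 work)
w, V = np.linalg.eigh(S)           # decompose S                     (N_b^3 work)
S_inv_sqrt = V @ np.diag(w**-0.5) @ V.conj().T
C_orth = C @ S_inv_sqrt            # apply the transformation        (N_G N_b^2 work)

# The transformed bands are orthonormal.
assert np.allclose(C_orth.conj().T @ C_orth, np.eye(n_b))
```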
Large calculations As we simulate larger and larger systems, N_G and N_b increase and N_k decreases. Time for Fourier transforms scales as N_G log(N_G) N_b N_k. Time for orthogonalisation scales as N_G N_b² N_k. Orthogonalisation dominates in large calculations.
As the simulation system gets bigger and bigger, the orthogonalisation time dominates. We want to be able to use more computer cores in our calculation to speed it up: how can we do this?
k-point parallelism Bands at different k-points are almost entirely independent of each other, so give each core a subset of the k-points; each core then solves a subset of the Kohn-Sham equations. Cores only communicate when constructing the density ρ(r) = Σ_bk |ψ_bk(r)|²
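The communication pattern can be sketched as follows (illustrative Python, not CASTEP code): each "core" builds a partial density from its own k-points, and only the final sum is shared between cores:

```python
import numpy as np

rng = np.random.default_rng(2)
n_k, n_b, n_grid = 8, 3, 16
psi = rng.standard_normal((n_k, n_b, n_grid))   # psi_bk(r) on a real-space grid

def partial_density(my_kpts):
    """Density contribution from one core's own k-points (no communication)."""
    return sum(np.abs(psi[k]) ** 2 for k in my_kpts).sum(axis=0)

n_cores = 4
owned = [range(c, n_k, n_cores) for c in range(n_cores)]   # k-points per core
rho = sum(partial_density(kpts) for kpts in owned)          # the one global sum

# The summed partial densities reproduce the full density.
assert np.allclose(rho, (np.abs(psi) ** 2).sum(axis=(0, 1)))
```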
TiN Benchmark The TiN simulation is a small standard benchmark 33 atoms 8 k-points 164 bands 10,972 G-vectors
k-point parallelism in action
k-points and large systems k-point parallelism is almost perfect, but as simulations get bigger, N_k gets smaller: the bigger the simulation, the fewer cores we can use!
G-vector parallelism S_nmk = ⟨ψ_nk|ψ_mk⟩ = Σ_G c*_Gnk c_Gmk Orthogonalisation is a sum over G-vectors, so give each core a subset of the G-vectors; the contributions to S are summed over the cores. N_G is large, so we can use lots of cores, and as the simulation size increases, N_G also increases.
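A sketch of the distributed sum (illustrative; in a real MPI code the final sum over cores would be an all-reduce):

```python
import numpy as np

# Each core holds a slice of the Fourier coefficients c_Gb and forms a
# partial overlap matrix from its own G-vectors only.
rng = np.random.default_rng(3)
n_G, n_b = 60, 4
C = rng.standard_normal((n_G, n_b)) + 1j * rng.standard_normal((n_G, n_b))

n_cores = 3
slices = np.array_split(np.arange(n_G), n_cores)   # G-vector subset per core
S = sum(C[s].conj().T @ C[s] for s in slices)      # sum of partial overlaps

# The summed partial matrices equal the full overlap matrix.
assert np.allclose(S, C.conj().T @ C)
```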
G-vector parallelism in action
G-vector parallelism Each core only has some of the G-vectors. Each core only has some of the real-space r-vectors. Fourier transform: all G-vectors contribute at all points in real-space.
G-vector parallelism 3D transform can be performed as 3 1D transforms Give each core all G-vectors in a column in z Each core does transform in z All cores swap data so they have y-column data Each core does transform in y All cores swap data so they have x-column data Each core does transform in x Each core ends up with real-space data in x
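The steps above can be sketched serially with NumPy, with plain axis swaps standing in for the parallel data transposes (illustrative only; in CASTEP each transpose is an all-to-all communication):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
data = rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))
# Axes are (x, y, z) = (0, 1, 2).

work = np.fft.fft(data, axis=2)          # 1D FFTs along the z-columns
work = work.transpose(0, 2, 1).copy()    # "transpose": swap so y-columns are contiguous
work = np.fft.fft(work, axis=2)          # 1D FFTs along the y-columns
work = work.transpose(2, 1, 0).copy()    # "transpose": swap so x-columns are contiguous
work = np.fft.fft(work, axis=2)          # 1D FFTs along the x-columns

# Undo the axis permutations (final axes are (y, z, x)) and compare with
# a direct 3D transform.
assert np.allclose(work.transpose(2, 0, 1), np.fft.fftn(data))
```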
G-vector parallelism Start: G-vectors inside cut-off sphere put on grid.
G-vector parallelism Now perform FFT in z-direction...
G-vector parallelism Transpose (swap) data into y-columns.
G-vector parallelism Now perform FFT in y-direction...
G-vector parallelism Transpose data into x-columns.
G-vector parallelism Now perform FFT in x-direction...
G-vector parallelism Now have real-space data in x-columns.
G-vector parallelism The actual transforms distribute well, but the transpositions are a problem: every core has to communicate with every other core, so the communication time scales as N_core². As N_core increases, the Fourier transform will dominate, and when the communication time is comparable to the compute time there is no point using more cores (it might even make CASTEP slower). There are ways to optimise the FFT, but the basic problem remains the same.
G-vector parallelism in action
k and G parallelism k-point and G-vector parallelism are independent, so we can combine both to improve scaling. E.g. if N_k = 2, N_G = 9,000 and N_core = 6:
Data                    k-point 1   k-point 2
G-vectors 1-3,000       core 1      core 4
G-vectors 3,001-6,000   core 2      core 5
G-vectors 6,001-9,000   core 3      core 6
For any k-point, the G-vector data is split across 3 cores: this is 3-way G-vector parallelism. For any subset of G-vectors, the k-point data is split between 2 cores: this is 2-way k-point parallelism.
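The mapping from cores to a (k-point, G-vector slice) pair in the example can be sketched as (0-based indices, illustrative only):

```python
def kg_assignment(core, n_gv_ways):
    """Map a flat (0-based) core index to (k-point group, G-vector group),
    laid out column-wise as in the table: cores 0-2 handle k-point 1,
    cores 3-5 handle k-point 2."""
    kpt_group = core // n_gv_ways   # which k-point this core works on
    gv_group = core % n_gv_ways     # which third of the G-vectors it holds
    return kpt_group, gv_group

# 6 cores = 2-way k-point parallelism x 3-way G-vector parallelism:
table = {core: kg_assignment(core, 3) for core in range(6)}
assert table[0] == (0, 0)   # core 1: k-point 1, G-vectors 1-3,000
assert table[3] == (1, 0)   # core 4: k-point 2, G-vectors 1-3,000
assert table[5] == (1, 2)   # core 6: k-point 2, G-vectors 6,001-9,000
```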
k+G parallelism in action
Al 2 O 3 Benchmark The Al 2 O 3 surface simulation is a larger standard benchmark 270 atoms 2 k-points 778 bands 88,184 G-vectors
k+G parallelism in action
Multi-core nodes Common to have multi-core processors May have several processors per node Each core on the same node can access shared memory (RAM) Communications can use this shared RAM instead of the network Access via the parameter: num_proc_in_smp : <integer>
k+G+SMP parallelism in action
Optimal performance It is always worth exploiting k-point parallelism when you can, but not all computers let you run on any number of cores. If you can't use N_core = N_k, then try to have a high common factor between them. E.g. if N_k = 35, N_core = 35 will give an excellent speed-up, but N_core = 5 or 7 will also be very efficient. N_core = 20 or 21 would use some G-vector parallelism, but also give good efficiency. (Note that 2-way G-vector parallelism is not very quick, so N_core = 10 or 14 might not be the best choices.)
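One way to see these trade-offs (an assumed strategy for illustration, not CASTEP's actual decision logic): take the k-point parallelism as the greatest common factor of N_core and N_k, and give the remaining factor of cores to G-vector parallelism:

```python
from math import gcd

def split_parallelism(n_cores, n_kpts):
    """Illustrative split: k-point parallelism = gcd(N_core, N_k),
    remaining cores go to G-vector parallelism."""
    kpar = gcd(n_cores, n_kpts)    # ways of k-point parallelism
    gpar = n_cores // kpar         # ways of G-vector parallelism
    return kpar, gpar

assert split_parallelism(35, 35) == (35, 1)  # pure k-point parallelism
assert split_parallelism(21, 35) == (7, 3)   # 7-way k, 3-way G: still good
assert split_parallelism(10, 35) == (5, 2)   # 2-way G: not very quick
```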
Optimal performance Remember that as you increase the G-vector parallelism, the communication time increases. Eventually your calculation will scale poorly, and if you keep increasing N_core it will even start to run slower. CASTEP always defaults to using as much k-point parallelism as it can, and then uses G-vector parallelism across any remaining cores.
Very large calculations and the Γ-point For isolated or very large simulation systems you only need 1 k-point. For well-isolated or extremely large simulations this k-point can be the special point k = (0,0,0), called the Γ-point. Why do we care?
Γ-point calculations Bands at Γ are real in real-space, not complex. The Fourier coefficients for -G are the complex conjugates of those at G, so we don't bother with -G; we only need half the G-vectors. Inner products are real, not complex, so we don't need to bother computing the imaginary parts. Bands take up only half as much memory, the FFT is about 2x faster, and the orthogonalisation is 8x faster. CASTEP detects if you're only using the Γ-point and uses these optimisations automatically.
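The c(-G) = c*(G) symmetry is easy to demonstrate with a 1D transform (illustrative; CASTEP exploits the same symmetry in 3D):

```python
import numpy as np

rng = np.random.default_rng(5)
u = rng.standard_normal(16)   # a real-valued band, as at the Gamma point
c = np.fft.fft(u)             # its Fourier coefficients c_G

# The coefficient at -G is the complex conjugate of the one at +G...
assert np.allclose(c[1:], np.conj(c[:0:-1]))

# ...so only half the coefficients need storing: rfft keeps just those.
half = np.fft.rfft(u)
assert np.allclose(half, c[: len(half)])
```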
Γ-point calculations No k-point parallelism (N_k = 1). The orthogonalisation speed-up is better than the FFT one, so Γ-point calculations can show poorer scaling. Worth using if Γ-point-only sampling is sufficiently accurate. Occasionally it is worth using a bigger simulation cell if that allows accurate Γ-point-only sampling.
Wavefunction revision Recall that the wavefunction is: ψ_bk(r) = Σ_G c_Gbk e^{i(G+k)·r} G: a reciprocal lattice vector (G-vector), distributed in G-vector parallelism. k: a Brillouin zone sampling point (k-point), distributed in k-point parallelism. b: band index; can we distribute bands? Yes!
Band parallelism The Hamiltonian is the same for all bands at the same k-point, and the Fourier transforms of different bands are independent, so we get perfect scaling with band-parallelism when applying the Hamiltonian.
Orthogonalisation We need to construct the overlap matrix S at each k-point, S_nm = ⟨ψ_n|ψ_m⟩. The inner products are between all pairs of bands, so we need all-to-all communications: as the band-parallelism increases, communication dominates.
Band distribution We distribute the bands in a round-robin fashion, e.g. if N b = 11 and N core = 3: Core Bands 1 1,4,7,10 2 2,5,8,11 3 3,6,9 For the FFTs each core just transforms its own bands.
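The round-robin distribution in the table can be sketched as:

```python
def bands_for_core(core, n_cores, n_bands):
    """1-based bands owned by a 1-based core under round-robin distribution."""
    return list(range(core, n_bands + 1, n_cores))

# The N_b = 11, N_core = 3 example from the table:
assert bands_for_core(1, 3, 11) == [1, 4, 7, 10]
assert bands_for_core(2, 3, 11) == [2, 5, 8, 11]
assert bands_for_core(3, 3, 11) == [3, 6, 9]
```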
Band parallelism performance The FFTs scale perfectly, but the orthogonalisation requires all-to-all communications amongst the cores: the communication time scales as N_core², so communications dominate as N_core increases.
k, G and B parallelism k-point, G-vector and band-parallelism are independent Can combine all three to improve scaling Define: kp-group: group of cores with same G-vectors and bands, but different k-points gv-group: group of cores with same k-points and bands, but different G-vectors bnd-group: group of cores with same G-vectors and k-points, but different bands
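One way to sketch the three-way decomposition (an illustrative ordering of the groups; CASTEP's actual layout may differ):

```python
def core_groups(core, n_g, n_b):
    """Map a flat core index to its (kp-group, gv-group, bnd-group) indices
    for n_g-way G-vector and n_b-way band parallelism."""
    kp, rest = divmod(core, n_g * n_b)
    gv, bnd = divmod(rest, n_b)
    return kp, gv, bnd

# 12 cores = 2-way k-point x 3-way G-vector x 2-way band parallelism:
groups = [core_groups(c, 3, 2) for c in range(12)]
assert groups[0] == (0, 0, 0)
assert groups[11] == (1, 2, 1)
assert len(set(groups)) == 12   # every core gets a unique (kp, gv, bnd) triple
```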
k+G+SMP+band parallelism in action
CASTEP performance Everything scales well with k-point parallelism. As the number of cores in the gv-group increases, the communication time in the FFT dominates; as the number of cores in the bnd-group increases, the communication time in the orthogonalisation dominates. We need to find the right balance between gv- and bnd-parallelism.
Using band-parallelism Relatively new functionality. Will compute ground-state energies, forces and stresses. Accessed via a devel_code string, e.g.: devel_code : bandpar=2 in your .param file would use 2-way band-parallelism.
Why a devel_code? Band-parallelism and G-vector parallelism use the network differently: low-latency networks give good G-vector parallelism, high-bandwidth networks give good band-parallelism. It is difficult for CASTEP to decide the best parallelisation strategy, and there are some limitations to the current implementation.
Limitations Not all tasks are supported; band-parallelism currently supports the tasks: energy, geometry optimisation and molecular dynamics. DFT+U is not supported. These limitations are temporary!
Summary k-point parallelism is very efficient, but you eventually run out of k-points. G-vector parallelism is good, but becomes worse as you use more and more cores. Band-parallelism is available for many calculations; it is fairly good, but becomes worse with more and more cores. Combining these parallelisms allows CASTEP to scale well to many cores, but careful choice of the number of cores can improve performance considerably: know how many k-points you're using (-dryrun); for phonons, use the phonon_kpoints tool. NB Path-integral MD has an extra level of parallelism (task-farming) which is very efficient.