A history-based estimation for LHCb job requirements
Nathalie Rauschmayr, on behalf of LHCb Computing
13th April 2015
Introduction
How long will a job run, and how much memory might it need? The Production Manager has to guess:
- Underestimation: the job is killed and the whole job is lost
- Overestimation: what to do with the remaining time of the slot?
Introduction
- Well-studied problem in High Performance Computing
- Some recent studies in HEP: CMS, WLCG multicore task force
- Multicore jobs: good runtime estimates allow better scheduling
Introduction
A lot of metadata from past jobs is available in the LHCb bookkeeping:
- Hardware-specific: CPU model, HS06, cache size, RAM size
- Job-specific: start/end time, site, worker node, memory footprint, runtime
- Input-file-specific: size, number of events, run number (LHCb RunDB), trigger configuration key, avg. multiplicity, avg. luminosity, ...
Our Proposal
- Automate the prediction procedure, based on prior jobs and the given job metadata (supervised learning)
- Reduce false estimates
- Simplify the Production Manager's life
Workload Analysis - Reconstruction
Normalized CPU time per event:

    CPU-work per event [HS06.s] = CPUTime * HEPSPECValue / NumberOfEvents

It approximates a Gaussian distribution.
(Histogram: Reprocessing 2011, CPU-work per event in HS06.s; maximum-likelihood estimate 11.71 HS06.s, with -sigma/+sigma band.)
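The normalization and the maximum-likelihood fit can be sketched as follows. This is a minimal illustration, not production code: the record field names (`cpu_time_s`, `hs06`, `n_events`) are invented for the example and are not the real bookkeeping schema.

```python
import math

def cpu_work_per_event(cpu_time_s, hs06, n_events):
    # CPU seconds scaled by the worker node's HS06 rating, per event -> HS06.s
    return cpu_time_s * hs06 / n_events

def gaussian_mle(samples):
    # Maximum-likelihood estimate of a Gaussian: sample mean and (biased) std
    n = len(samples)
    mu = sum(samples) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / n)
    return mu, sigma

# Toy bookkeeping records (illustrative values only)
jobs = [
    {"cpu_time_s": 36000, "hs06": 11.0, "n_events": 34000},
    {"cpu_time_s": 40000, "hs06": 10.0, "n_events": 33000},
    {"cpu_time_s": 30000, "hs06": 13.0, "n_events": 35000},
]
work = [cpu_work_per_event(j["cpu_time_s"], j["hs06"], j["n_events"]) for j in jobs]
mu, sigma = gaussian_mle(work)
```

The mean `mu` plays the role of the naive (MLE) estimator used in the comparisons below: every future job of the same type is predicted to need `mu` HS06.s per event.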
Workload Analysis - Reconstruction
Comparison of the 2011 and 2012 workloads:
(Histograms of CPU-work per event in HS06.s: Reprocessing 2011, maximum likelihood 11.71 HS06.s; Reprocessing 2012, maximum likelihood 17.91 HS06.s.)
Workload Analysis - Stripping
Stripping: sort reconstructed events into different output streams.
(Histograms of CPU-work per event in HS06.s: Reprocessing 2011, maximum likelihood 5.87 HS06.s; Reprocessing 2012, maximum likelihood 5.26 HS06.s.)
Detect Most Important Features
Certain well-known correlations:
- Beam energy versus event size
- Pileup versus complexity of reconstruction
Example: Reconstruction (2011 versus 2012)
Detect Most Important Features
Avoid overfitting. Linear regression:

    runtime per event = Θ0 + Θ1 x1 + Θ2 x2 + Θ3 x3 + Θ4 x4 + Θ5 x5 + Θ6 x6

Procedure: normalize the features, find the best Θ, remove small Θ, evaluate the RMSE.

Reconstruction jobs:

    Feature             Best Θ    After removing small Θ
    File Size            0.80      1.05
    Avg. Event Size      0.19      (removed)
    HEPSPEC             -0.97     -0.97
    Number Of Events    -1.33     -1.55
    Avg. Luminosity      0.67      0.59
    Avg. Multiplicity   -0.08      (removed)

RMSE: 2.16, i.e. 22% better than a naive estimator such as the MLE.
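The normalize / fit / prune / evaluate pipeline can be sketched with ordinary least squares. The data here is synthetic (random stand-ins for the six features), and the pruning threshold of 0.1 is an assumption for the example; the slides do not state the cutoff used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the six features (file size, avg. event size, HS06,
# number of events, avg. luminosity, avg. multiplicity) and runtime per event
X = rng.normal(size=(500, 6))
true_theta = np.array([1.0, 0.0, -1.0, -1.5, 0.6, 0.0])
y = 12.0 + X @ true_theta + rng.normal(scale=0.5, size=500)

# 1. Normalize each feature to zero mean and unit variance
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Find the best theta by least squares (intercept column prepended)
A = np.column_stack([np.ones(len(Xn)), Xn])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# 3. Remove small theta and refit on the remaining features
keep = np.abs(theta[1:]) > 0.1
A2 = np.column_stack([np.ones(len(Xn)), Xn[:, keep]])
theta2, *_ = np.linalg.lstsq(A2, y, rcond=None)

# 4. Evaluate the RMSE of the reduced model
pred = A2 @ theta2
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
```

In practice the evaluation should of course be done on held-out jobs rather than on the training set, which is what the rolling scheme on the next slides provides.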
Detect Most Important Features
Stripping jobs:

    Feature             Best Θ    After removing small Θ
    File Size           -0.19     -0.18
    Avg. Event Size      0.70      0.62
    HEPSPEC              0.19      0.19
    Number Of Events     0.23      0.27
    Avg. Luminosity     -0.08      (removed)
    Avg. Multiplicity    0.003     (removed)

RMSE: 0.54, i.e. 25% better than a naive estimator such as the MLE.
Supervised Learning
Supervised learning assumes that labelled training data exists. What if only a small amount of training data is available?
Supervised Learning
1. Find similar jobs which have already run (initial training data)
2. Predict requirements for the next k jobs using either estimator (MLE/LR)
3. When the k jobs have finished, update the prediction formula with the newly obtained results
4. Repeat steps 2 and 3 until all jobs have finished
The prediction error is accumulated over successive batches of k jobs.
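The steps above can be sketched as a batch-wise predict-then-update loop. This minimal version uses the naive running-mean (MLE) estimator; substituting a regression refit at the marked step would give the LR variant.

```python
def batched_prediction(work_per_event, k=100):
    """Predict each batch of k jobs from the jobs seen so far, then update."""
    errors = []
    training = list(work_per_event[:k])  # step 1: similar jobs already run
    i = k
    while i < len(work_per_event):
        # step 2: predict the next k jobs (running mean = MLE;
        # an LR variant would refit the regression here instead)
        prediction = sum(training) / len(training)
        batch = work_per_event[i:i + k]
        errors.extend(prediction - real for real in batch)
        training.extend(batch)           # step 3: update with the new results
        i += k                           # step 4: repeat until all jobs done
    return errors
```

For example, `batched_prediction([1.0] * 4 + [3.0] * 4, k=4)` trains on the four jobs costing 1.0, predicts 1.0 for the next batch, and returns an error of -2.0 for each of the four jobs that actually cost 3.0.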
Supervised Learning
Root mean squared error:

    RMSE = sqrt( (1/n) * sum_{i=1}^{n} (difference_i)^2 )

where difference_i is the predicted minus the real value.
(Plots: accumulated RMSE and per-batch RMSE over job number, MLE versus LR, updating after every 100 jobs.)
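Written out in code, the error metric above is just (a small helper, assuming equal-length sequences of predicted and real values):

```python
import math

def rmse(predicted, real):
    # Root mean squared error: sqrt of the mean squared difference
    diffs = [p - r for p, r in zip(predicted, real)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```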
Conclusion
- Historical data can help us predict the requirements of future jobs
- Certain metadata are strongly correlated with runtime
- Predictions improved by up to 25%
- Both models (naive estimator and linear regression) can easily be implemented in production and support the Production Manager's work
Questions?
Backup Slides
Trigger configuration changes over different runs.
Backup Slides
Stripping jobs: memory and runtime per event.
(Plots: accumulated error over job number, MLE versus LR, updating after every 100 jobs.)
Backup Slides
Reconstruction: memory footprint.
(Histograms of memory in kB: Reprocessing 2011, maximum likelihood 1.76 GB; Reprocessing 2012, maximum likelihood 1.6 GB.)
Backup Slides
Stripping: memory footprint. Change from POOL to ROOT persistency between the two reprocessings.
(Histograms of memory in kB: Reprocessing 2011, maximum likelihood 3.4 GB; Reprocessing 2012, maximum likelihood 2.2 GB.)