Mohammed Tanash1, Daniel Andresen1, Huichen Yang1, William Hsu1.
Abstract
In this paper, we present a novel methodology for predicting the resources (memory and time) required by jobs submitted to HPC systems. Our methodology applies supervised machine learning to historical job data (sacct accounting data) provided by the Slurm workload manager. The resulting machine learning (ML) prediction model is useful for both HPC administrators and HPC users. Moreover, our ML model increases the efficiency and utilization of HPC systems and thereby reduces power consumption. The model uses several supervised discriminative machine learning models from the scikit-learn library, together with LightGBM, applied to historical data from Slurm. It helps HPC users determine the amount of resources their submitted jobs require and makes it easier for them to use HPC resources efficiently. This work is the second step towards implementing our general open-source tool for HPC service providers. For this work, our ML model was implemented and tested using two HPC providers: an XSEDE service provider, the University of Colorado Boulder (RMACC Summit), and Kansas State University (Beocat). We used more than two hundred thousand jobs, one hundred thousand from RMACC Summit and one hundred thousand from Beocat, to build and assess our ML model. In particular, we measured the improvement in running time, turnaround time, and average waiting time for the submitted jobs, and we measured the utilization of the HPC clusters. Our model achieved up to 86% accuracy in predicting the amount of time and the amount of memory for both RMACC Summit and Beocat. Our results show that our model dramatically reduces average waiting time (from 380 to 4 hours on RMACC Summit and from 662 to 28 hours on Beocat) and turnaround time (from 403 to 6 hours on RMACC Summit and from 673 to 35 hours on Beocat), and achieves up to 100% utilization on both HPC resources.
Keywords: HPC; Machine Learning; Performance; Scheduling; Slurm
Year: 2021 PMID: 35373221 PMCID: PMC8974354 DOI: 10.1145/3437359.3465574
Source DB: PubMed Journal: Pract Exp Adv Res Comput (2021)
Selected Features
| Feature | Type | Description |
|---|---|---|
| Account | Text | Account the job ran under. |
| ReqMem | Text | Minimum required memory for the job (in MB per CPU or MB per node). |
| Timelimit | Text | Timelimit set for the job in [DD-[HH:]]MM:SS format. |
| ReqNodes | Numeric | Requested minimum node count. |
| ReqCPUS | Numeric | Number of requested CPUs. |
| QOS | Text | Name of Quality of Service. |
| Partition | Text | The partition on which the job ran. |
| MaxRSS | Numeric | Maximum resident set size of all tasks in job (in MB). |
| CPUTimeRAW | Numeric | Time used (Elapsed time * CPU count) by a job (in seconds). |
| State | Text | The job status. |
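Several of these features are stored as text and have to be converted to numbers before a regression model can use them. As a minimal sketch, the function below turns a Timelimit string in the `[DD-[HH:]]MM:SS` format listed in the table into seconds. The name `timelimit_to_seconds` is hypothetical, accepting `HH:MM:SS` without a day prefix is an assumption, and special values such as `UNLIMITED` are not handled.

```python
def timelimit_to_seconds(text):
    """Convert a '[DD-[HH:]]MM:SS' time limit string to seconds.

    Hypothetical helper: also accepts 'HH:MM:SS' without a day prefix
    (an assumption) and raises ValueError on any other layout.
    """
    days = 0
    rest = text.strip()
    # An optional 'DD-' prefix carries the number of days.
    if "-" in rest:
        day_part, rest = rest.split("-", 1)
        days = int(day_part)
    fields = [int(field) for field in rest.split(":")]
    if len(fields) == 2:      # MM:SS
        hours = 0
        minutes, seconds = fields
    elif len(fields) == 3:    # HH:MM:SS
        hours, minutes, seconds = fields
    else:
        raise ValueError(f"unrecognized time limit: {text!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for example in ("30:00", "02:30:00", "1-00:00:00"):
        print(example, "->", timelimit_to_seconds(example), "seconds")
```

The same pattern (split, convert, scale to one unit) would apply to ReqMem, whose unit and per-CPU or per-node suffix depend on how the cluster reports it.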
MARM(Y, X, acc, num_acc m)
| 1 | unacc = unique( |
| 2 | acc_pool = {} |
| 3 | Repeat |
| 4 | |
| 5 | tac = append(acc_pool, i), if not |
| 6 | indices = which |
| 7 | |
| 8 | Repeat 20 times: |
| 9 | Split |
| 10 | RM = Build model using |
| 11 | Calculate training and testing R2 of RM. |
| 12 | |
| 13 | |
| 14 | best_aid = Choose |
| 15 | best_r2_tr = |
| 16 | best_r2_te = |
| 17 | acc_pool = append(acc_pool, best_aid) |
| 18 | Return best_r2_tr, best_r2_te, acc_pool |
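Several steps of the listing above are not legible here, so the following is only a sketch of one plausible reading: a greedy forward selection that, in each round, adds the account whose inclusion gives the best average test R2 over 20 random splits. The one-feature least-squares model, the 70/30 split, the minimum sample count, and every function name are assumptions made for illustration; this is not the authors' implementation.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination; 0.0 when y_true has no variance."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (assumed stand-in for the real regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    b = my - a * mx
    return lambda x: a * x + b


def evaluate(xs, ys, repeats=20, test_frac=0.3, seed=0):
    """Average training and testing R2 over repeated random splits."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    tr_scores, te_scores = [], []
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))
        te, tr = idx[:cut], idx[cut:]
        model = fit_line([xs[i] for i in tr], [ys[i] for i in tr])
        tr_scores.append(r2_score([ys[i] for i in tr], [model(xs[i]) for i in tr]))
        te_scores.append(r2_score([ys[i] for i in te], [model(xs[i]) for i in te]))
    return sum(tr_scores) / repeats, sum(te_scores) / repeats


def greedy_account_selection(y, x, acc, m):
    """Grow a pool of up to m accounts, adding the best-scoring one per round."""
    pool, best_tr, best_te = [], None, None
    for _ in range(m):
        results = {}
        for cand in sorted(set(acc) - set(pool)):
            chosen = set(pool) | {cand}
            idx = [i for i, a in enumerate(acc) if a in chosen]
            if len(idx) < 4:  # too few jobs to split (assumed threshold)
                continue
            results[cand] = evaluate([x[i] for i in idx], [y[i] for i in idx])
        if not results:
            break
        best = max(results, key=lambda c: results[c][1])  # highest test R2
        best_tr, best_te = results[best]
        pool.append(best)
    return best_tr, best_te, pool
```

Calling `greedy_account_selection(y, x, acc, m)` with parallel lists of targets, a single numeric feature, and account labels returns the training R2, the testing R2, and the selected pool, mirroring the three values returned in step 18.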
Figure 1: R2, RMSE, and runtime of seven methods across 50 accounts in RMACC Summit and 20 accounts in Beocat.
Figure 2: R2 versus number of accounts in predicting memory and time using MARM across Beocat and RMACC Summit.
Figure 3: Job submission and running time (Requested vs Actual vs Predicted) for RMACC Summit jobs. Note the dramatic improvement in Y-axis range between graphs.
Figure 6: Job submission and running time (Requested vs Actual vs Predicted) for Beocat jobs. Note the dramatic improvement in Y-axis range between graphs.
Figure 4: Utilization (Requested vs Actual vs Predicted) for RMACC Summit jobs.
Figure 7: Utilization (Requested vs Actual vs Predicted) for Beocat jobs.
Figure 5: Backfill-Sched performance for RMACC Summit jobs.
Average Waiting and Turnaround Time (Requested vs Actual vs Predicted) For RMACC Summit
| | Avg Wait Time (Hour) | Avg TA Time (Hour) | Median Wait Time (Hour) | Median TA Time (Hour) |
|---|---|---|---|---|
| Requested | 380.6 ±241.2 | 403.14 ±243.3 | 401.2 | 425.7 |
| Actual | 1.3 ±0.7 | 2.9 ±3.2 | 0.5 | 1.1 |
| Predicted | 3.7 ±1.1 | 5.5 ±4.8 | 1.3 | 4.5 |
Average Waiting and Turnaround Time (Requested vs Actual vs Predicted) For Beocat
| | Avg Wait Time (Hour) | Avg TA Time (Hour) | Median Wait Time (Hour) | Median TA Time (Hour) |
|---|---|---|---|---|
| Requested | 662.9 ±193.6 | 673.5 ±196.6 | 681.6 | 652.2 |
| Actual | 1.5 ±1.1 | 4.1 ±2.2 | 0.9 | 3.2 |
| Predicted | 27.7 ±25.3 | 34.8 ±27.1 | 6.2 | 13.9 |
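To put the two tables on a common scale, the short sketch below computes the relative reduction in average waiting and turnaround time when jobs use predicted rather than requested resources. The input numbers are copied from the Requested and Predicted rows above; the helper name `percent_reduction` and the dictionary layout are illustrative choices only.

```python
# (requested, predicted) average hours, copied from the two tables above.
averages = {
    ("RMACC Summit", "wait"): (380.6, 3.7),
    ("RMACC Summit", "turnaround"): (403.14, 5.5),
    ("Beocat", "wait"): (662.9, 27.7),
    ("Beocat", "turnaround"): (673.5, 34.8),
}


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


reductions = {
    key: round(percent_reduction(requested, predicted), 1)
    for key, (requested, predicted) in averages.items()
}

if __name__ == "__main__":
    for (cluster, metric), pct in reductions.items():
        print(f"{cluster} average {metric} time: {pct}% lower than requested")
```

In both tables the Predicted averages remain above the Actual row, which reflects runs sized with the resources the jobs really used.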