Alessandro Di Girolamo1, Federica Legger2, Panos Paparrigopoulos1, Jaroslava Schovancová1, Thomas Beermann3, Michael Boehler4, Daniele Bonacorsi5,6, Luca Clissa5,6, Leticia Decker de Sousa5,6, Tommaso Diotalevi5,6, Luca Giommi5,6, Maria Grigorieva7, Domenico Giordano1, David Hohn4, Tomáš Javůrek1, Stephane Jezequel8, Valentin Kuznetsov9, Mario Lassnig1, Vasilis Mageirakos1, Micol Olocco2, Siarhei Padolski10, Matteo Paltenghi1, Lorenzo Rinaldi5,6, Mayank Sharma1, Simone Rossi Tisbeni11, Nikodemas Tuckus12.
Abstract
As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals, by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to efficiently manage such heterogeneous infrastructures. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on "smart" solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.
Keywords: HL-LHC; ML; NLP; distributed computing operations; operational intelligence; resources optimization
Year: 2022 PMID: 35072060 PMCID: PMC8776639 DOI: 10.3389/fdata.2021.753409
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
FIGURE 1. (A) Time-to-start in minutes of HammerCloud functional test jobs used for auto-exclusion and re-inclusion into the ATLAS Grid WFMS. (B) Example histogram from 2020-11-24: number of job-shaping actions every 30 min (empty: increase; filled: decrease) on the parallel running test jobs.
FIGURE 2. Example of an error message cluster summary.
FIGURE 3. Time evolution of cluster 0: the plot shows the count of errors in bins of 10 min.
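The error-message clusters summarized in Figure 2 group together messages that differ only in details such as file names or timeout values. As a minimal illustrative sketch (not the project's actual pipeline, whose method the record does not specify), similar error strings can be grouped greedily by pairwise string similarity; the messages and the `cluster_errors` helper below are hypothetical:

```python
# Illustrative sketch: greedy grouping of similar error messages, one simple
# way to build the kind of error-message clusters summarized in Figure 2.
from difflib import SequenceMatcher

def cluster_errors(messages, threshold=0.8):
    """Assign each message to the first cluster whose representative
    (first) message is at least `threshold` similar to it."""
    clusters = []  # each cluster is a list of messages
    for msg in messages:
        for group in clusters:
            if SequenceMatcher(None, group[0], msg).ratio() >= threshold:
                group.append(msg)
                break
        else:
            clusters.append([msg])  # no match: start a new cluster
    return clusters

errors = [
    "Transfer failed: connection timed out after 60 s",
    "Transfer failed: connection timed out after 120 s",
    "Checksum mismatch for file A",
    "Checksum mismatch for file B",
]
print(cluster_errors(errors))  # two clusters: timeouts and checksum errors
```

Production systems typically replace the quadratic pairwise comparison with vectorization (e.g., TF-IDF) plus a scalable clustering algorithm, but the grouping idea is the same.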
List of OpInt projects deployed in production and under development, with their current status.
| Project | Status | Detail |
|---|---|---|
| Intelligent alert system | In production in one experiment; concept in development in another one | |
| Jobs Buster | In production in one experiment | |
| FTS log clustering | In production in one experiment; concept in development in another one | |
| HammerCloud job shaping | In production in one experiment | |
| Shared k8s cluster | Infrastructure deployed, wider adoption by two experiments in progress | |
| Cloud anomaly detection | Infrastructure and algorithms prototyped, commissioning in progress | |
| FTS anomaly detection | Prototype developed in one experiment; generalization and adoption by another experiment in progress | |
| Predictive site maintenance | Code in development | |
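Several of the projects above (cloud and FTS anomaly detection) flag unusual behavior in monitoring time series such as the binned error counts of Figure 3. As a hedged sketch only (the record does not describe the algorithms actually used), a simple baseline is to flag a bin whose count exceeds the rolling mean of the preceding bins by several standard deviations; the `spike_bins` function and the sample counts are illustrative assumptions:

```python
# Illustrative sketch: flagging anomalous spikes in per-interval error counts
# (e.g., the 10-minute bins of Figure 3) with a rolling mean/std threshold.
from statistics import mean, pstdev

def spike_bins(counts, window=6, z=3.0):
    """Return indices of bins whose count exceeds the mean of the previous
    `window` bins by more than `z` standard deviations."""
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        # Floor sigma at 1.0 so a flat baseline does not flag tiny fluctuations.
        if counts[i] > mu + z * max(sigma, 1.0):
            anomalies.append(i)
    return anomalies

counts = [4, 5, 3, 6, 5, 4, 5, 40, 5, 4]  # a burst of errors in bin 7
print(spike_bins(counts))  # → [7]
```

Real deployments usually layer more robust statistics (or learned models) on top of this idea, but the rolling-baseline comparison captures the core of threshold-based anomaly detection.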