Authors - Seungmin Lee, Ju-Won Park

Abstract - The module-based static operating environment, widely used in domestic and international supercomputing centers, encounters numerous problems in supporting artificial intelligence / machine learning (AI/ML) parallel workloads because the variety of platforms and packages involved makes it difficult to build every required execution environment. To address these issues and dynamically provide diverse execution environments, container-based cloud technologies are being widely adopted in high-performance computing (HPC) cluster systems. However, container runtime toolkits commonly used in the HPC field, such as Shifter and Singularity, impose burdens of their own: image format conversion, writing scheduler job scripts, environment setup, and manual management of the container lifecycle. This study addresses these problems by using Kubernetes, the de facto standard for container orchestration, to support AI/ML parallel workloads in HPC environments. Supporting Kubernetes-native parallel workload execution offers several advantages. First, image conversion is unnecessary because Docker images are used directly. Second, human error is minimized because the operator automatically handles the environment setup required for parallel execution. Third, in the event of a failure, automatic recovery and re-execution are possible by leveraging Kubernetes' powerful container lifecycle management capabilities. In addition, this study introduces the distributed learning function of the KISTI Supercomputer web portal (MyKSC), which has been implemented using the proposed method.