How the Cloud High-Performance Computing Solution on Cloudam Helps Tsinghua AI Labs Train Large AI Models
Tsinghua University AI Labs (hereafter Tsinghua) is one of the best-known AI institutes in China, conducting research with state-of-the-art algorithms and some of the largest model sizes. Tsinghua University itself is widely regarded as the most renowned university in China and is usually ranked as the No. 1 Chinese university worldwide.
Challenges with AI Training On-premise
Tsinghua has its own AI cluster with around 1,000 GPU cards, fully connected by a 200 Gbps InfiniBand network, which must support the training jobs of all researchers and students. The cluster is managed with the SLURM job management system, and resources are allocated on a per-team basis.
Because the cluster is overloaded, each team can currently allocate at most 50 GPU cards, and the average pending time for an on-premise job ranges from 2 hours to 3 days. In addition, AI training often requires downloading large datasets from the internet, but the on-premise cluster is completely isolated from the internet for security reasons.
These are major challenges for the Tsinghua AI Labs teams. On top of them, the teams are working on cutting-edge AI models with 100 billion parameters, especially in the NLP area. A single training cycle typically needs 1,000 V100 GPU cards running for one month, which the existing on-premise cluster cannot provide.
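To put these figures in perspective, the rough arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate, not a measurement from Tsinghua: it assumes a 30-day month and perfectly linear scaling (a job on fewer GPUs simply takes proportionally longer), and the function names are illustrative.

```python
# Back-of-the-envelope estimate of the compute gap described above.
# Assumptions (not from the source): a 30-day month and perfectly
# linear scaling when the same job runs on fewer GPUs.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30


def gpu_hours(gpus: int, months: float) -> float:
    """Total GPU-hours consumed by `gpus` cards running for `months`."""
    return gpus * months * DAYS_PER_MONTH * HOURS_PER_DAY


def months_at_cap(total_gpu_hours: float, gpu_cap: int) -> float:
    """Months needed to deliver `total_gpu_hours` with only `gpu_cap` cards."""
    return total_gpu_hours / (gpu_cap * DAYS_PER_MONTH * HOURS_PER_DAY)


# One training cycle: 1,000 V100 cards for one month.
required = gpu_hours(gpus=1000, months=1)

# The same work under the 50-card per-team limit.
capped_months = months_at_cap(required, gpu_cap=50)

print(f"GPU-hours per training cycle: {required:,.0f}")
print(f"Months needed at a 50-GPU cap: {capped_months:.1f}")
```

Under these assumptions, one training cycle amounts to 720,000 GPU-hours, and a team limited to 50 cards would need about 20 months to complete it, before counting any queue time.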