How Cloudam's Cloud High-Performance Computing Solution Helps Tsinghua AI Labs Train Large AI Models
Tsinghua University AI Labs (hereafter Tsinghua) is one of the most prominent AI research institutes in China, conducting research with state-of-the-art algorithms at some of the largest model scales. Tsinghua University itself is among the most renowned universities in China and is frequently ranked as the country's top university worldwide.
Challenges with AI Training On-premise
Tsinghua operates an on-premise AI cluster of around 1,000 GPU cards, fully connected with a 200Gbps InfiniBand network, which must support the training jobs of all researchers and students. The cluster is managed with the SLURM job scheduler, and resources are allocated on a per-team basis.
Because the cluster is overloaded, each team can currently allocate at most 50 GPU cards, and average job pending time ranges from 2 hours to 3 days. In addition, AI training often requires downloading large datasets from the internet, but the on-premise cluster is completely isolated from the internet for security reasons.
These constraints alone pose major challenges for the Tsinghua AI Labs teams. On top of them, the teams are working on cutting-edge AI models, particularly in the NLP area, with 100 billion parameters. A single training cycle would typically require 1,000 V100 GPU cards running for one month, which the existing on-premise cluster simply cannot provide.
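Training jobs on the cluster are submitted to SLURM as batch scripts. The sketch below shows what a multi-node GPU submission of this kind typically looks like; the partition name, GPU type string, node counts, and training command are illustrative assumptions, not Tsinghua's actual configuration.

```shell
# Write a hypothetical SLURM batch script for a multi-node GPU training job.
# All resource names (partition, gres type, script paths) are placeholders.
cat > train_nlp.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=nlp-train
#SBATCH --partition=gpu            # placeholder partition name
#SBATCH --nodes=4                  # 4 nodes x 8 GPUs = 32 GPUs total
#SBATCH --gres=gpu:v100:8          # request 8 V100 cards per node
#SBATCH --time=7-00:00:00          # 7-day wall-clock limit
srun python train.py --config config.yaml
EOF
echo "Submit with: sbatch train_nlp.sbatch"
```

With only 50 cards available per team, a job like this queues for hours or days on-premise; the same script runs unchanged on the cloud cluster described below, since both use SLURM.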
Committing to Cloudam
To address these challenges, Tsinghua chose Cloudam HPC for both AI training at scale and general HPC services. Cloudam helped Tsinghua set up a fully elastic AI cluster on public clouds with the following features:
1. Large AI Cluster
The Cloudam cluster can spin up to 3,000 V100 and A100 GPU cards;
The cluster uses SLURM as the job scheduler, the same as the on-premise cluster;
The cluster is fully managed by the Cloudam team, so the Tsinghua AI teams can focus entirely on training.
2. Automated scheduling
Schedules the latest hardware, including NVIDIA A100 and V100 GPUs;
Zero queuing during peak demand;
Automatically terminates nodes once they have no running jobs, so billing is always pay-as-training-goes.
3. Cross-regional cooperation
The NLP model requires cross-regional collaboration with scientists worldwide. Because the cluster is on the public cloud, collaborators can easily work together simply by accessing the cloud over the internet.
Reduce Dataset Access Time
The NLP model uses datasets from the internet totaling 600TB, all unstructured files collected from different websites.
To reduce the download time for these datasets, Cloudam used object storage to mirror them, then downloaded them into the cluster over the cloud's private network. The mirroring procedure took about one week; thanks to the 100Gbps private network, the subsequent download into the cluster took another week, 5 times faster than the Tsinghua team's most optimistic estimate.
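The two-stage transfer described above can be sketched as follows, assuming an S3-compatible object store and the standard AWS CLI `s3 sync` command; the bucket names and filesystem paths are hypothetical. The script is written to a file here rather than executed, since running it requires cloud credentials.

```shell
# Hypothetical two-stage dataset mirror: first pull the internet-sourced
# datasets into object storage, then sync them to the cluster's shared
# filesystem over the cloud's private 100Gbps network.
# Bucket and path names are placeholders.
cat > mirror_datasets.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# Stage 1: mirror the public corpus into the lab's object storage bucket
aws s3 sync s3://public-corpus-mirror/nlp/ s3://lab-datasets/nlp/
# Stage 2: download from object storage to the cluster over the private network
aws s3 sync s3://lab-datasets/nlp/ /shared/datasets/nlp/
EOF
chmod +x mirror_datasets.sh
```

Splitting the transfer this way means the slow internet-facing fetch happens only once into object storage, while the cluster-facing copy runs entirely on high-bandwidth private links.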
Impact: Rethinking AI Training Infrastructure
This work left a strong impression on the Tsinghua team. Afterward, the AI Labs' executive team reviewed the results and rethought its AI infrastructure strategy, weighing on-premise against on-cloud. Eventually, they decided to invest more in cloud AI infrastructure rather than in a static on-premise local cluster.
About Cloudam HPC
Cloudam HPC is a one-stop HPC platform with 300+ pre-installed applications ready to deploy immediately. The system can smartly schedule compute nodes and dynamically schedule software licenses, optimizing workflows and boosting efficiency for engineers and researchers.
Partnered with AWS, Azure, Google Cloud, Oracle Cloud, etc., Cloudam powers your R&D with massive cloud resources without queuing.
You can submit jobs through intuitive templates, SLURM, or Windows/Linux workstations. Whether you are a beginner or a professional, you will find it easy to run and manage your jobs.
There is a $30 Free Trial for every new user. Why not register and boost your R&D NOW?