
Boost Protein Structure Prediction in AlphaFold2 with Cloud HPC

AlphaFold2 is a groundbreaking AI system developed by DeepMind. In the protein structure prediction competition CASP14, whose results were announced on November 30, 2020, it predicted most target structures to within about one atom's width of the real structure. That accuracy is comparable to what humans obtain with sophisticated instruments such as cryo-electron microscopy, and it represents unprecedented progress in protein structure prediction.



The source code of AlphaFold2 is now publicly available on GitHub. Many scientists are using it to make high-throughput predictions on existing protein databases, building a database of AlphaFold2-predicted structures that covers all proteins of several model organisms (https://alphafold.ebi.ac.uk/).



Although AlphaFold2 predictions have been published for the proteomes of multiple species, they do not cover every protein database. Only by running AlphaFold2 yourself can you predict the structures you are interested in at any time.


This article describes two ways to use AlphaFold2. The first is to log in to Cloudam, where AlphaFold2 is pre-installed, and submit a protein structure prediction job in about a minute. The template-based submission is friendly to beginners who are not familiar with the command line. The second is to install AlphaFold2 locally. This requires more comprehensive Linux skills, and errors may occur from time to time.


Predict Protein Structure in AlphaFold2 with Cloudam


Video tutorial of how to run AlphaFold2 on cloud-HPC platform


1. Find AlphaFold2 on Cloudam


After logging in, find AlphaFold2 in the 'Applications' center. Click on it and then choose 'Submit'.



2. Select Visualization Template


The Visualization Template is highly recommended, since a job can be submitted with just a few clicks after setting a few parameters.



3. Submit a job with a few clicks


Upload a .fasta file. Select 'monomer' if the input is a single protein chain, and 'multimer' if it is a complex of multiple chains.
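If you are unsure which option applies, counting the records in the FASTA file is a quick check. The following is a minimal sketch, assuming that a file with one record is a single chain and that a file with several records describes one complex; `suggest_model_type` is a hypothetical helper and not a Cloudam or AlphaFold2 command.

```shell
# Sketch: suggest 'monomer' or 'multimer' from the number of FASTA records.
# suggest_model_type is a hypothetical helper name used only for illustration.
suggest_model_type() {
  local fasta="$1" n
  if [ ! -f "$fasta" ]; then
    echo "file not found: $fasta" >&2
    return 1
  fi
  # Every FASTA record starts with a '>' header line.
  # grep -c exits non-zero when the count is 0, so guard it with '|| true'.
  n=$(grep -c '^>' "$fasta" || true)
  if [ "$n" -eq 0 ]; then
    echo "no FASTA records found in $fasta" >&2
    return 1
  elif [ "$n" -eq 1 ]; then
    echo "monomer"
  else
    echo "multimer"
  fi
}
```

Calling `suggest_model_type my_protein.fasta` prints the suggested option. A file that holds several unrelated single-chain proteins would also be reported as 'multimer', so treat the result as a hint only.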



After setting up, you can choose the GPUs according to your needs.



You can submit the job after checking the summary of the job.



4. Check the job and its result


The job can be checked and monitored on 'My jobs'.



Predict Protein Structure in AlphaFold2 Locally


1. Hardware requirements

  • A disk with at least 3 TB of storage: the AlphaFold2 databases are about 428 GB to download and take up about 2.2 TB after decompression. Even with reduced_dbs (a reduced set of databases), at least 600 GB of storage is required.

  • 12 vCPUs

  • At least 85 GB of memory

  • One NVIDIA A100 or V100 GPU
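The disk, core, and memory figures above can be checked before starting the long download. This is a minimal sketch, assuming a Linux server with GNU coreutils and assuming that the figure 12 refers to CPU cores; `meets_minimum` and `check_host` are hypothetical helper names, not part of AlphaFold2.

```shell
# Sketch: compare measured resources against the minimums listed above.
# meets_minimum and check_host are hypothetical helpers for illustration.
meets_minimum() {
  local label="$1" have="$2" need="$3"
  if [ "$have" -ge "$need" ]; then
    echo "OK   $label: $have (need >= $need)"
  else
    echo "FAIL $label: $have (need >= $need)"
    return 1
  fi
}

# Thresholds taken from the list above (3 TB disk, 12 cores, 85 GB memory).
REQUIRED_DISK_GB=3000
REQUIRED_CPUS=12
REQUIRED_MEM_GB=85

check_host() {
  local target_dir="${1:-.}"
  local disk_gb cpus mem_gb
  # Free space, in GB, on the filesystem that holds target_dir (GNU df).
  disk_gb=$(df -BG --output=avail "$target_dir" | tail -n 1 | tr -dc '0-9')
  cpus=$(nproc)
  # /proc/meminfo reports MemTotal in kB.
  mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  meets_minimum "free disk (GB)" "$disk_gb" "$REQUIRED_DISK_GB"
  meets_minimum "CPU cores" "$cpus" "$REQUIRED_CPUS"
  meets_minimum "memory (GB)" "$mem_gb" "$REQUIRED_MEM_GB"
}
```

Running `check_host /path/to/download/dir` prints one OK or FAIL line per resource. The GPU is not covered here; it is verified by the Docker test in section 3.3.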


2. Download dataset, program and models


First, download AlphaFold2 from GitHub (https://github.com/deepmind/alphafold) to a local directory. Then change into the scripts folder and run the command download_all_data.sh <DOWNLOAD_DIR>; the databases and model parameters are downloaded automatically.


Since the download is well over 400 GB, it takes a long time. If the connection drops, you have to restart the download from the very beginning. It is therefore recommended to download the files on multiple computers in parallel rather than running the main script directly. Alternatively, you can fetch the files in advance with a download tool and then copy them to the server for decompression.


All files can be downloaded as archives except pdb_mmcif, because the PDB website does not provide a compressed mmCIF dataset. The entries are individual small files that have to be synced from the PDB server to the local server. It is recommended to run the individual download script for pdb_mmcif directly in the installation directory; otherwise copying and compressing the roughly 180 thousand cif files takes a long time and a lot of effort.
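Because an interrupted download has to be started again, it can help to wrap each individual download command in a retry loop so that a dropped connection does not go unnoticed overnight. The following is a minimal sketch; `retry` and the `RETRY_DELAY` variable are hypothetical names, and each retry simply re-runs the command from the start rather than resuming it.

```shell
# Sketch: re-run a command until it succeeds or the attempt limit is reached.
# retry and RETRY_DELAY are hypothetical names used only for illustration.
retry() {
  local attempts="$1" n=1
  shift
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    # Wait before the next attempt (seconds, default 30).
    sleep "${RETRY_DELAY:-30}"
  done
}
```

For example, assuming the pdb_mmcif script in the scripts folder is named download_pdb_mmcif.sh, `retry 3 bash scripts/download_pdb_mmcif.sh <DOWNLOAD_DIR>` would try the sync up to three times.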



After decompression, please check if the size and name of each file match the list in the above picture.


Note: the bfd and small_bfd folders are mutually exclusive, so keep only one of them. bfd is the comprehensive dataset, while small_bfd is a reduced one. If your disk storage is insufficient, choose the latter.
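A quick way to confirm that only one of the two folders is present is sketched below. It assumes both folders sit directly under the download directory; `bfd_variant` is a hypothetical helper name.

```shell
# Sketch: report which BFD folder exists under a download directory.
# Prints "bfd", "small_bfd", "both", or "none".
# bfd_variant is a hypothetical helper, not part of AlphaFold2.
bfd_variant() {
  local dir="$1" has_full=0 has_small=0
  if [ -d "$dir/bfd" ]; then has_full=1; fi
  if [ -d "$dir/small_bfd" ]; then has_small=1; fi
  case "$has_full$has_small" in
    10) echo "bfd" ;;
    01) echo "small_bfd" ;;
    11) echo "both" ;;
    *)  echo "none" ;;
  esac
}
```

If `bfd_variant <DOWNLOAD_DIR>` prints "both", remove one of the folders before running a prediction; if it prints "none", the download is incomplete.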


3. Install Docker and NVIDIA Container Toolkit


3.1 Install Docker


Please refer to the official tutorial from Docker: https://docs.docker.com/desktop/install/linux-install/


3.2 Install NVIDIA Container Toolkit


Please refer to the official tutorial from NVIDIA: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html


3.3 Run a test


Run the following command with root permission:


docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

If the command prints the GPU information, Docker and the NVIDIA Container Toolkit are installed successfully.



4. Run AlphaFold2


4.1 Configure the input/output directory


You need to configure the input and output directories first. Open the run_docker.py script in the docker folder, set DOWNLOAD_DIR to the directory that contains the downloaded databases, and set output_dir to the directory where the results should be written.


4.2 docker build


docker build -f docker/Dockerfile -t alphafold .


4.3 Install Python dependencies


If you use python3 and have pip3 on your server, run:


pip3 install -r docker/requirements.txt

4.4 Run AlphaFold2


python3 docker/run_docker.py --fasta_paths=<full path to the input sequence file> --max_template_date=2020-05-14 --preset=[reduced_dbs|full_dbs|casp14]


fasta_paths: the full path to the FASTA file that contains the protein sequence to predict


max_template_date: if the protein you are predicting is already in the PDB but you do not want its deposited structure to be used as a template, use this date to restrict which PDB entries can serve as templates. The date should be earlier than the release date of the protein's structure.


preset: the trade-off between running time and prediction quality. reduced_dbs is the fastest but gives the lowest quality; full_dbs offers medium speed and quality; casp14 gives the best quality but takes about 8 times as long as full_dbs.
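Since a run can take hours, it is worth checking the three arguments before launching. The sketch below validates them and only prints the resulting command; `build_af2_command` is a hypothetical helper, and the accepted preset names are the three listed above.

```shell
# Sketch: validate the arguments, then print (not run) the AlphaFold2 command.
# build_af2_command is a hypothetical helper used only for illustration.
build_af2_command() {
  local fasta="$1" max_date="$2" preset="$3"
  if [ ! -f "$fasta" ]; then
    echo "FASTA file not found: $fasta" >&2
    return 1
  fi
  # The date must look like YYYY-MM-DD (format check only, not a calendar check).
  case "$max_date" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "max_template_date must look like YYYY-MM-DD: $max_date" >&2
       return 1 ;;
  esac
  case "$preset" in
    reduced_dbs|full_dbs|casp14) ;;
    *) echo "unknown preset: $preset" >&2
       return 1 ;;
  esac
  printf 'python3 docker/run_docker.py --fasta_paths=%s --max_template_date=%s --preset=%s\n' \
    "$fasta" "$max_date" "$preset"
}
```

For example, `build_af2_command /data/query.fasta 2020-05-14 reduced_dbs` prints the full command line when all three checks pass (the path here is a placeholder), and you can then copy it into the terminal from the AlphaFold2 directory.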


4.5 Check the result


After the run finishes, a series of files is generated in your output_dir. Among them, ranked_0 to ranked_4 are the 5 predicted models, ordered by AlphaFold2's confidence score. ranked_0 is the best model; the larger the number, the lower the credibility.
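To confirm that all five ranked models were written, you can list them in rank order. This is a minimal sketch, assuming the models are stored as ranked_0.pdb to ranked_4.pdb directly inside the given folder; `list_ranked_models` is a hypothetical helper name.

```shell
# Sketch: print the ranked model files that exist, best model first.
# Assumes files named ranked_<n>.pdb; list_ranked_models is a hypothetical helper.
list_ranked_models() {
  local out_dir="$1" i f
  for i in 0 1 2 3 4; do
    f="$out_dir/ranked_$i.pdb"
    if [ -f "$f" ]; then
      echo "$f"
    else
      # Missing models are reported on stderr so stdout stays a clean file list.
      echo "missing: $f" >&2
    fi
  done
}
```

The first line printed by `list_ranked_models <output folder>` is the model to open first in a structure viewer.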


About Cloudam


Cloudam is a one-stop cloud-HPC platform with 300+ pre-installed applications ready to deploy immediately. The system smartly schedules compute nodes and dynamically schedules software licenses, optimizing workflows and boosting efficiency for engineers and researchers.


Partnered with AWS, Azure, Google Cloud, Oracle Cloud, etc., Cloudam powers your R&D with massive cloud resources without queuing.


You can submit jobs with intuitive templates, SLURM, and Windows/Linux workstations. Whether you are a beginner or a professional, you can always find it handy to run and manage your jobs.


There is a $30 Free Trial for every new user. Why not register and boost your R&D NOW?

