How to run gene analysis HPC tasks on AWS in just a few steps, without complex software deployment

Background
Over the past three decades, the life sciences and computer science have both developed rapidly, and bioinformatics has emerged as a cutting-edge interdisciplinary field between them. The main driving force behind its rise is the growing use of high-throughput technologies, such as next-generation sequencing, in the life sciences. Genomics is a prime example of this trend: high-throughput next-generation sequencing (NGS) devices are used to sequence DNA, mRNA, regulatory regions, the gut microbiome, and more. Computational workflows that support dynamic scaling are also being rapidly developed and standardized. As large volumes of genomic data accumulate, processing time is often measured in billions of core hours, and processing cost grows accordingly. Customers are therefore looking for tools and systems that minimize both running time and cost.

There are usually two options. The first is to build an on-premises computing cluster, but a large local cluster is expensive, its capacity for peak loads is limited, the project cycle is long, and the upfront investment is substantial. The second is to build a cloud-HPC platform on cloud resources, which also provides fast access to the latest technologies, including the newest graphics cards and the latest generation of processors. By selecting the appropriate instance type, the overall computing time can be shortened.
In this blog, we will show how to use the Cloudam cloud-HPC platform to run gene analysis HPC tasks on AWS.
Overview
This guide focuses on starting a Slurm cluster based on Amazon EC2 from the Cloudam console. The cluster provides a login node on which you can quickly configure multiple gene analysis tasks by simply configuring Amazon S3 storage.
Prerequisites
Before starting to use Cloudam, you need to make the following preparations:
1. An AWS access key pair (AK/SK) with access to the specified S3 buckets.
2. An S3 bucket for storing the calculation input files.
3. An S3 bucket for storing the calculation result files; alternatively, you can reuse the input bucket and separate inputs and outputs with different directories (prefixes).
You can quickly create S3 buckets and upload input files through the AWS S3 console (or with the AWS CLI, as sketched after this list). If you already have buckets, you can skip this step. The console process involves only four steps:
1. Log in to the AWS S3 console.
2. Create an S3 bucket.
3. Set the bucket permissions. Private read/write is recommended, or use a bucket policy for finer-grained control.
Optionally, restrict access so that only a specified IAM role can reach resources in the bucket.
Reference: https://aws.amazon.com/cn/blogs/security/how-to-restrict-amazon-s3-bucket-access-to-a-specific-iam-role/
4. Upload input files.
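If you prefer the command line, the same setup can be done with the AWS CLI. Here is a minimal sketch, assuming the CLI is already configured; the bucket name genomics-cloudam-demo, the region, and the input/ prefix are placeholders to replace with your own values:

# Create a bucket in the region where the cluster will run
aws s3 mb s3://genomics-cloudam-demo --region us-east-1
# Upload the packed input files under an input/ prefix
aws s3 cp input.tar.gz s3://genomics-cloudam-demo/input/
# Verify the upload
aws s3 ls s3://genomics-cloudam-demo/input/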
Workflow
You only need to prepare the S3 buckets that store the input and output files; the internal scheduling details of the cluster are handled for you, which makes the platform very easy to use.

1) Register and log in to Cloudam
You may need to register a Cloudam account first. There is a $30 free trial for every new user, so don't miss it!
2) Create Workspace
A Workspace is a virtual region that Cloudam creates for you on AWS, corresponding to a specific AWS region; the AWS services and resources used later are configured in this region.
You need to upgrade your account to the Enterprise version (free of charge) before using this function. For more details, please contact us.
You should create the Workspace in the same region where your data is stored. With the same region, data transferred between Amazon EC2 and S3 travels over the AWS internal network, which is faster and more secure. You can check a bucket's region with the CLI command below.
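For example, assuming your bucket is named genomics-cloudam-demo (a placeholder), you can query its region with the AWS CLI:

aws s3api get-bucket-location --bucket genomics-cloudam-demo

Note that a null LocationConstraint in the output denotes us-east-1.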
Cloudam supports creating multiple Workspaces for free, which allows different R&D teams to use AWS resources in nearby regions.

3) Create and log in to the cluster login node
After logging in to Cloudam, select a Workspace that matches (or is adjacent to) the region where your data is stored; the EC2 clusters and other resources used later will be configured in this region.
You can log in directly in the browser via WebSSH, or connect to the cluster login node with an SSH client such as Xshell.
A built-in user named cloudam is available on the cluster login node and can be used to submit gene analysis jobs.
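For example, from a terminal with a plain SSH client; <login-node-address> is a placeholder for the address shown on the Cloudam console:

ssh cloudam@<login-node-address>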
4) Configure AWS AK / SK
The AWS CLI stores the sensitive credentials you enter with aws configure in a local file named credentials, inside the .aws folder in your home directory (~/.aws/credentials).
On the cluster login node, enter your AK/SK at the prompts to set up the configuration:
aws configure
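The prompts look like the following; the key values below are placeholders, so enter your own AK/SK and the region of your Workspace:

$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1
Default output format [None]: json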
Disclaimer: Cloudam will not access a user's EC2 instances without the user's authorization, nor will it obtain the user's data. Users sign an electronic legal agreement before using the service.
5) Prepare the job script
Cloudam comes with 300+ pre-installed software packages, so you can start an HPC job in just a few steps and IT staff no longer need to install and configure the software environment. If you need custom software on Cloudam, feel free to contact us via email or LiveChat.
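The pre-installed packages are exposed through Environment Modules, which the job script below assumes; you can typically browse and load them like this:

# List the available software modules
module avail
# Load a specific package, e.g. BLAST+
module add BLAST+/2.2.31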
Take the widely used sequence alignment software BLAST+ as an example:
Use the vim editor to write your job script:
vim job.sbatch
#!/bin/bash
#SBATCH --job-name=example     # job name
#SBATCH --partition=c-64-1     # hardware type: 64 cores, 64 GB memory
#SBATCH --ntasks=64            # number of tasks

# Download your input file from S3
aws s3 cp --quiet s3://genomics-cloudam/input.tar.gz /home/cloudam/
tar -zxvf input.tar.gz

# Load BLAST+
module add BLAST+/2.2.31

# Run the BLAST+ task; replace the placeholders with actual values
# (note that BLAST+ uses -query/-out, and the database is given via -db in <other-options>)
blastx -query <input-file> -out <output-file> <other-options> -num_threads <num-threads>

# Upload the result files to S3
tar -zcvf result.tar.gz /home/cloudam/result
aws s3 cp --quiet /home/cloudam/result.tar.gz s3://genomics-cloudam/
...
6) Submit a job
sbatch job.sbatch
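After submission, sbatch prints a job ID, and you can monitor the job with standard Slurm commands; <jobid> below is a placeholder:

# Show your queued and running jobs
squeue -u cloudam
# Follow the job's log file (Slurm writes slurm-<jobid>.out by default)
tail -f slurm-<jobid>.out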
After the job is done, post-processing is performed, the results are written to the S3 bucket, and the idle EC2 instances are destroyed so that billing stops immediately.
Congratulations! You have successfully run a gene analysis task on AWS through Cloudam. There are more functions and possibilities waiting for you to explore.
Comparison between Cloudam Platform and AWS ParallelCluster
AWS ParallelCluster vs. Cloudam

| | AWS ParallelCluster | Cloudam |
| --- | --- | --- |
| Data Security | Data is kept under the user's account | No retention of user data; data lands in the user's own account, so there is no data-security concern |
| Cost | Cost of EC2, network, and storage | On-demand cost of EC2 only |
| Functions | Only basic computing resources | Besides massive computing resources, Cloudam provides visual functions such as file transfer, image center, datasets, team collaboration, quota management, operation audit, billing reports, security management, and system management, along with dedicated technical support |
| Usability | Cluster O&M requires professional IT staff; all software must be installed and configured manually; all operations run on the command line | Easy configuration with no manual work from IT staff; 300+ software packages pre-installed; multiple submission methods: templates, command line, and desktop workstations |
Conclusion
Above is a hands-on tutorial on submitting gene analysis tasks on AWS using Cloudam. For more tutorials, head to the user manual and try more demos.
About Cloudam
Cloudam HPC is a one-stop HPC platform with 300+ applications pre-installed for immediate use. The system smartly schedules compute nodes and dynamically schedules software licenses, optimizing workflows and boosting efficiency for engineers and researchers in life sciences, AI/ML, CAE/CFD simulation, universities and colleges, and more.
Partnered with AWS, Azure, Google Cloud, Oracle Cloud, etc., Cloudam powers your R&D with massive cloud resources without queuing.
You can submit jobs through intuitive templates, SLURM, or Windows/Linux workstations. Whether you are a beginner or a professional, you will always find it handy to run and manage your jobs.
There is a $30 Free Trial for every new user. Why not register and boost your R&D NOW?