
Run Gene Analysis HPC Task on AWS Using Cloudam

How can you run gene analysis HPC tasks on AWS in just a few steps, without complex software deployment?


Background


Over the past three decades, both the life sciences and computer science have developed rapidly, and bioinformatics has emerged as a cutting-edge interdisciplinary field between them. The main driving force behind its emergence and rapid growth is the increasing application of high-throughput technologies such as next-generation sequencing in the life sciences. Genomics is a prime example of this trend: high-throughput next-generation sequencing (NGS) devices are used to sequence DNA, mRNA, regulatory regions, the gut microbiome, and more. Computational workflows that support dynamic scaling are also being rapidly developed and standardized.

As large volumes of genomic data accumulate, processing time is often on the order of billions of core hours, and processing cost grows accordingly. Customers are therefore looking for tools and systems that minimize both running time and cost. There are usually two ways to obtain the required compute. The first is to build a local computing cluster; however, a large on-premises cluster is expensive, its peak-load capacity is limited, the project cycle is relatively long, and the upfront investment is large. The second is to build a cloud-HPC platform on cloud resources, which also provides fast access to the latest technologies, including the newest graphics cards and processor generations that reduce the time required for computing. By selecting the appropriate instance type, the overall computing time can be shortened.


In this blog, we will show how to use the Cloudam cloud-HPC platform to run gene analysis HPC tasks on AWS.


Overview


This guide focuses on starting a Slurm cluster based on Amazon EC2 from the Cloudam console. The cluster provides a login node from which you can quickly configure and submit multiple gene analysis tasks, using Amazon S3 as the storage backend.


Prerequisites


Before starting to use Cloudam, you need to make the following preparations:

1. A pair of AWS access keys (AK/SK) with access to the specified S3 bucket.

2. An S3 bucket for storing calculation input files.

3. An S3 bucket for storing calculation result files; alternatively, you can separate results from inputs by using different directories (prefixes) within the input bucket.
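As a sketch of item 1, the required access could be granted with a bucket policy like the one below. The account ID, IAM user name, and bucket name are placeholders for illustration; the AK/SK pair you later supply to Cloudam should belong to this principal.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowObjectReadWrite",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/cloudam-hpc" },
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-genomics-input/*"
    },
    {
      "Sid": "AllowBucketListing",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/cloudam-hpc" },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-genomics-input"
    }
  ]
}
```

Note that object-level actions (`s3:GetObject`, `s3:PutObject`) apply to the `/*` object ARN, while `s3:ListBucket` applies to the bucket ARN itself.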


You can quickly create S3 buckets and upload input files through the AWS S3 console. If you already have buckets, you can skip this step. This is a simple process and only involves four steps:

1. Log in to the AWS S3 console.

2. Create an S3 bucket.

3. Set bucket permissions. Private read/write is recommended; for finer-grained control, use an S3 bucket policy or ACL.

(Optional) Restrict access so that only the specified IAM role ROLENAME can access resources under the specified bucket.

Reference: https://aws.amazon.com/cn/blogs/security/how-to-restrict-amazon-s3-bucket-access-to-a-specific-iam-role/

4. Upload input files.
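The four steps above can also be performed with the AWS CLI. This is a minimal sketch assuming AWS CLI v2 is configured with the AK/SK pair from the prerequisites; the bucket name, region, and file names are placeholders:

```shell
# Step 2: create the input bucket (for regions other than us-east-1,
# add: --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket \
    --bucket my-genomics-input \
    --region us-east-1

# Step 3: keep the bucket private by blocking all public access
aws s3api put-public-access-block \
    --bucket my-genomics-input \
    --public-access-block-configuration \
        BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Step 4: upload input files (e.g. paired-end FASTQ reads) under a prefix
aws s3 cp ./sample_R1.fastq.gz s3://my-genomics-input/inputs/
aws s3 cp ./sample_R2.fastq.gz s3://my-genomics-input/inputs/
```

Using a prefix such as inputs/ here mirrors the earlier suggestion of separating inputs and results by directory within one bucket.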


Workflow


Users only need to prepare the S3 buckets that store input and output files; the internal scheduling details of the cluster are handled automatically, which makes the platform very easy to use.
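Once the cluster is up, jobs are submitted from the login node through Slurm in the usual way. The following is a hypothetical batch script for a single alignment task; the partition name, bucket names, tool choice (BWA), and file paths are illustrative assumptions, not Cloudam defaults:

```shell
#!/bin/bash
# Hypothetical Slurm job: stage inputs from S3, align reads, push results back.
#SBATCH --job-name=bwa-align
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --output=bwa-align-%j.log

# Stage input files from the input bucket to local working storage
aws s3 cp s3://my-genomics-input/inputs/ ./inputs/ --recursive

# Run the alignment (assumes bwa and the reference ref.fa are available
# on the compute node image)
bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fa \
    inputs/sample_R1.fastq.gz inputs/sample_R2.fastq.gz > sample.sam

# Upload results to the output bucket (or an output prefix)
aws s3 cp sample.sam s3://my-genomics-output/results/
```

The script would be submitted from the login node with `sbatch align.sh`, and multiple such jobs can be queued at once, with Slurm handling node allocation.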