Genomics analysis is a fascinating field with high computational and storage requirements. On top of that, there are compliance requirements, especially when the analysis runs on top of patient clinical data. This makes genomic research a great candidate for a compliant, elastic cloud, none other than Azure, which even covers the recent NEN 7510 standard, a mandatory requirement for all Dutch organizations that process patient health information.
When it comes to genomics, a big chunk of the workload is related to genome-wide association studies (GWAS), which look for associations between genetic variants and certain traits or diseases. This is done by mapping and aligning genomes and then running analyses on top of those sequences. The following diagram shows some of the components you may want to consider for an enterprise-ready architecture that supports genomics workloads.
Components commonly seen in genomics solutions in Azure (see pptx version)
Let's start analyzing the diagram from the storage layer. Data coming out of a sequencer can be hundreds of gigabytes for a single sample, so when you move up to population genomics, you can easily reach petabyte or even exabyte scale. Here are some of the common things you will need to address:
- Move data to the cloud. You can use Azure Data Factory (ADF) to bring in data from publicly accessible sources, like other clouds, or you can use Azure Synapse pipelines if you already use Synapse or you want the data exfiltration protection that Azure Synapse Analytics offers.
- Organize and store data. Have a look at https://aka.ms/adls/hitchhikersguide, which offers a great guide on the things you need to consider while structuring a data lake. Also make sure you read about the life-cycle management capabilities of blob storage, which help you phase out old research data to reduce storage costs, and about reserved capacity, which reduces the overall storage ownership cost.
- Govern data, perhaps with Azure Purview, and of course using the compliance enforcement tools available in Azure.
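To make the life-cycle discussion concrete, here is a minimal sketch of the tiering decision you would typically encode in a Blob Storage lifecycle management policy. The thresholds (30 and 180 days) and the idea of tiering per sample file are illustrative assumptions, not recommendations:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative thresholds: tune them to your own retention requirements.
HOT_DAYS = 30      # recently sequenced samples, actively analyzed
COOL_DAYS = 180    # older cohorts, occasionally re-queried

def target_tier(last_modified: datetime, now: Optional[datetime] = None) -> str:
    """Return the blob access tier a sample file should be moved to,
    mirroring what a lifecycle management rule would decide."""
    now = now or datetime.now(timezone.utc)
    age = now - last_modified
    if age <= timedelta(days=HOT_DAYS):
        return "Hot"
    if age <= timedelta(days=COOL_DAYS):
        return "Cool"
    return "Archive"
```

In practice you would not run this logic yourself: the same rules can be expressed declaratively in the storage account's lifecycle management policy, which moves blobs between the Hot, Cool, and Archive tiers automatically.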
On the compute side, one approach is to scale up to a single large VM. The other approach is to scale out to multiple smaller VMs by breaking the job into multiple smaller tasks. This is where genomics workflow managers come into play. Workflow managers help scientists scale an existing research process by guiding them in building repeatable and auditable workflows (something you may need due to regulatory requirements). There are quite a few well-established workflow managers out there:
- Cromwell, from the Broad Institute of MIT and Harvard, allows you to orchestrate this type of workload using the Workflow Description Language (WDL) or the Common Workflow Language (CWL).
- Galaxy is another workflow manager that is widely used in on-premises deployments and works in Azure as well.
- Snakemake is another popular tool, mainly because scientists use Python to define their workflows.
- Other solutions include Nextflow, which has great integration with Azure; Parabricks, which was acquired by NVIDIA; and Bioconductor.
- CycleCloud, which is commonly used in hybrid scenarios. Manoj has a great two-part blog on his LinkedIn profile showing how you can run Snakemake on top of CycleCloud; if you are interested, start from his introductory post, "Power your genomic data analysis on Azure with Azure CycleCloud". CycleCloud supports most of the common schedulers. For example, if you already have bash shell scripts for coordinating and submitting jobs to an on-premises Slurm cluster, CycleCloud allows you to reuse the same scripts and operationalize them with cloud elasticity. If you are OK with hosting a VM for the control plane (CycleCloud comes from an on-premises-first approach where you host everything), then this is probably a good option.
- Azure Batch is probably the most cost-efficient service for running batch jobs. Even Azure Machine Learning (AzureML) uses it behind the scenes to implement scalable training processes. Microsoft has implemented the Cromwell TES Batch scheduler, which you can probably port to any workflow manager that supports TES (including Snakemake), or you can use dedicated integrations like SnakemakeBurst.
- Azure Kubernetes Service (AKS). Owning an AKS cluster is not trivial, and you will need at least three VMs running. I would advise this option only if you already have an AKS cluster and want to utilize spare compute, or if you plan to host other workloads on the cluster and are therefore planning to build the know-how of owning AKS anyway.
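Several of the options above meet at the GA4GH Task Execution Service (TES) interface, so here is a rough sketch of what a TES task submission looks like. The container image, command, resource sizes, and URLs are placeholder assumptions, and the exact endpoint path can vary between TES deployments:

```python
import json
import urllib.request

def make_tes_task(sample_url: str, output_url: str) -> dict:
    """Build a GA4GH TES task body that runs a containerized alignment
    step. Image, command, and paths are illustrative placeholders."""
    return {
        "name": "bwa-align-demo",
        "resources": {"cpu_cores": 16, "ram_gb": 32, "disk_gb": 500},
        "inputs": [{"url": sample_url, "path": "/data/sample.fastq"}],
        "outputs": [{"url": output_url, "path": "/data/aligned.bam"}],
        "executors": [{
            "image": "biocontainers/bwa:v0.7.17_cv1",  # example image
            "command": ["bwa", "mem", "/ref/grch38.fa", "/data/sample.fastq"],
        }],
    }

def submit(tes_endpoint: str, task: dict) -> None:
    """POST the task to a TES endpoint, e.g. the TES-on-Batch service
    that Cromwell on Azure deploys (path may differ per deployment)."""
    req = urllib.request.Request(
        f"{tes_endpoint}/v1/tasks",
        data=json.dumps(task).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())
```

Because the task is plain JSON against a standard API, the same payload should work whether the backend is Azure Batch, Funnel, or any other TES implementation.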
Microsoft has provided quite a few resources for folks to work with genomics in Azure, enabling institutes to build platforms like Terra, which can eventually support any type of research activity.
One of these resources is the Cromwell on Azure solution. With it, you can easily kick-start a Cromwell workflow manager that uses TES as a task scheduler, which in turn schedules the jobs on top of Azure Batch. The Microsoft repository provides multiple examples that you can take and customize to your needs, under the Run Common Workflows section of the readme file. The idea is that you dockerize your existing tools (or use existing Docker images) and Cromwell executes them. For example, if you already have a pipeline that does mutation detection, you can containerize and standardize it using the Workflow Description Language (WDL) or the Common Workflow Language (CWL). Learning WDL is pretty easy, and you can leverage this great open-source learning path for WDL. You then establish a process to migrate your data into blob storage, where you can build your data provenance with native storage account features for security and data protection. The data can be picked up by Cromwell for processing, and the results can be stored back in blobs for downstream analysis. With the data in a blob, you can easily attach it to a Genomics Data Science VM, which has all the downstream analysis tools you may be looking for in Python, R, or even Bioconductor.
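As a small illustration of the Cromwell side of this flow, the snippet below renders the inputs JSON that Cromwell expects alongside a WDL workflow, with keys fully qualified as `<workflow-name>.<input-name>`. The workflow and input names used here (MutationDetection, input_fastq, reference) are hypothetical:

```python
import json

def cromwell_inputs(workflow: str, params: dict) -> str:
    """Render a Cromwell inputs JSON document. Cromwell resolves inputs
    by fully qualified name, i.e. <workflow-name>.<input-name>."""
    qualified = {f"{workflow}.{name}": value for name, value in params.items()}
    return json.dumps(qualified, indent=2)

# Hypothetical mutation-detection pipeline with two inputs in blob storage.
inputs_json = cromwell_inputs("MutationDetection", {
    "input_fastq": "https://acct.blob.core.windows.net/in/sample.fastq",
    "reference": "https://acct.blob.core.windows.net/ref/grch38.fa",
})
print(inputs_json)
```

You would save this output next to your WDL file and hand both to Cromwell (or to the trigger mechanism that Cromwell on Azure provides) when submitting the workflow.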
If you are already familiar with the Spark ecosystem, you can use Glow to run GWAS on top of Spark.
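For intuition about what such a GWAS actually computes, here is a pure-Python sketch of the simplest per-variant association test: an ordinary least squares regression of phenotype on genotype dosage (0/1/2 alt-allele copies). Real pipelines, Glow included, add covariates and run this across millions of variants in parallel; this toy version only shows the core arithmetic:

```python
import math

def variant_association(genotypes, phenotypes):
    """Simple per-variant association test: OLS regression of phenotype
    on genotype dosage. Returns (beta, t_statistic)."""
    n = len(genotypes)
    g_mean = sum(genotypes) / n
    y_mean = sum(phenotypes) / n
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    sxy = sum((g - g_mean) * (y - y_mean)
              for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                      # effect size per allele copy
    alpha = y_mean - beta * g_mean        # intercept
    rss = sum((y - (alpha + beta * g)) ** 2
              for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    return beta, beta / se
```

A large t-statistic (equivalently, a small p-value) flags the variant as associated with the trait; a GWAS simply repeats this test, genome-wide, for every variant.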
Finally, if you want to avoid the hassle of building your own platform, you can take advantage of Microsoft's Genomics service, which is a cloud implementation of the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) for secondary analysis. Alternatively, you can look at turn-key solutions from Microsoft partners, like DRAGEN on Azure, provided by Illumina, which uses the components mentioned above and provides pre-built pipelines out of the box.
As you saw, Azure provides you with a lot of options when it comes to genomics: from basic elastic infrastructure, to genomics notebooks, to individual components that perform specific analyses, to end-to-end solutions. To decide what best fits your needs, let's look into the following questions:
- How much data do you have?
- What are you looking to do with it? Are you looking to combine it with Electronic Medical Records (EMR) data or imaging data like Digital Imaging and Communications in Medicine (DICOM) data?
- What is the current state of your environment? Do you already have a pipeline? Are you looking to scale out the existing pipeline to handle more data?
Once you have those answers you can decide which Azure components you will need. Feel free to use the Azure Icons Pack and my PowerPoint with the schemas shown in this blog post to start designing your own architectures.
Looking forward to seeing what type of solution you come up with in your genomics journey!