Wednesday, December 1, 2021

Genomics in Azure

Genomics analysis is an interesting field with high computational and storage requirements. On top of that, there are compliance requirements, especially if the analysis happens on top of patient clinical data. This makes genomic research activities great candidates for a compliant, elastic cloud, none other than Azure, which even covers the recent NEN 7510 standard, a mandatory requirement for all organizations in the Netherlands that process patient health information.

When it comes to genomics, a big chunk of the workload is related to genome-wide association studies (GWAS), which examine the association between genetic variants and certain traits or diseases. This is done by mapping and aligning genomes and then running analysis on top of those sequences. The following diagram shows some of the components you may want to consider for an enterprise-ready architecture that supports genomics workloads.
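To make the "association" part concrete, here is a toy sketch of the statistic behind a single-variant association test: a Pearson chi-square test on a 2x2 table of allele counts in cases versus controls. The counts below are made up for illustration; real GWAS pipelines test millions of variants with far more sophisticated models.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], e.g. risk/other allele counts in cases/controls."""
    n = a + b + c + d
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: risk allele seen 60 vs 140 times in cases,
# 40 vs 160 times in controls.
stat = chi_square_2x2(60, 140, 40, 160)
print(round(stat, 3))  # → 5.333
```

A large statistic (compared against the chi-square distribution with one degree of freedom) flags the variant as potentially associated with the trait.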

Components commonly seen in genomics solutions in Azure (see pptx version)

Let's start analyzing the diagram from the storage layer. Data coming out of a sequencer can be hundreds of gigabytes for a single sample, so when you move to population-scale genomics, you can easily reach petabyte or even exabyte scale.
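A quick back-of-the-envelope calculation shows how fast those scales arrive; the per-sample size and cohort size below are illustrative assumptions, not real study figures.

```python
# Illustrative storage math: hundreds of GB per sample adds up quickly.
gb_per_sample = 200          # assumed raw output per sequenced sample
samples = 5_000              # a modest population-scale study
total_pb = gb_per_sample * samples / 1_000_000  # GB -> PB (decimal units)
print(total_pb)  # → 1.0
```

In other words, a 5,000-sample study at 200 GB per sample already sits at a petabyte before any intermediate or derived data is counted.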

On the compute layer, if you are looking for a research environment with some extra compute, you can leverage the Genomics Data Science VM, which deploys a Windows or Linux Virtual Machine (VM) with all sorts of common data science tools and genomics-specific resources, like sample notebooks and common genomics workflows that can run on top of Bioconductor. This VM will get you started in minutes, but it comes with all the pain of managing the VM: updating the operating system, backing up your configuration and so on. Moreover, all processing is done within the same VM, so you can only scale up to more powerful, and also more expensive, VM SKUs.

The other approach is to scale out to multiple but smaller VMs, by breaking the job into multiple smaller tasks. This is where the genomics workflow managers come into play. Workflow managers help scientists scale an existing research process by guiding them in building repeatable and auditable workflows (something you may need due to regulatory requirements). There are quite a few well-established workflow managers out there, such as Cromwell, Nextflow and Snakemake.

Most of those workflow managers break the job into smaller tasks and then pass them to a task scheduler, which logs the necessary auditing information, then executes and monitors the actual tasks. Task schedulers are typically implementations on top of the Task Execution Service (TES) API, the Slurm Workload Manager or OpenPBS, the latter two being schedulers coming from the High Performance Computing (HPC) field. Most of those schedulers integrate well with Azure native resources like:
  • CycleCloud, which is commonly used in hybrid scenarios. Manoj has a great two-part blog series on his LinkedIn profile showing how you can run Snakemake on top of CycleCloud; if you are interested, start from his introductory post "Power your genomic data analysis on Azure with Azure CycleCloud". Most of the schedulers support CycleCloud. For example, if you already have bash shell scripts for coordinating and submitting jobs to an on-premises Slurm cluster, CycleCloud allows you to reuse the same scripts and operationalize them with cloud elasticity. If you are OK with having a VM to host the control plane (CycleCloud comes from an on-premises-first approach where you host everything), then this is probably a good option.
  • Azure Batch is probably the most cost-efficient service for running batch jobs; even Azure Machine Learning (AzureML) uses it behind the scenes to implement scalable training processes. Microsoft has implemented the Cromwell TES Batch scheduler, which you can probably port to any workflow manager that supports TES (including Snakemake), or you can use dedicated integrations like SnakemakeBurst.
  • Azure Kubernetes Service (AKS). Owning an AKS cluster is not trivial and you will need at least 3 VMs running. I would advise this option only if you already have an AKS cluster and want to utilize the spare compute, or if you plan to host other loads in the AKS cluster and are thus willing to invest in the know-how of owning an AKS cluster.
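The "break the job into smaller tasks" step the schedulers above consume can be sketched in a few lines. This toy scatter function splits a genome into fixed-size intervals so that each interval can be submitted as one independent task; the chromosome lengths are the GRCh38 sizes for chr1 and chr2, but the chunk size is an arbitrary illustrative choice.

```python
def scatter_intervals(chromosome_lengths, chunk_size):
    """Split every chromosome into (chrom, start, end) intervals,
    each of which can become one independent scheduler task."""
    tasks = []
    for chrom, length in chromosome_lengths.items():
        for start in range(0, length, chunk_size):
            # clamp the last interval to the chromosome end
            tasks.append((chrom, start, min(start + chunk_size, length)))
    return tasks

tasks = scatter_intervals(
    {"chr1": 248_956_422, "chr2": 242_193_529},  # GRCh38 lengths
    50_000_000,                                  # illustrative chunk size
)
print(len(tasks))  # → 10, i.e. ten independently schedulable tasks
```

Each tuple would then be handed to CycleCloud, Azure Batch or AKS as a separate unit of work, and the results merged in a gather step afterwards.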

Let's zoom into the Spoke Virtual Network reference on the diagram. Due to the sensitivity of the data handled in such solutions, it is common to see networking isolation on top of the identity protection that Azure Active Directory provides. The hub-spoke network topology is widely adopted by enterprises to allow them to scale within the cloud. Smaller enterprises may start with the Trey Research reference architecture and scale from there.

With the network isolation in place, scientists need access to the isolated data and the various resources. If the scientists already have their own devices for their daily work, they can reuse that investment and connect from their machines to the private network with technologies like site-to-site VPN or ExpressRoute if they are working from an on-premises environment. Alternatively, scientists can connect with an ad-hoc point-to-site VPN if they are working remotely. These approaches assume that the enterprise already has a way to attest the health of the devices, something that can be enforced using Azure Active Directory's Conditional Access. If the scientists don't have their own devices, then they can either leverage the VM solution mentioned above and access it through Azure Bastion, or the enterprise can adopt Azure Virtual Desktop, probably with a personal desktop assignment approach.

Microsoft has provided quite a few resources for folks to work with genomics in Azure, enabling institutes to build platforms like Terra, which can eventually support any type of research activity.

One of these resources is the Cromwell on Azure solution. With this resource, you can easily kick-start a Cromwell workflow manager that uses TES as a task scheduler to schedule jobs on top of Azure Batch. The Microsoft repository provides multiple examples that you can take and customize to your needs, under the Run Common Workflows section of the readme file. The idea is that you dockerize your existing tools (or use existing docker images) and Cromwell will execute them. For example, if you already have a pipeline that does mutation detection, you can containerize and standardize your pipeline using the Workflow Description Language (WDL) or the Common Workflow Language (CWL). Learning WDL is pretty easy and you can leverage this great open-source learning path for WDL. You then establish a process to migrate your data into blob storage, where you can build your data provenance with native storage account features for security and data protection. The data can be picked up by Cromwell to be processed, and the results can be stored back in blobs for downstream analysis. Having the data in a blob, you can easily attach it to a Genomics Data Science VM, which has all the downstream analysis tools you may be looking for in Python, R or even Bioconductor.
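To give a feel for what flows between the workflow manager and the scheduler in this setup, here is a sketch of the kind of task message a workflow manager submits to a TES endpoint, loosely following the GA4GH TES v1 schema. The field names come from the TES specification, but the container image, blob URLs, command and resource sizes are hypothetical placeholders, not values Cromwell on Azure actually produces.

```python
import json

# Hypothetical TES-style task: run a containerized aligner over one
# sample stored in blob storage, writing the result back to blobs.
task = {
    "name": "align-sample-001",
    "inputs": [{
        "url": "https://myaccount.blob.core.windows.net/inputs/sample-001.fastq",
        "path": "/data/sample-001.fastq",
        "type": "FILE",
    }],
    "outputs": [{
        "url": "https://myaccount.blob.core.windows.net/outputs/sample-001.bam",
        "path": "/data/sample-001.bam",
        "type": "FILE",
    }],
    "executors": [{
        # your dockerized tool; image name and command are illustrative
        "image": "biocontainers/bwa:latest",
        "command": ["bwa", "mem", "ref.fa", "/data/sample-001.fastq"],
    }],
    "resources": {"cpu_cores": 4, "ram_gb": 8},
}

payload = json.dumps(task)  # what gets POSTed to the TES endpoint
print(sorted(task.keys()))
```

The TES implementation then localizes the inputs, runs the executor's container on an Azure Batch node, and uploads the declared outputs, which is exactly why dockerizing your tools is the main prerequisite.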

If you are already familiar with the Spark ecosystem, you can use Glow to do GWAS on top of Spark.

Finally, if you want to avoid the hassle of building your own platform, you can take advantage of Microsoft's Genomics service, which is a cloud implementation of the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) for secondary analysis. Alternatively, you can look at Microsoft partners' turn-key solutions like DRAGEN on Azure, provided by Illumina, which uses the components mentioned above and provides pre-built pipelines out of the box.

As you saw, Azure provides a lot of options when it comes to genomics: from basic elastic infrastructure, to genomics notebooks, to individual components that do specific analysis, to end-to-end solutions. To decide what is the best fit for your needs, let's look into the following questions:

  • How much data do you have?
  • What are you looking to do with it? Are you looking to combine it with Electronic Medical Records (EMR) data or imaging data like Digital Imaging and Communications in Medicine (DICOM) data?
  • What is the current state of your environment? Do you already have a pipeline? Are you looking to scale out the existing pipeline to handle more data?

Once you have those answers you can decide which Azure components you will need. Feel free to use the Azure Icons Pack and my PowerPoint with the schemas shown in this blog post to start designing your own architectures.

Looking forward to seeing what type of solutions you come up with on your genomics journey!


