Genomics Pipeline Management System

With the development of next-generation sequencing (NGS), more institutions are looking to leverage genomic sequencing both for academic research and to assist clinical cancer diagnosis. In clinical applications, this can take the form of pipelines that automate some or all of the process by which raw base call images are turned into annotated variant call files, which bioinformaticists then use to provide diagnostic information to the medical community. In this work, we describe the control framework used to manage this workflow for our clinical operations.

The complex, multifaceted computational requirements of NGS demand large amounts of processing power, yet the work itself consists of tightly controlled, largely self-contained processes. As such, this system lends itself well to cloud computing infrastructure in which resources can be allocated as required. For our work, this has taken the form of both a private cloud, backed by OpenStack and Ceph, and the public cloud (Amazon Web Services) using encrypted architecture. Additionally, as the work must be reproducible, we leverage Docker application containerization to precisely control the processing environment. Finally, for archiving data, we leverage the BagIt specification developed by the Library of Congress. We collectively refer to this system as the Genomic Processing Management System (GPMS).
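
To illustrate the archiving step, the following is a minimal sketch of what a BagIt bag looks like on disk, using only the Python standard library. It is not the GPMS implementation, which this section does not show. The function names make_bag and verify_bag and the choice of SHA-256 are assumptions made for illustration. The sketch copies a results directory into the bag's data/ payload directory, writes the bagit.txt declaration, and records one checksum per payload file in manifest-sha256.txt. It ignores optional tag files such as bag-info.txt and the path encoding rules for unusual file names.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large sequencing files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Hypothetical helper: copy source_dir into bag_dir/data and write the tag files."""
    payload_dir = bag_dir / "data"
    shutil.copytree(source_dir, payload_dir)  # also creates bag_dir itself

    # Bag declaration: names the BagIt version and the tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to the bag root>" line per file.
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Hypothetical helper: recompute each checksum listed in the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, relative = line.split(maxsplit=1)
        file_path = bag_dir / relative
        if not file_path.is_file() or sha256_of(file_path) != expected:
            return False
    return True
```

Because the manifest travels with the payload, a bag can be checked again with verify_bag after it is moved between the private and public clouds, which is one reason a checksum-based packaging format suits reproducible clinical work.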

This work was presented at the 2019 Association for Pathology Informatics Summit.

GPMS Architecture