by Elimu Informatics
In their November 2017 report, the Office of the National Coordinator’s Sync for Genes project emphasizes the need for “pilots that test FHIR Genomics for Genomic Archiving and Communication System (GACS) integration with EHRs”. We support this direction, both as a viable mechanism for greater genomics-EHR integration, and because it opens up some pretty interesting possibilities for genomics clinical decision support (CDS). In this blog we’ll take a closer look at these possibilities.
A GACS stores sequence data generated from a sequencing laboratory and is analogous in many ways to a Picture Archiving and Communication System (PACS), which stores image files that are too large to store directly in an EHR. Traditional laboratories have focused on communicating simple categorical results and interpretative information. Some of these laboratories now need to provide more provenance and underlying raw data for potential reanalysis and other scenarios, and they are turning to a GACS type of solution to store all of that data. Consider, for example, a lab that performs Whole Genome Sequencing (WGS). WGS studies can identify millions of variations between a patient’s DNA and a reference DNA, and typically only a small handful of these (e.g. those known to be pathogenic) are returned to the EHR. The unused millions of variations for that single test will need to reside in a GACS in case they are needed in the future. As a result, it will be increasingly necessary to standardize the way in which information from a GACS is retrieved by an EHR.
What type of data might we find in a GACS? An innovative prototype being developed at Nationwide Children’s describes a data source layer consisting of FASTQ, BAM, and VCF formatted files. Many of us in the clinical informatics space are less familiar with these bioinformatics file formats, but they are the bread and butter of next-generation sequencing data, with an enormous and vibrant user base. FASTQ files store the raw sequencing reads that come off a sequencer. Reads in a FASTQ file can be aligned to a reference sequence, resulting in a sequence alignment file, stored in BAM format. Once the raw reads have been aligned to a reference sequence, it becomes possible to identify variants (differences between the raw reads and the reference), which are stored in a VCF file. The Global Alliance for Genomics & Health (GA4GH) maintains the BAM and VCF specifications, much like Health Level Seven maintains the FHIR standard.
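For readers less familiar with VCF, here is a minimal sketch of what reading one variant record might look like. It assumes a single-sample file whose FORMAT column includes a GT (genotype) key. The `VcfRecord` class, the `parse_vcf_line` function, and the example line are our own illustrative inventions, not part of any particular library, and the coordinates are made up.

```python
from dataclasses import dataclass


@dataclass
class VcfRecord:
    chrom: str
    pos: int          # 1-based position on the reference
    ref: str          # reference allele
    alt: list[str]    # one or more alternate alleles
    genotype: str     # GT value for the first sample, e.g. "0/1"


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse one tab-separated VCF data line (single sample, GT present)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Column 9 (FORMAT) names the per-sample keys; column 10 holds their values.
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    sample = dict(zip(format_keys, sample_values))
    return VcfRecord(chrom, int(pos), ref, alt.split(","), sample["GT"])


# Invented example line, not real patient data.
example = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:35"
record = parse_vcf_line(example)
```

A production system would more likely rely on an established VCF parsing library; the point here is only that each record is a small, structured piece of a very large file.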
How might a GACS be architected into a genomics CDS application? A possible schematic is shown in the figure below.
In this model, a CDS application communicates with an EHR via CDS-Hooks and FHIR. Events in the EHR trigger CDS rules, and in order to compute certain rules (such as looking for drug-gene interactions), the CDS may need patient-specific genomic data, only some of which may be in the EHR. In addition, one could imagine the CDS application being triggered by external events, such as the recategorization of a variant in an external knowledge base. Such a recategorization could trigger the CDS application to search the GACS for patients who may need follow-up.
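To make the interaction in this model a little more concrete, the sketch below shows one way a CDS service might answer an order-related hook with a card. It is a simplification under several assumptions: the `drugs` list in the request context is a hypothetical stand-in for the FHIR draft orders a real hook would carry, `DRUG_GENE_RULES` is an illustrative one-entry rule table, and `fake_gacs_lookup` stands in for a genomic query that might be answered by the EHR or a GACS. The card fields (`summary`, `indicator`, `source`) follow the general shape of a CDS Hooks response.

```python
from typing import Callable, Optional

# Illustrative rule table only: drug name -> gene whose status the rule needs.
DRUG_GENE_RULES = {"azathioprine": "TPMT"}


def handle_order_hook(
    request: dict,
    lookup_phenotype: Callable[[str, str], Optional[str]],
) -> dict:
    """Return a CDS Hooks-style response for a simplified order request."""
    patient_id = request["context"]["patientId"]
    cards = []
    # "drugs" is a simplified, hypothetical field used only for this sketch.
    for drug in request["context"].get("drugs", []):
        gene = DRUG_GENE_RULES.get(drug)
        if gene is None:
            continue  # no genomic rule applies to this drug
        phenotype = lookup_phenotype(patient_id, gene)
        if phenotype is not None:
            cards.append({
                "summary": f"{gene} result on file ({phenotype}); review {drug} order",
                "indicator": "warning",
                "source": {"label": "Example genomics CDS (hypothetical)"},
            })
    return {"cards": cards}


def fake_gacs_lookup(patient_id: str, gene: str) -> Optional[str]:
    """Stand-in for a genomic data query; returns invented data."""
    if (patient_id, gene) == ("patient-1", "TPMT"):
        return "poor metabolizer"
    return None


response = handle_order_hook(
    {"hook": "order-select",
     "context": {"patientId": "patient-1", "drugs": ["azathioprine"]}},
    fake_gacs_lookup,
)
```

The lookup is passed in as a function so that the rule logic does not need to know whether the genomic data came from the EHR or from a GACS.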
Raw sequencing data of the type we might expect to see in a GACS (e.g. FASTQ, BAM, VCF) and EHR data standards (e.g. FHIR) have evolved independently, posing interoperability and integration challenges. Several open source tools (such as FHIR-Converter, NCBI Variation Services, PharmCAT, VariantValidator) can transform complex bioinformatics data into bite-sized pieces of information that can be used clinically. However, CDS applications will likely have requirements that go beyond what today’s tools provide: What regions of the LDLR gene were studied? Is the identified CFTR delta F508 mutation heterozygous or homozygous? Are the identified TPMT variants in cis or in trans? What CYP2D6 star alleles are present?
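Two of these questions, zygosity and cis/trans, can in principle be answered from the genotype (GT) values in a VCF file. The sketch below is a rough illustration under stated assumptions: calls are diploid, `/` separates unphased alleles and `|` separates phased ones (as in VCF), and the two calls being compared belong to the same phase set. The function names are our own, and real pipelines would need to handle many more edge cases.

```python
import re


def describe_genotype(gt: str) -> dict:
    """Classify a diploid VCF GT value such as "0/1" or "1|0"."""
    phased = "|" in gt
    alleles = re.split(r"[|/]", gt)
    if len(alleles) != 2 or "." in alleles:
        # Haploid, polyploid, or missing calls are out of scope for this sketch.
        return {"zygosity": "unknown", "phased": phased}
    if alleles[0] == alleles[1]:
        zygosity = "homozygous reference" if alleles[0] == "0" else "homozygous"
    else:
        zygosity = "heterozygous"
    return {"zygosity": zygosity, "phased": phased}


def cis_or_trans(gt_a: str, gt_b: str) -> str:
    """Compare two phased heterozygous calls, assumed to share a phase set."""
    if "|" not in gt_a or "|" not in gt_b or "." in gt_a or "." in gt_b:
        return "undetermined"
    # For each call, record which haplotype (first or second) carries a non-reference allele.
    hap_a = [allele != "0" for allele in gt_a.split("|")]
    hap_b = [allele != "0" for allele in gt_b.split("|")]
    if sum(hap_a) != 1 or sum(hap_b) != 1:
        return "undetermined"  # the question only applies to two heterozygous calls
    return "cis" if hap_a == hap_b else "trans"
```

For example, `cis_or_trans("0|1", "1|0")` reports the two variants as being on opposite haplotypes, while an unphased call such as `"0/1"` leaves the question undetermined, which is exactly the kind of gap a CDS application would need the GACS to surface.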
GA4GH maintains the htsget protocol, designed for efficient access to sequencing data. It’s an interesting and powerful model: huge raw data files live in a GACS, and in response to a query, only the small amount of data that is relevant is extracted and returned. Likewise, we envision the GACS wrapped with an on-demand ‘FHIR Translator’, similar in spirit to htsget, except that query and response are FHIR formatted. With this approach: (1) raw genomics data is stored in a GACS; (2) the GACS receives a FHIR-based query; (3) the GACS searches the raw data and extracts the relevant portions; (4) the GACS performs on-demand translation, to return FHIR objects.
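The four steps above can be sketched in a few lines of code. This is a toy illustration and not a proposed interface: `STORED_VARIANTS` is an invented in-memory stand-in for the raw data held in a GACS, `query_region` and `fhir_query` are hypothetical names, and the Observation is deliberately simplified, with components identified by plain text rather than coded. A real translator would follow the HL7 Genomics Reporting implementation guide.

```python
from typing import Iterator

# Step 1: stand-in for variant calls stored in a GACS.
# Tuples are (chrom, pos, ref, alt, genotype); all values are invented.
STORED_VARIANTS = [
    ("chr1", 1000, "A", "G", "0/1"),
    ("chr1", 5000, "C", "T", "1/1"),
    ("chr2", 1200, "G", "A", "0/1"),
]


def query_region(chrom: str, start: int, end: int) -> Iterator[tuple]:
    """Step 3: yield stored variants with position in [start, end] (1-based, inclusive)."""
    for variant in STORED_VARIANTS:
        if variant[0] == chrom and start <= variant[1] <= end:
            yield variant


def to_fhir_observation(patient_id: str, variant: tuple) -> dict:
    """Step 4: translate one variant into a simplified FHIR Observation."""
    chrom, pos, ref, alt, genotype = variant
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": "69548-6",
                             "display": "Genetic variant assessment"}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        # Text-only component codes are a simplification for this sketch.
        "component": [
            {"code": {"text": "chromosome"}, "valueString": chrom},
            {"code": {"text": "position"}, "valueInteger": pos},
            {"code": {"text": "reference allele"}, "valueString": ref},
            {"code": {"text": "alternate allele"}, "valueString": alt},
            {"code": {"text": "genotype"}, "valueString": genotype},
        ],
    }


def fhir_query(patient_id: str, chrom: str, start: int, end: int) -> dict:
    """Step 2: accept a region query and return a FHIR searchset Bundle."""
    entries = [{"resource": to_fhir_observation(patient_id, v)}
               for v in query_region(chrom, start, end)]
    return {"resourceType": "Bundle", "type": "searchset",
            "total": len(entries), "entry": entries}


bundle = fhir_query("patient-1", "chr1", 900, 2000)
```

Only the variants in the requested region are translated, so the cost of producing FHIR objects scales with the size of the question rather than the size of the genome.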
Viewed another way, bioinformaticists typically think of sequencing data as being processed through a ‘pipeline’, such as the one defined by the Broad Institute. In this ‘reads-to-variants’ workflow, raw sequencing reads that come off a sequencer are the input to a set of defined computational processes designed to optimize variant discovery and genotyping. Here, we extend the notion of ‘pipeline’ just a bit, by adding a terminal, on-demand FHIR translation step.
Much work remains to be done to achieve a GACS with a standardized FHIR interface for CDS, but we see such an architecture as likely playing a role in future genomics clinical decision support applications.