Big data: you can hardly pick up a newspaper without reading about some new scientific or business acumen derived from mining some heretofore-untouched volumes of digital information. Well, I’m happy to say that genome sequence data—which certainly qualifies as big, both in volume and velocity—is joining the party, and in a most meaningful way. When combined with information from medical records, genome data can be mined for new insights into treating disease.
Currently, however, genomics analyses are dominated by batch processing, which means that simple analytical questions can take days to answer. A more efficient way to exploit genomics data would be to run queries against a large genome database in the cloud, by using a platform like Windows Azure. In this scenario, the queries are answered across the network in seconds.
Towards this vision, I have been working with researchers at University of California San Diego (UCSD) and have invented the Genome Query Language (GQL), which features three operators that allow error-resilient manipulation of genome intervals. This, in turn, abstracts a variety of existing genomic software tasks, such as variant calling (determining whether a person has a different gene from the reference) and haplotyping (ascribing genomic variation as being inherited from the mother or the father). GQL is inspired by the classic database query language SQL and has similar operators; however, GQL introduces a major new operator: the fault-tolerant union of genomic intervals.
To understand how GQL could be used on the Windows Azure platform in the cloud, imagine that a biologist is working on the ApoE gene, which is responsible for forming lipoproteins in the body. Wondering how ApoE gene variations affect cardiovascular disease (CV), the biologist types in a query with the parameters “ApoE, CV” on a tablet computer, just as you might enter a search-engine query. The query is sent to the GQL implementation in the cloud, which returns the ApoE region of the genome in patients with cardiovascular disease. Since the ApoE gene is quite small, the data is processed quickly in the cloud and returned in seconds to the biologist’s tablet. The biologist can then use customized bioinformatics software to mine the data to identify variations.
We have implemented GQL on Windows Azure and used it to query genomic data expeditiously. We have shown, for example, how GQL can be used to query The Cancer Genome Atlas for large structural variations by using only 5 to 10 lines of high-level code. The code took approximately 60 seconds to execute on the Windows Azure application in the cloud when run on an input human genome file of 83 gigabytes. GQL can improve existing software as well by refactoring queries, significantly speeding up results. It could also be used to facilitate browsing by queries and not just location within the UCSC genome browser.
To make the GQL implementation provide interactive speeds, two optimizations were crucial: cached parsing and lazy joins. Combined, they sped up query processing by a factor of 100. I encourage interested readers to explore the details of our research—the GQL queries we used, the optimizations we implemented, and the experimental results we achieved—in the Microsoft Research Technical Report: Interactive Genomics: Rapidly Querying Genomes in the Cloud.
—George Varghese, Principal Researcher, Microsoft Research