Are easy access and security in genome analysis compatible? (Yes!)
Human genome data can be analyzed to derive immense healthcare benefits, and equally misused with disastrous consequences for victims of its theft. While we want to analyze large quantities of genetic data to maximize our knowledge, we need to keep this data completely secure. A new framework for genomic data analysis using brand-new mathematical techniques for data encryption gives both freedom and security.
Every body is different – medically speaking
New discoveries are rapidly being made linking genetic variants with drug interactions and with disease risks in healthcare. Driven by these discoveries is the field of precision medicine, which means managing or treating patients based on their metabolism and risk factors, rather than by applying uniform solutions to all patients.
We know that certain drugs that successfully treat illness in some people are ineffective for other people. The root cause of these variations in effectiveness may lie in genetic differences from person to person. Current findings in pharmacogenomics, the study of how variants in genes determine drug metabolism, tell us that about half of all primary care patients are exposed to drugs whose metabolism is affected by genes. Studies have found that 18% of the 4 billion prescriptions written in the US per year are affected by such genetic variations.
Sequence data are also used to predict disease risk. For instance, the American College of Medical Genetics and Genomics (ACMG) recommends reporting secondary findings in 56 genes. As many as 7% of patients harbor one of nearly 19,000 pathogenic or likely pathogenic variants in these 56 ACMG genes. Early detection of the conditions caused by these variants will allow clinicians to act on them and reduce the risk to the patient.
The cost of whole genome sequencing has decreased dramatically in recent years and is expected to continue to decrease. Currently, next generation sequencing of a whole genome is a fraction of the cost of an MRI. The low cost makes it practical for routine clinical use. Consequently, the use of sequence data in clinical practice is also increasing rapidly and may be routine in just a few years. In fact, genome data are increasingly used in eligibility criteria for clinical trials. Currently, there are over 34,000 active clinical trials in the United States. Algorithms and software tools are being developed for automatically matching patients and trials based on genomic data to help increase enrollment.
The criminal imagination and the theft of genomic information
As shown above, the benefits of genomic data analysis in medical care are immense. But preserving privacy must go hand-in-hand with using sequence data. Human genomic data are highly sensitive due to their uniqueness and predictive value. Any leakage of the data is irrevocable because, unlike credit card or social security numbers, these data are permanently attached to the individual. Once the information is revealed, it cannot be controlled and can be misused. A person whose genetic information becomes publicly available can be denied insurance, employment or loans if the determining authority decides the application is risky, regardless of the merit of the assessment. Questions of paternity, hereditary illnesses or conditions can be made public. These issues can result in social embarrassment or be surrounded by superstition. Discrimination can also affect relatives due to the shared DNA. The victim of stolen genetic data can therefore be extorted too. It has even been suggested that DNA can be synthesized and planted to frame someone for a crime.
Maintaining security is often at the discretion of a few scientists, and a single individual can be responsible for a breakdown of privacy. Even de-identification of data may not suffice. Merely 75 single-nucleotide polymorphisms (SNPs) out of billions are sufficient to uniquely re-identify an individual, and a few dozen database queries can determine whether a victim is a member of the database.
A challenge in securing genome data is its volume. A whole genome sequence may occupy gigabytes of storage. Most healthcare institutions cannot store, manage and encrypt data of this size for tens of thousands of their patients. The patient’s data also need to be shared with other healthcare institutions where the patient may be receiving care. A cloud server can store and manage the data, but the involvement of a third party increases the possibility of data breaches. Millions of patient and healthcare records are breached every year. DNA-testing services have inadvertently exposed the data of patients due to insufficient software protection, or have purposefully sold customer data.
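To give a feel for why so few SNPs can single out one person, here is a rough counting sketch. It assumes each SNP has three possible genotypes and that the 75 SNPs are independent and evenly distributed, which real genomes are not, so the figure is an upper bound on distinct profiles rather than a result from the cited study. The population figure is a round-number assumption.

```python
# Rough counting argument; all constants are illustrative assumptions.
WORLD_POPULATION = 8_000_000_000   # round number, not from the article
GENOTYPES_PER_SNP = 3              # e.g. AA, AB, BB for a biallelic SNP
N_SNPS = 75

# Number of distinct genotype profiles over 75 independent SNPs.
combinations = GENOTYPES_PER_SNP ** N_SNPS

# How many possible profiles exist per living person under these assumptions.
profiles_per_person = combinations // WORLD_POPULATION

print(f"possible profiles: {combinations:.2e}")
print(f"profiles per person: {profiles_per_person:.2e}")
```

Even after allowing for correlated SNPs and uneven genotype frequencies, the number of possible profiles dwarfs the number of people, which is the intuition behind re-identification.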
Old methods for a new frontier
Until recently, the ease of access to genomic data for analysis was at odds with data security. Common encryption methods provide encryption at rest: they keep the encrypted data (“ciphertext”) secure only while it sits in storage and is not being analyzed. But analyzing the data, such as checking whether a patient has a variant in a gene, requires the data to be decrypted. Once the data are decrypted for computation, the “plaintext” data are vulnerable again.
In the past few years, privacy and cryptographic techniques for secure computation have been extensively studied. Multi-party computation (MPC) is considered a promising method of secure computation. In this approach, multiple parties maintain local data and communicate intermediate results. MPC can be inefficient and vulnerable when the computing parties collude, and is thus inappropriate for long-term storage and outsourcing computation.
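One common building block of MPC is additive secret sharing, sketched below as an illustration. This is a minimal toy and not a protocol from the cited papers: the hospital counts, the modulus and the function names are all made up for the example. Each party splits its private number into random shares, and only sums of shares are ever exchanged.

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; all share arithmetic is done modulo this


def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares (or sums of shares) into the hidden value."""
    return sum(shares) % MODULUS


# Hypothetical private inputs: carriers of one variant at three hospitals.
local_counts = [12, 7, 30]
n = len(local_counts)

# Hospital i keeps distributed[i][i] and sends distributed[i][j] to hospital j.
distributed = [share(count, n) for count in local_counts]

# Hospital j adds the shares it holds; this is the intermediate result it publishes.
partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]

total = reconstruct(partial_sums)
print(total)  # combined carrier count; no hospital revealed its own count
```

The weakness mentioned above is visible here: if all the other hospitals pool what they received, they can recover the remaining hospital's input.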
Computation on ciphertext
Homomorphic encryption (HE) methods encrypt the data in a way that allows mathematical operations to be done directly on ciphertext. The results, when decrypted, are the same as the results of the operations done on plaintext. This allows us not only to store data on a cloud server, but also to outsource computations on it without compromising privacy. Since no plaintext is required on the cloud server, the data are never exposed there. The first HE schemes, developed in the 1970s, supported only a limited set of operations. In 2009, Gentry constructed a scheme that allows both addition and multiplication on ciphertext. Security rests on cryptographic hardness assumptions that are believed to hold even against quantum computers. Until recently, HE was considered too slow to be useful in commercial systems. However, researchers at the University of Texas Health Science Center at Houston (UTH), Miran Kim and Xiaoqian Jiang, have made HE with real numbers more efficient. They have also developed a new algorithm to efficiently multiply ciphertext matrices, which has sped up computations enough to be practically useful.
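The defining property, that an operation on ciphertext matches the operation on plaintext, can be seen in a toy example. The sketch below uses unpadded textbook RSA with tiny demonstration primes, which happens to be homomorphic for multiplication only. It is not secure and it is not the scheme used by the researchers mentioned above; it only illustrates what "computing on ciphertext" means.

```python
# Textbook RSA with tiny demo parameters. Insecure; for illustration only.
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def encrypt(message):
    return pow(message, E, N)


def decrypt(ciphertext):
    return pow(ciphertext, D, N)


a, b = 7, 3
cipher_a, cipher_b = encrypt(a), encrypt(b)

# The "server" multiplies the two ciphertexts without ever seeing 7 or 3.
cipher_product = (cipher_a * cipher_b) % N

# Only the key holder can decrypt, and obtains the product of the plaintexts.
product_plain = decrypt(cipher_product)
print(product_plain)
```

A fully homomorphic scheme extends this idea so that both addition and multiplication work on ciphertext, which is what makes general computations such as matrix products possible.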
What remains for genomics-based decision support is this: how do we pose our questions of sequencing data in its ciphertext space? The data and questions must have a mathematical representation to allow mathematical operations on ciphertext. If we can encrypt the questions as well as the data they are applied to, then the cloud server will be unable to interpret the data, the results, or even the question.
This idea is at the center of Elimu’s new project to provide clinical decision support with genomic data. This project is sponsored by the National Institutes of Health (NIH), and is in partnership with Drs. Kim and Jiang at UTH. We are developing a framework to represent variant data and genomics questions. In this framework, gene variants comprise a basis set of vectors, and we can apply linear transformations to these vectors. Our questions are therefore represented by matrices. The patient vectors and question matrices are all homomorphically encrypted. The matrix multiplication of the HE data is efficiently carried out using UTH’s algorithms.
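As a plaintext illustration of the representation, the sketch below encodes a patient as a vector of allele counts and a question as a matrix applied to that vector. The choice of three CYP2C19 star alleles as the basis, the encoding and the function names are assumptions made for this example and may differ from the project's actual design. In the real framework both the vector and the matrix would be encrypted before the multiplication.

```python
# Hypothetical basis: each position in a patient vector counts one allele.
ALLELES = ["CYP2C19*1", "CYP2C19*2", "CYP2C19*17"]


def patient_vector(diplotype):
    """Encode a pair of alleles as counts over the ALLELES basis."""
    return [diplotype.count(allele) for allele in ALLELES]


def matvec(matrix, vector):
    """Apply a question matrix (list of rows) to a patient vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Each row of the question matrix asks for one quantity.
QUESTION = [
    [0, 1, 0],  # how many copies of the *2 allele?
    [0, 0, 1],  # how many copies of the *17 allele?
]

patient = patient_vector(["CYP2C19*1", "CYP2C19*2"])
answer = matvec(QUESTION, patient)
print(patient, answer)
```

Because the question is just a linear map, the server only ever needs additions and multiplications, which are the operations HE supports.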
In this project we endeavor to answer genomics questions for clinical trials matching, pharmacogenomics and gene reanalysis. Currently, we can answer questions related to alleles and genotypes, and questions with mappings to alleles or genotypes. For instance, we can answer “is patient ‘X’ a good metabolizer of clopidogrel?” We can efficiently answer questions for populations of patients by horizontally concatenating patient vectors. We can also answer probabilistic questions when genotypes are ambiguous.
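The population and probabilistic cases can be sketched in plaintext as follows. Patients become columns of one matrix, so a single matrix product answers the question for everyone. For an ambiguous genotype, this sketch stores expected allele counts, which is one plausible reading of "probabilistic questions" and an assumption of this example, as are the allele basis, the probabilities and the helper names.

```python
# Basis order (hypothetical): [CYP2C19*1, CYP2C19*2, CYP2C19*17]


def hconcat(vectors):
    """Place patient vectors side by side as the columns of one matrix."""
    return [list(row) for row in zip(*vectors)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def expected_vector(candidates):
    """Probability-weighted average of candidate genotype vectors."""
    size = len(candidates[0][1])
    return [sum(prob * vec[i] for prob, vec in candidates) for i in range(size)]


patient_a = [1, 1, 0]   # *1/*2
patient_b = [2, 0, 0]   # *1/*1
# Ambiguous call: 75% likely *1/*2, 25% likely *1/*1 (made-up probabilities).
patient_c = expected_vector([(0.75, [1, 1, 0]), (0.25, [2, 0, 0])])

population = hconcat([patient_a, patient_b, patient_c])
question = [[0, 1, 0]]  # expected number of *2 copies per patient

answers = matmul(question, population)
print(answers)  # one column per patient
```

Linearity is what makes this work: the expected answer for an ambiguous patient equals the answer computed from the expected vector.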
We have developed a client-server model. A genome sequencing laboratory sends patient sequence files in variant call format (VCF) to the client, such as a healthcare institution. The client obtains a secret key, homomorphically encrypts the data and sends it to the cloud server for storage. Subsequently, questions can be posed by a clinician using the client. The client generates a question matrix, encrypts it, and sends it to the server. The server does the computation and sends the ciphertext result back to the client, which decrypts it for the clinician with the help of the secret key.
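The message flow can be sketched with a toy additively homomorphic scheme (Paillier with very small primes). This is a simplified stand-in and not the project's implementation: the class names, the patient identifier and the key size are invented, and the question row is sent in plaintext here, whereas the framework described above encrypts the question as well, which requires a scheme that can multiply two ciphertexts.

```python
import math
import secrets


class Client:
    """Healthcare institution: holds the secret key, encrypts data, decrypts answers."""

    def __init__(self, p=1009, q=1013):  # toy primes, far too small for real use
        self.n = p * q
        self.n2 = self.n * self.n
        self._lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # secret
        self._mu = pow(self._lam, -1, self.n)                    # secret

    def encrypt(self, message):
        while True:  # pick random r that is invertible modulo n
            r = secrets.randbelow(self.n - 1) + 1
            if math.gcd(r, self.n) == 1:
                break
        return pow(self.n + 1, message, self.n2) * pow(r, self.n, self.n2) % self.n2

    def decrypt(self, ciphertext):
        x = pow(ciphertext, self._lam, self.n2)
        return (x - 1) // self.n * self._mu % self.n


class Server:
    """Cloud server: stores ciphertext and computes on it; never sees a key."""

    def __init__(self, n):
        self.n2 = n * n
        self.store = {}

    def upload(self, patient_id, encrypted_vector):
        self.store[patient_id] = encrypted_vector

    def answer(self, patient_id, question_row):
        # Multiplying ciphertexts adds plaintexts; raising to w scales by w.
        acc = 1  # a trivial encryption of 0
        for cipher, weight in zip(self.store[patient_id], question_row):
            acc = acc * pow(cipher, weight, self.n2) % self.n2
        return acc


client = Client()
server = Server(client.n)  # the server learns only the public modulus

allele_counts = [1, 1, 0]  # hypothetical patient vector derived from a VCF file
server.upload("patient-X", [client.encrypt(v) for v in allele_counts])

encrypted_answer = server.answer("patient-X", [0, 1, 0])
result = client.decrypt(encrypted_answer)
print(result)
```

The server handles only ciphertext from upload to answer; decryption happens solely on the client, which matches the division of roles described above.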
As HE continues to become more efficient, and more mathematical operations can be added, it will become possible to apply machine learning algorithms to analyze data. This will allow researchers access to large secure databases of genomic data where there was previously none, due to the risk of data exposure. And the availability of large datasets will help immensely to advance precision medicine.
Footnote: This research is being funded by Grant 1R41HG010978-01 from the National Human Genome Research Institute of the National Institutes of Health.
G. C. Bell et al., “Development and use of active clinical decision support for preemptive pharmacogenomics,” J. Am. Med. Inform. Assoc., vol. 21, no. e1, pp. e93-9, Feb. 2014.
M. V. Relling and W. E. Evans, “Pharmacogenomics in the clinic,” Nature, vol. 526, no. 7573, pp. 343–50, Oct. 2015.
S. S. Kalia et al., “Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics,” Genet. Med., vol. 19, no. 2, pp. 249–255, 2017.
M. O. Dorschner et al., “Actionable, pathogenic incidental findings in 1,000 participants’ exomes,” Am. J. Hum. Genet., vol. 93, no. 4, pp. 631–40, Oct. 2013.
M.-A. Jang, S.-H. Lee, N. Kim, and C.-S. Ki, “Frequency and spectrum of actionable pathogenic secondary findings in 196 Korean exomes,” Genet. Med., vol. 17, no. 12, pp. 1007–11, Dec. 2015.
M. L. Thompson et al., “Genomic sequencing identifies secondary findings in a cohort of parent study participants,” Genet. Med., vol. 20, no. 12, pp. 1635–1643, 2018.
M. J. Landrum et al., “ClinVar: public archive of interpretations of clinically relevant variants,” Nucleic Acids Res., vol. 44, no. D1, pp. D862-8, Jan. 2016.
“Now You Can Sequence Your Whole Genome for Just $200 | WIRED.” [Online]. Available: https://www.wired.com/story/whole-genome-sequencing-cost-200-dollars/. [Accessed: 21-Oct-2019].
C. Turnbull et al., “The 100 000 Genomes Project: Bringing whole genome sequencing to the NHS,” BMJ, vol. 361, 2018.
“Home – ClinicalTrials.gov.” [Online]. Available: https://clinicaltrials.gov/. [Accessed: 22-Oct-2019].
J. Lindsay et al., “MatchMiner: An open source computational platform for real-time matching of cancer patients to precision medicine clinical trials using genomic and clinical criteria,” bioRxiv, p. 199489, 2017.
P. R. Reilly, “Genetic risk assessment and insurance,” Genet. Test., vol. 2, no. 1, pp. 1–2, 1998.
M. Naveed et al., “Privacy in the Genomic Era,” ACM Comput. Surv., vol. 48, no. 1, Sep. 2015.
S. E. Brenner, “Be prepared for the big genome leak,” Nature, vol. 498, no. 7453, p. 139, 2013.
Z. Lin, A. B. Owen, and R. B. Altman, “Genetics. Genomic research and human subject privacy,” Science, vol. 305, no. 5681, p. 183, Jul. 2004.
S. S. Shringarpure and C. D. Bustamante, “Privacy risks from genomic data-sharing beacons,” Am. J. Hum. Genet., 2015.
J. L. Raisaro et al., “Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks,” J. Am. Med. Inform. Assoc., vol. 24, no. 4, pp. 799–805, Jul. 2017.
N. von Thenen, E. Ayday, and A. E. Cicek, “Re-identification of individuals in genomic data-sharing beacons via allele inference,” Bioinformatics, vol. 35, no. 3, pp. 365–371, 2019.
A. Chen, “Why a DNA data breach is much worse than a credit card leak – The Verge.” [Online]. Available: https://www.theverge.com/2018/6/6/17435166/myheritage-dna-breach-genetic-privacy-bioethics. [Accessed: 22-Oct-2019].
A. C. C. Yao, “How to generate and exchange secrets,” in Annual Symposium on Foundations of Computer Science (Proceedings), 1986, pp. 162–167.
I. Damgård, V. Pastro, N. Smart, and S. Zakarias, Multiparty computation from somewhat homomorphic encryption, vol. 7417 LNCS. 2012.
C. Gentry, Fully Homomorphic Encryption Using Ideal Lattices. 2009.
M. Kim and K. Lauter, “Private genome analysis through homomorphic encryption,” BMC Med. Inform. Decis. Mak., vol. 15, no. 5, Dec. 2015.
X. Jiang, M. Kim, K. Lauter, and Y. Song, “Secure Outsourced Matrix Computation and Application to Neural Networks,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security – CCS ’18, 2018, pp. 1209–1222.