Are easy access and security in genome analysis compatible? (Yes!)

Nov 5, 2019

<span style=”font-weight: 400;”>Human genome data can be analyzed to derive immense healthcare benefits, and equally misused with disastrous consequences for victims of its theft. While we want to analyze large quantities of genetic data to maximize our knowledge, we need to keep this data completely secure. A new framework for genomic data analysis using brand-new mathematical techniques for data encryption gives both freedom and security.</span>

<b>Every body is different – medically speaking</b>

<span style=”font-weight: 400;”>New discoveries are rapidly being made linking genetic variants with drug interactions and with disease risks in healthcare. Driven by these discoveries is the field of precision medicine, which means managing or treating patients based on their metabolism and risk factors, rather than by applying uniform solutions to all patients.  </span>

<span style=”font-weight: 400;”>We know that certain drugs that successfully treat illness in some people are ineffective for others. The root cause of these variations in effectiveness may lie in genetic differences from person to person. Current findings in pharmacogenomics, the study of how variants in genes determine drug metabolism, tell us that about half of all primary care patients are exposed to drugs whose metabolism is affected by genes [1]. Studies have found that 18% of the 4 billion prescriptions written in the US per year are affected by such genetic variations [2]. </span>

<span style=”font-weight: 400;”>Sequence data are also used to predict disease risk. For instance, the American College of Medical Genetics and Genomics (ACMG) recommends reporting secondary findings in 56 genes [3]. As many as 7% of patients harbor one of nearly 19,000 pathogenic or likely pathogenic variants in these 56 ACMG genes [4]–[7]. Early detection of the conditions associated with these variants allows clinicians to act on them and reduce the risk to the patient.</span>

<span style=”font-weight: 400;”>The cost of whole genome sequencing has decreased dramatically in recent years and is expected to continue to decrease. Currently, next generation sequencing of a whole genome is a fraction of the cost of an MRI [8]. The low cost makes it practical for routine clinical use. Consequently, the use of sequence data in clinical practice is also increasing rapidly and may be routine in just a few years [9]. In fact, genome data are increasingly used in eligibility criteria for clinical trials. Currently, there are over 34,000 active clinical trials in the United States [10]. Algorithms and software tools are being developed for automatically matching patients and trials based on genomic data [11] to help increase enrollment.</span>

<b>The criminal imagination and the theft of genomic information</b>

<span style=”font-weight: 400;”>As shown above, the benefits of genomic data analysis in medical care are immense. But preserving privacy must go hand-in-hand with using sequence data. Human genomic data are highly sensitive due to their uniqueness and predictive value. Any leakage of the data is irrevocable because, unlike credit card or social security numbers, genomic data are permanently attached to the individual. Once the information is revealed, it cannot be controlled and can be misused. A person whose genetic information becomes publicly available can be denied insurance, employment or loans if the determining authority decides the application is risky, regardless of the merit of the assessment [12], [13]. Questions of paternity and hereditary illnesses or conditions can be made public. Such disclosures can result in social embarrassment or be surrounded by superstition. Discrimination can also affect relatives because of shared DNA. A victim of genetic data theft is therefore also exposed to extortion. It has even been suggested that DNA can be synthesized and planted to frame someone for a crime [10]. </span>

<span style=”font-weight: 400;”>Maintaining security is often at the discretion of a few scientists, and a single individual can be responsible for a breakdown of privacy [14]. Even de-identification of data may not suffice. Merely 75 single-nucleotide polymorphisms (SNPs) out of billions are sufficient to uniquely re-identify an individual [15], and a few dozen database queries can determine whether a victim is a member of the database [16]–[18].</span>
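
<span style=”font-weight: 400;”>A back-of-the-envelope calculation suggests why so few SNPs are enough. The sketch below is only an illustration under assumptions that are not from the cited study: the SNPs are independent and biallelic, each allele has frequency 0.5, genotypes follow Hardy-Weinberg proportions, and the world population is taken as 8 billion.</span>

```python
import math

# Assumptions (not from the article): independent biallelic SNPs, allele
# frequency 0.5, Hardy-Weinberg proportions, world population of 8 billion.
NUM_SNPS = 75
WORLD_POPULATION = 8_000_000_000

# Each biallelic SNP has three possible genotypes: AA, AB, BB.
genotype_probs = [0.25, 0.50, 0.25]

# Shannon entropy (in bits) carried by one SNP genotype.
bits_per_snp = -sum(p * math.log2(p) for p in genotype_probs)

# Information in the whole panel versus bits needed to single out one person.
panel_bits = NUM_SNPS * bits_per_snp
bits_needed = math.log2(WORLD_POPULATION)

print(f"distinct genotype combinations: {3 ** NUM_SNPS:.3e}")
print(f"information in panel: {panel_bits:.1f} bits")
print(f"needed to single out one person: {bits_needed:.1f} bits")
```

<span style=”font-weight: 400;”>Under these idealized assumptions the panel carries several times more information than is needed to distinguish every living person, which is why removing names from a dataset does not by itself make it anonymous.</span>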

<span style=”font-weight: 400;”>A challenge in securing genome data is its volume. A whole genome sequence may occupy gigabytes of storage. Most healthcare institutions cannot store, manage and encrypt data of this size for tens of thousands of their patients. A patient’s data also need to be shared with other healthcare institutions where the patient may be receiving care. A cloud server can store and manage the data, but the involvement of a third party increases the possibility of data breaches. Millions of patient and healthcare records are breached every year. DNA-testing services have inadvertently exposed patient data through insufficient software protection, or have purposefully sold customer data [19].</span>

<b>Old methods for a new frontier</b>

<span style=”font-weight: 400;”>Until recently, the ease of access to genomic data for analysis was at odds with data security. Common encryption methods provide encryption at rest: they keep the encrypted data (“ciphertext”) secure only as long as it sits in storage and is not being analyzed. But analyzing the data, such as checking if a patient has a variant in a gene, requires the data to be decrypted. Once the data are decrypted for computation, the “plaintext” data are vulnerable again. </span>

<span style=”font-weight: 400;”>In the past few years, privacy and cryptographic techniques for secure computation have been extensively studied. Multi-party computation (MPC) is considered a promising method of secure computation [20]. In this approach, multiple parties maintain local data and communicate only intermediate results. MPC can be inefficient, and it is vulnerable when the computing parties collude; it is thus poorly suited to long-term storage and outsourced computation [21]. </span>
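
<span style=”font-weight: 400;”>To make the idea concrete, here is a minimal sketch of one classic MPC building block, additive secret sharing, used to add up private counts. It is an illustration only, not the protocol of any cited work; the function names, the modulus and the carrier counts are all hypothetical.</span>

```python
import secrets

MODULUS = 2**61 - 1  # public modulus; every share lies in [0, MODULUS)

def make_shares(value, num_parties):
    """Split value into random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_values):
    """Each party shares its value; only sums of received shares are revealed."""
    n = len(private_values)
    # distributed[i][j] is the share that party i sends to party j.
    distributed = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received: its published intermediate result.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS

# Hypothetical carrier counts for one variant at three institutions.
carrier_counts = [12, 7, 30]
total = secure_sum(carrier_counts)
print(total)
```

<span style=”font-weight: 400;”>A single share looks like a random number, so no party learns another party’s count from what it receives. If the other parties pool everything they hold, however, the remaining party’s input is exposed, which is the collusion weakness described above.</span>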

<b>Computation on ciphertext</b>

<span style=”font-weight: 400;”>Homomorphic encryption (HE) methods encrypt the data in a way that allows mathematical operations to be done directly on ciphertext. The results, when decrypted, are the same as the results of the operations done on plaintext. This allows us not only to store data on a cloud server, but also to outsource computations on it without compromising privacy. Since no plaintext is required on the cloud server, the data are never exposed there. The first HE schemes were developed in the 1970s. In 2009, Gentry constructed a scheme that allows both addition and multiplication on ciphertext [22]. Security rests on cryptographic hardness assumptions that are believed to withstand attacks by quantum computers. Until recently, HE was considered too slow to be useful in commercial systems. However, researchers at the University of Texas Health Science Center at Houston (UTH), Miran Kim and Xiaoqian Jiang, have made HE with real numbers more efficient [23]. They have also developed a new algorithm to efficiently multiply ciphertext matrices, which has sped up computations enough to be practically useful [24].</span>
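
<span style=”font-weight: 400;”>The property of computing on ciphertext can be seen in a few lines of code. The sketch below uses a simplified integer-based scheme chosen purely for readability. It is not the lattice-based constructions of [22]–[24], its parameters are toy values, and it should not be considered secure. All names are hypothetical.</span>

```python
import secrets

PLAINTEXT_MODULUS = 257   # messages are integers modulo 257 (toy value)
NOISE_BITS = 16           # size of the random noise added at encryption
KEY_BITS = 256            # size of the secret key

def keygen():
    """Secret key: a large random odd integer p."""
    return secrets.randbits(KEY_BITS) | (1 << (KEY_BITS - 1)) | 1

def encrypt(p, m):
    """c = m + t*r + p*q hides m behind small noise t*r and a multiple of p."""
    r = secrets.randbits(NOISE_BITS)
    q = secrets.randbits(KEY_BITS)
    return (m % PLAINTEXT_MODULUS) + PLAINTEXT_MODULUS * r + p * q

def decrypt(p, c):
    """Remove the multiple of p, then the noise; valid while noise stays below p."""
    return (c % p) % PLAINTEXT_MODULUS

key = keygen()
a, b = 12, 5
ca, cb = encrypt(key, a), encrypt(key, b)

# These two lines use only ciphertexts: no key and no plaintext are involved.
c_sum = ca + cb
c_prod = ca * cb

print(decrypt(key, c_sum), decrypt(key, c_prod))
```

<span style=”font-weight: 400;”>Decrypting the two results gives the sum and the product of the original numbers modulo 257, even though the party doing the arithmetic never saw them. Every operation makes the hidden noise grow, so a scheme like this supports only a limited number of multiplications; managing that growth efficiently is a large part of what makes practical HE difficult.</span>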

<span style=”font-weight: 400;”>What remains for genomics-based decision support is this: how do we pose our questions of sequencing data in its ciphertext space? The data and questions must have a mathematical representation to allow mathematical operations on ciphertext. If we can encrypt the questions as well as the data they are applied to, then the cloud server will be unable to interpret the data, the results, or even the question. </span>

<span style=”font-weight: 400;”>This idea is at the center of Elimu’s </span><a href=”″><span style=”font-weight: 400;”>new project</span></a><span style=”font-weight: 400;”> to provide clinical decision support with genomic data. This project is sponsored by the National Institutes of Health (NIH), and is in partnership with Drs. Kim and Jiang at UTH. We are developing a framework to represent variant data and genomics questions. In this framework, gene variants form a basis set of vectors, and we can apply linear transformations to these vectors. Our questions are therefore represented by matrices. The patient vectors and question matrices are all homomorphically encrypted. The matrix multiplication of the HE data is efficiently carried out using UTH’s algorithms.</span>
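
<span style=”font-weight: 400;”>As a rough illustration of this representation, the plaintext sketch below encodes a patient’s genotype at a single site as a one-hot vector and two questions as the rows of a matrix. The basis, the questions and the function names are hypothetical and are not taken from the project’s actual encoding; in the framework itself both operands would be encrypted.</span>

```python
# Hypothetical basis: the three genotypes at one biallelic site.
GENOTYPE_BASIS = ["ref/ref", "ref/alt", "alt/alt"]

def patient_vector(genotype):
    """One-hot vector selecting the patient's genotype in the basis."""
    return [1 if g == genotype else 0 for g in GENOTYPE_BASIS]

def apply_question(matrix, vector):
    """Matrix-vector product: each row of the matrix is one question."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

QUESTIONS = [
    [0, 1, 1],  # row 0: does the patient carry the alternate allele? (1 = yes)
    [0, 1, 2],  # row 1: how many copies of the alternate allele?
]

answers = apply_question(QUESTIONS, patient_vector("ref/alt"))
print(answers)
```

<span style=”font-weight: 400;”>Because every question reduces to multiplications and additions, it can be evaluated with exactly the operations that HE supports.</span>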

<span style=”font-weight: 400;”>In this project we endeavor to answer genomics questions for clinical trials matching, pharmacogenomics and gene reanalysis. Currently, we can answer questions related to alleles and genotypes, and questions with mappings to alleles or genotypes. For instance, we can answer “is patient ‘X’ a good metabolizer of clopidogrel?” We can efficiently answer questions for populations of patients by horizontally concatenating patient vectors. We can also answer probabilistic questions when genotypes are ambiguous. </span>
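
<span style=”font-weight: 400;”>The plaintext sketch below illustrates the population and probabilistic cases together. It assumes a deliberately simplified mapping in which only the CYP2C19 *1/*1 diplotype counts as a normal metabolizer of clopidogrel; the patients, the probabilities and the function names are hypothetical, and the example is not clinical guidance.</span>

```python
# Simplified basis of CYP2C19 diplotypes (illustrative subset only).
DIPLOTYPES = ["*1/*1", "*1/*2", "*2/*2"]

# Question matrix (1 x 3): weight 1 on diplotypes treated as normal metabolizers.
NORMAL_METABOLIZER = [[1.0, 0.0, 0.0]]

def concat_columns(patient_vectors):
    """Horizontally concatenate column vectors into a diplotypes x patients matrix."""
    return [list(row) for row in zip(*patient_vectors)]

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

patients = {
    "X": [1.0, 0.0, 0.0],    # unambiguous *1/*1
    "Y": [0.0, 0.0, 1.0],    # unambiguous *2/*2
    "Z": [0.75, 0.25, 0.0],  # ambiguous call, stored as genotype probabilities
}

population = concat_columns(patients.values())
answers = dict(zip(patients, matmul(NORMAL_METABOLIZER, population)[0]))
print(answers)
```

<span style=”font-weight: 400;”>A single matrix product answers the question for every patient at once. For patient Z the result is a probability rather than a yes or no, because the ambiguity in the genotype carries through the linear transformation.</span>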

<span style=”font-weight: 400;”>We have developed a client-server model. A genome sequencing laboratory sends patient sequence files in variant call format (VCF) to the client, such as a healthcare institution. The client obtains a secret key, homomorphically encrypts the data and sends the ciphertext to the cloud server for storage. Subsequently, a clinician can pose questions through the client. The client generates a question matrix, encrypts it, and sends it to the server. The server does the computation and sends the ciphertext result back to the client, which decrypts it for the clinician using the secret key. </span>
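
<span style=”font-weight: 400;”>The flow of messages can be sketched as follows. This is a toy model under stated assumptions, not the project’s implementation: the class and method names are hypothetical, the encryption is a simplified and insecure integer-based scheme standing in for the real HE library, and VCF parsing is replaced by a hand-written one-hot genotype vector.</span>

```python
import secrets

T = 257          # plaintext modulus (toy value)
KEY_BITS = 256   # size of the secret key

class Client:
    """Healthcare institution: holds the secret key, encrypts and decrypts."""

    def __init__(self):
        self._key = secrets.randbits(KEY_BITS) | (1 << (KEY_BITS - 1)) | 1

    def encrypt(self, m):
        noise = T * secrets.randbits(16)
        return m % T + noise + self._key * secrets.randbits(KEY_BITS)

    def encrypt_vector(self, values):
        return [self.encrypt(v) for v in values]

    def decrypt(self, c):
        return (c % self._key) % T

class Server:
    """Cloud server: stores ciphertext and computes on it; never sees the key."""

    def __init__(self):
        self._store = {}

    def upload(self, patient_id, enc_vector):
        self._store[patient_id] = enc_vector

    def ask(self, patient_id, enc_question_matrix):
        enc_patient = self._store[patient_id]
        # Encrypted matrix-vector product: ciphertext times ciphertext, then sum.
        return [sum(q * x for q, x in zip(row, enc_patient))
                for row in enc_question_matrix]

client, server = Client(), Server()

# Step 1: the client encrypts a (hypothetical) one-hot genotype vector and uploads it.
server.upload("patient-X", client.encrypt_vector([0, 1, 0]))

# Step 2: the clinician's question becomes an encrypted matrix
# (here a single row counting copies of the alternate allele).
enc_question = [client.encrypt_vector([0, 1, 2])]

# Step 3: the server computes on ciphertext; the client decrypts the result.
enc_answer = server.ask("patient-X", enc_question)
answer = [client.decrypt(c) for c in enc_answer]
print(answer)
```

<span style=”font-weight: 400;”>The secret key never leaves the client object, and the server handles only large integers that it cannot interpret. That separation is the point of the design: the data, the question and the answer are all opaque to the party doing the computation.</span>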

<span style=”font-weight: 400;”>As HE continues to become more efficient, and as more mathematical operations are supported, it will become possible to apply machine learning algorithms to the encrypted data. This will give researchers access to large, secure databases of genomic data that previously could not be made available because of the risk of data exposure. The availability of large datasets will in turn help immensely to advance precision medicine.</span>


<i><span style=”font-weight: 400;”>Footnote: This research is being funded by Grant </span></i><a href=”″><i><span style=”font-weight: 400;”>1R41HG010978-01</span></i></a><i><span style=”font-weight: 400;”> from the National Human Genome Research Institute of the National Institutes of Health.</span></i>

<span style=”font-weight: 400;”>[1]</span> <span style=”font-weight: 400;”>G. C. Bell </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “Development and use of active clinical decision support for preemptive pharmacogenomics,” </span><i><span style=”font-weight: 400;”>J. Am. Med. Inform. Assoc.</span></i><span style=”font-weight: 400;”>, vol. 21, no. e1, pp. e93-9, Feb. 2014.</span>

<span style=”font-weight: 400;”>[2]</span> <span style=”font-weight: 400;”>M. V. Relling and W. E. Evans, “Pharmacogenomics in the clinic,” </span><i><span style=”font-weight: 400;”>Nature</span></i><span style=”font-weight: 400;”>, vol. 526, no. 7573, pp. 343–50, Oct. 2015.</span>

<span style=”font-weight: 400;”>[3]</span> <span style=”font-weight: 400;”>S. S. Kalia </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics,” </span><i><span style=”font-weight: 400;”>Genet. Med.</span></i><span style=”font-weight: 400;”>, vol. 19, no. 2, pp. 249–255, 2017.</span>

<span style=”font-weight: 400;”>[4]</span> <span style=”font-weight: 400;”>M. O. Dorschner </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “Actionable, pathogenic incidental findings in 1,000 participants’ exomes,” </span><i><span style=”font-weight: 400;”>Am. J. Hum. Genet.</span></i><span style=”font-weight: 400;”>, vol. 93, no. 4, pp. 631–40, Oct. 2013.</span>

<span style=”font-weight: 400;”>[5]</span> <span style=”font-weight: 400;”>M.-A. Jang, S.-H. Lee, N. Kim, and C.-S. Ki, “Frequency and spectrum of actionable pathogenic secondary findings in 196 Korean exomes,” </span><i><span style=”font-weight: 400;”>Genet. Med.</span></i><span style=”font-weight: 400;”>, vol. 17, no. 12, pp. 1007–11, Dec. 2015.</span>

<span style=”font-weight: 400;”>[6]</span> <span style=”font-weight: 400;”>M. L. Thompson </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “Genomic sequencing identifies secondary findings in a cohort of parent study participants,” </span><i><span style=”font-weight: 400;”>Genet. Med.</span></i><span style=”font-weight: 400;”>, vol. 20, no. 12, pp. 1635–1643, 2018.</span>

<span style=”font-weight: 400;”>[7]</span> <span style=”font-weight: 400;”>M. J. Landrum </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “ClinVar: public archive of interpretations of clinically relevant variants,” </span><i><span style=”font-weight: 400;”>Nucleic Acids Res.</span></i><span style=”font-weight: 400;”>, vol. 44, no. D1, pp. D862-8, Jan. 2016.</span>

<span style=”font-weight: 400;”>[8]</span> <span style=”font-weight: 400;”>“Now You Can Sequence Your Whole Genome for Just $200 | WIRED.” [Online]. Available: [Accessed: 21-Oct-2019].</span>

<span style=”font-weight: 400;”>[9]</span> <span style=”font-weight: 400;”>C. Turnbull </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “The 100 000 Genomes Project: Bringing whole genome sequencing to the NHS,” </span><i><span style=”font-weight: 400;”>BMJ</span></i><span style=”font-weight: 400;”>, vol. 361, 2018.</span>

<span style=”font-weight: 400;”>[10]</span> <span style=”font-weight: 400;”>“Home –” [Online]. Available: [Accessed: 22-Oct-2019].</span>

<span style=”font-weight: 400;”>[11]</span> <span style=”font-weight: 400;”>J. Lindsay </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “MatchMiner: An open source computational platform for real-time matching of cancer patients to precision medicine clinical trials using genomic and clinical criteria,” </span><i><span style=”font-weight: 400;”>bioRxiv</span></i><span style=”font-weight: 400;”>, p. 199489, 2017.</span>

<span style=”font-weight: 400;”>[12]</span> <span style=”font-weight: 400;”>P. R. Reilly, “Genetic risk assessment and insurance,” </span><i><span style=”font-weight: 400;”>Genet. Test.</span></i><span style=”font-weight: 400;”>, vol. 2, no. 1, pp. 1–2, 1998.</span>

<span style=”font-weight: 400;”>[13]</span> <span style=”font-weight: 400;”>M. Naveed </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “Privacy in the Genomic Era,” </span><i><span style=”font-weight: 400;”>ACM Comput. Surv.</span></i><span style=”font-weight: 400;”>, vol. 48, no. 1, Sep. 2015.</span>

<span style=”font-weight: 400;”>[14]</span> <span style=”font-weight: 400;”>S. E. Brenner, “Be prepared for the big genome leak,” </span><i><span style=”font-weight: 400;”>Nature</span></i><span style=”font-weight: 400;”>, vol. 498, no. 7453, p. 139, 2013.</span>

<span style=”font-weight: 400;”>[15]</span> <span style=”font-weight: 400;”>Z. Lin, A. B. Owen, and R. B. Altman, “Genetics. Genomic research and human subject privacy,” </span><i><span style=”font-weight: 400;”>Science</span></i><span style=”font-weight: 400;”>, vol. 305, no. 5681, p. 183, Jul. 2004.</span>

<span style=”font-weight: 400;”>[16]</span> <span style=”font-weight: 400;”>S. S. Shringarpure and C. D. Bustamante, “Privacy risks from genomic data-sharing beacons,” </span><i><span style=”font-weight: 400;”>Am. J. Hum. Genet.</span></i><span style=”font-weight: 400;”>, 2015.</span>

<span style=”font-weight: 400;”>[17]</span> <span style=”font-weight: 400;”>J. L. Raisaro </span><i><span style=”font-weight: 400;”>et al.</span></i><span style=”font-weight: 400;”>, “Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks,” </span><i><span style=”font-weight: 400;”>J. Am. Med. Inform. Assoc.</span></i><span style=”font-weight: 400;”>, vol. 24, no. 4, pp. 799–805, Jul. 2017.</span>

<span style=”font-weight: 400;”>[18]</span> <span style=”font-weight: 400;”>N. von Thenen, E. Ayday, and A. E. Cicek, “Re-identification of individuals in genomic data-sharing beacons via allele inference,” </span><i><span style=”font-weight: 400;”>Bioinformatics</span></i><span style=”font-weight: 400;”>, vol. 35, no. 3, pp. 365–371, 2019.</span>

<span style=”font-weight: 400;”>[19]</span> <span style=”font-weight: 400;”>A. Chen, “Why a DNA data breach is much worse than a credit card leak – The Verge.” [Online]. Available: [Accessed: 22-Oct-2019].</span>

<span style=”font-weight: 400;”>[20]</span> <span style=”font-weight: 400;”>A. C. C. Yao, “How to generate and exchange secrets,” in </span><i><span style=”font-weight: 400;”>Annual Symposium on Foundations of Computer Science (Proceedings)</span></i><span style=”font-weight: 400;”>, 1986, pp. 162–167.</span>

<span style=”font-weight: 400;”>[21]</span> <span style=”font-weight: 400;”>I. Damgård, V. Pastro, N. Smart, and S. Zakarias, </span><i><span style=”font-weight: 400;”>Multiparty computation from somewhat homomorphic encryption</span></i><span style=”font-weight: 400;”>, vol. 7417 LNCS. 2012.</span>

<span style=”font-weight: 400;”>[22]</span> <span style=”font-weight: 400;”>C. Gentry, </span><i><span style=”font-weight: 400;”>Fully Homomorphic Encryption Using Ideal Lattices</span></i><span style=”font-weight: 400;”>. 2009.</span>

<span style=”font-weight: 400;”>[23]</span> <span style=”font-weight: 400;”>M. Kim and K. Lauter, “Private genome analysis through homomorphic encryption,” </span><i><span style=”font-weight: 400;”>BMC Med. Inform. Decis. Mak.</span></i><span style=”font-weight: 400;”>, vol. 15, no. 5, Dec. 2015.</span>

<span style=”font-weight: 400;”>[24]</span> <span style=”font-weight: 400;”>X. Jiang, M. Kim, K. Lauter, and Y. Song, “Secure Outsourced Matrix Computation and Application to Neural Networks,” in </span><i><span style=”font-weight: 400;”>Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security – CCS ’18</span></i><span style=”font-weight: 400;”>, 2018, pp. 1209–1222.</span>
