For the past few months, I’ve been doing computational biology research. When explaining my research to others, one of the most common questions I get is “what is computational biology?”
This always takes me slightly by surprise, especially when the question comes from STEM students, but the general lack of knowledge about computational biology is understandable:
- Unlike natural language processing or other hot areas of CS/CS-adjacent research, computational biology requires some domain knowledge to understand what the exciting and relevant problems are. For example, a sixth grader could (at a very high level) understand that NLP scientists want to develop fast and accurate translation algorithms that model the structure of grammar and relationships between words. In computational biology, many of the interesting problems require some basic understanding of DNA, RNA, and protein structures, which may be unfamiliar to some audiences.
- Computational biology hasn’t built consumer goods in the ways other computer science fields have. There are no self-driving cars, no Google Translates, and no smartphones that computational biologists have built to revolutionize healthcare at the same scale. Perhaps, someday, if computational drug discovery and personalized medicine do come to fruition, we may see the same hype around computational biology as we do for other computational fields. The largest reason for this lack of products is that we don’t understand the science of biology as well as we do the science of linguistics (or any other CS-adjacent field). Like so many “computational X” fields, our results are constrained by our domain knowledge, and in biology, we don’t know the answers to many questions.
That being said, I’m excited about computational biology. I, personally, have two reasons. The first is that many interesting problems in CS/Stat have some application or analogue in computational biology.
Are you interested in building efficient data structures for pattern matching? Perhaps you can create an efficient method to map reads from an RNA-seq experiment to a reference transcript using k-mer (just substrings of length k) hashing and a structure known as the colored de Bruijn graph.
Do you enjoy researching traditional statistical methods to parse signals from noise? There are all sorts of methods to parse genetic signals for certain diseases.
Interested in approximation algorithms for NP-complete problems? There are a lot of really difficult problems relevant in biology.
Do you want to suggest pockets for potential drugs to molecularly bind to proteins? Some people are trying to solve it with deep learning.
Do you like thinking about generative models to capture higher-order dependencies between variables through latent variables? There are some applications in capturing higher-order dependencies in protein sequences.
In short, there’s a home for all types of CS/Stat-adjacent people. Second, I personally think it’s exciting to have a scientific interpretation for results. Whereas black-box models may be fine for translation, where all we care about is the right answer, in comp bio we really want to use computer science and statistics to uncover hidden truths about biology (although there certainly are some problems for which just the right answer matters). This makes every problem all the more interesting, as each requires domain knowledge along with all the normal CS/Stat concepts.
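To make the k-mer hashing idea from the read-mapping example a bit more concrete, here is a minimal sketch in Python. It is a deliberately simplified stand-in: a plain hash table from k-mers to the transcripts that contain them, rather than a real colored de Bruijn graph (the set of transcript names loosely plays the role of the "colors"). The function names and the toy sequences are made up for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Return the transcripts that contain every k-mer of the read.

    This intersects hash-table hits instead of aligning base by base,
    so it is a necessary condition for a match, not a full alignment.
    """
    candidates = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        candidates = hits if candidates is None else candidates & hits
        if not candidates:
            return set()  # some k-mer is absent from every remaining candidate
    # candidates is still None when the read is shorter than k
    return set(candidates) if candidates else set()


# Toy reference "transcriptome" (hypothetical sequences).
transcripts = {"tx1": "ACGTACGTGA", "tx2": "TTGACGTGAC"}
index = build_index(transcripts, k=4)
print(compatible_transcripts("TACGT", index, k=4))
```

Real tools add a lot on top of this (reverse complements, sequencing errors, compact graph representations), but the core trick of replacing alignment with hashed substring lookups is the same.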
So what actually is computational biology? I’d like to spend the second part of this post describing my broad framework for computational biology and how I think different areas of research fit into this framework.
Computational biology is a broad set of often-disparate biological problems which can be solved, or just better understood, using computational methods. We hope that solving these problems can help us improve patients’ lives, mostly by filling in the (many) gaps in the search for cures for diseases, a process which in its current state is notoriously expensive and slow. The current pipeline for this process is (very crudely) as follows:
1. Read and use patient data (e.g. RNA-seq, DNA-seq, etc.) to provide a basis for analysis. How can we efficiently store and search patient data? How can we create platforms so that other scientists can easily check their data against others? Computational biologists have made immense contributions to this step, from BLAST to the Protein Data Bank to everything in between.
2. Analyze this data and find representations that give you insights into how certain diseases operate at a genetic level. What genes are overexpressed in patients with cancer X? How do these genes regulate each other? Computational biologists have also gotten “pretty good” at this data analysis step, developing methods like GSEA and RNA velocity.
3. Combine this genetic data with other biological data to build a more comprehensive picture of disease.
4. Use these disease pictures to find exact proteins and biological pathways responsible for these diseases, and learn how these proteins and pathways are mutated. One huge open problem in this domain is protein folding, but there are all sorts of interesting problems from molecular dynamics to mutational effects.
5. Once we’ve found a set of proteins and mutations to those proteins responsible for a disease, suggest molecules to target these proteins and suppress the mutations/change the mutated protein’s function in some way.
Of course, this is a highly reductive view of computational biology, and many problems fall into multiple buckets, or don’t quite fit into any particular one, but it still provides a good framework for understanding the tasks of computational biologists. Integrating all these pieces together is extraordinarily difficult, and each piece requires its own domain knowledge, which perhaps explains why we haven’t significantly shortened the drug discovery pipeline.
I think we’ve gotten fairly good at (1), (2), and to some extent, (3). (4) and (5) seem to remain elusive, although they are becoming hotter areas of research. My own research roughly fits into (4).
First, some biological background relevant to my project. Proteins are responsible for carrying out various functions in the cell and are built as a chain of hundreds of amino acids folded together to create a large molecule. Each amino acid is encoded by three consecutive DNA bases (a codon), so mutations to the exome can change a protein’s sequence and, in turn, its structure. At a high level, we believe that there’s only a small set of proteins out of the universe of 20,000+ proteins that are responsible for diseases like diabetes (through mutations). Within this small set, we further believe there are dense “clusters” of mutations that make diabetes more likely (i.e. the diabetes mutations aren’t just randomly scattered, they’re localized to some part of the protein).
To test this hypothesis, we have a large dataset of patients’ sequenced exomes (the portion of the genome that codes for proteins), some with diabetes, and some without. We then observe mutations, here single-nucleotide polymorphisms (SNPs), that occur in the exome and assign each observed SNP a “directionality” (beta): positive if the mutation is more often observed in diabetic patients, negative if it is more often observed in control patients (the exact beta and its standard error are calculated using GWAS). Many people end their analysis here and just identify the corresponding genes that are mutated, but we want to take our analysis further and find corresponding structural changes in proteins.
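For intuition about what a beta and its standard error look like, here is a rough sketch for a single SNP. It assumes the simplest possible setup, a 2x2 table of carriers and non-carriers among cases and controls, and computes the log odds ratio with its usual large-sample standard error. A real GWAS would typically fit a regression with covariates instead, so treat the function name and the counts below as purely illustrative.

```python
import math


def snp_beta(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Log odds ratio (beta) and its standard error from a 2x2 count table.

    beta > 0: the variant is more common among cases (e.g. diabetic patients).
    beta < 0: the variant is more common among controls.
    """
    cells = (case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    if min(cells) <= 0:
        # The plain formula divides by every cell, so zero counts are rejected.
        raise ValueError("every cell of the 2x2 table must be positive")
    a, b, c, d = cells
    beta = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return beta, se


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry the variant.
beta, se = snp_beta(30, 70, 10, 90)
print(f"beta={beta:.3f}, se={se:.3f}, z={beta / se:.2f}")
```

The sign of beta is the "directionality" described above, and beta divided by its standard error gives a rough sense of how confident we are in that direction.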
In particular, we’re developing an algorithm to test whether the amino acid changes corresponding to genetic (or exomic) mutations tend to cluster together in a small set of proteins at some significance level (we don’t want to observe clustering in a large number of proteins, as that’d imply many proteins are highly responsible for diabetes, which seems unlikely). We’re hoping to combine some recent developments in deep generative models, which help capture higher-order dependencies in protein sequences, with data and knowledge we have about phenotypes to quantify a mutated sequence’s impact on a particular phenotype.
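As a toy illustration of what "testing for clustering at some significance level" can mean, here is a small permutation test. To be clear, this is not our algorithm: it assumes mutations are just positions along the amino acid sequence (ignoring 3D structure and the betas), uses mean pairwise distance as a made-up clustering statistic, and compares it against uniformly random placements. All names and numbers are hypothetical.

```python
import random
from itertools import combinations


def mean_pairwise_distance(positions):
    """Average absolute distance between all pairs of positions (needs >= 2)."""
    pairs = list(combinations(positions, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def clustering_p_value(positions, protein_length, n_permutations=10000, seed=0):
    """One-sided permutation p-value for 'mutations are closer than chance'.

    Null model (an assumption): the same number of mutations placed uniformly
    at random, without replacement, along a protein of the given length.
    """
    rng = random.Random(seed)
    observed = mean_pairwise_distance(positions)
    at_least_as_tight = 0
    for _ in range(n_permutations):
        shuffled = rng.sample(range(protein_length), len(positions))
        if mean_pairwise_distance(shuffled) <= observed:
            at_least_as_tight += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_tight + 1) / (n_permutations + 1)


# Hypothetical 500-residue protein with two different mutation patterns.
clustered = [100, 102, 103, 105, 107]
scattered = [10, 120, 250, 380, 490]
print("clustered:", clustering_p_value(clustered, 500))
print("scattered:", clustering_p_value(scattered, 500))
```

Running a test like this across many proteins would also require a multiple-testing correction, which fits the expectation above that only a small number of proteins should show significant clustering.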
Current models either (1) consider mutations independently (i.e. don’t consider that mutations may have correlated effects) or (2) model mutation correlations with latent variables, but don’t extend this framework to effects on a particular disease/phenotype. Integrating these two models would be the gold standard, but even some framework for understanding the relation between these methods would be helpful.
If we can indeed find, say, 10 proteins and “pockets” of mutations, we could use strategies like virtual screening to suggest potential drugs for diabetes. While we’re applying our technique to diabetes, such an algorithm could be applicable to many diseases.
There is a lot of promising research in all of computational biology, and I hope this post gave you a flavor of the types of questions computational biologists like to ask.