What Even Is Computational Biology?

For the past few months, I’ve been doing computational biology research.  When explaining my research to others, one of the most common questions I get is “what is computational biology?”  

This always catches me slightly by surprise, especially when the question comes from STEM students, but the general lack of knowledge about computational biology is understandable:

  1. Unlike natural language processing or other hot areas of CS/CS-adjacent research, computational biology requires some domain knowledge to understand what the exciting and relevant problems are.  For example, a sixth grader could (at a very high level) understand that NLP scientists want to develop fast and accurate translation algorithms that model the structure of grammar and relationships between words.  In computational biology, many of the interesting problems require some basic understanding of DNA, RNA, and protein structures, which may be unfamiliar to some audiences.
  2. Computational biology hasn’t produced consumer products the way other computer science fields have.  There are no self-driving cars, no Google Translates, and no smartphones built by computational biologists that have revolutionized healthcare at the same scale.  Perhaps, someday, if computational drug discovery and personalized medicine do come to fruition, we may see the same hype around computational biology as we do for other computational fields.  The largest reason for this lack of products is that we don’t understand the science of biology as well as we do the science of linguistics (or any other CS-adjacent field).  Like so many “computational X” fields, our results are constrained by our domain knowledge, and in biology, we don’t know the answers to many questions.

That being said, I’m excited about computational biology, and I personally have two reasons.  The first is that many interesting problems in CS/Stat have some application or analogue in computational biology.

Are you interested in building efficient data structures for pattern matching? Perhaps you can create an efficient method to map reads from an RNA-seq experiment to a reference transcript using k-mer (just substrings of length k) hashing and a structure known as the colored de Bruijn graph.
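To make the k-mer idea concrete, here’s a toy sketch of my own (not an actual read mapper): hash every length-k substring of a reference sequence, then locate a read by looking up its k-mers.  Real tools built on colored de Bruijn graphs are far more sophisticated; this only illustrates the hashing step, and the sequences are made up.

```python
# A toy k-mer index (not a real RNA-seq mapper): hash every k-mer of a
# reference transcript, then find where a read could align via k-mer lookups.
from collections import defaultdict

def kmer_index(reference, k):
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)   # k-mer -> positions in reference
    return index

def candidate_positions(read, index, k):
    """Reference offsets consistent with at least one k-mer of the read."""
    hits = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], []):
            hits.add(pos - j)                  # where the read would start
    return hits

reference = "ACGTACGTTAGC"
index = kmer_index(reference, k=4)
print(candidate_positions("CGTTAG", index, k=4))   # {5}: read matches reference[5:11]
```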

Do you enjoy researching traditional statistical methods for separating signal from noise?  There are all sorts of methods for detecting genetic signals associated with certain diseases.

Interested in approximation algorithms for NP-complete problems?  Many core problems in biology, from multiple sequence alignment to genome assembly, are NP-hard.

Do you want to suggest pockets on proteins where potential drug molecules could bind?  Some people are trying to solve this binding-site prediction problem with deep learning.

Do you like thinking about generative models to capture higher-order dependencies between variables through latent variables?  There are some applications in capturing higher-order dependencies in protein sequences.

In short, there’s a home for all types of CS/Stat-adjacent people.  The second reason is that I find it exciting to have a scientific interpretation for results.  Whereas black-box models may be fine for translation, where all we care about is the right answer, in comp bio we really want to use computer science and statistics to uncover hidden truths about biology (although there certainly are some problems for which just the right answer matters).  This makes every problem all the more interesting, as each requires domain knowledge along with all the usual CS/Stat concepts.

So what actually is computational biology?  I’d like to spend the second part of this post describing my broad framework for computational biology and how I think different areas of research fit into this framework.  

Computational biology is a broad set of often-disparate biological problems which can be solved — or just better understood — using computational methods.  We hope that working on these problems can improve patients’ lives, mostly by filling in the (many) gaps in the process of finding cures for diseases, a process which in its current state is notoriously expensive and slow.  The current pipeline for this process is (very crudely) as follows:

  1. Read and use patient data (e.g. RNA-seq and DNA-seq experiments) to provide a basis for analysis.  How can we efficiently store and search patient data?  How can we create platforms so that other scientists can easily check their data against others’?  Computational biologists have made immense contributions to this step, from BLAST to the Protein Data Bank to everything in between.
  2. Analyze this data and find representations that give you insights into how certain diseases operate at a genetic level.  What genes are overexpressed in patients with cancer X?  How do these genes regulate each other?  Computational biologists have gotten “pretty good” at this data analysis step too, developing methods like GSEA and RNA velocity.
  3. Combine this genetic data with other biological data to build a more comprehensive picture of disease.
  4. Use these disease pictures to find exact proteins and biological pathways responsible for these diseases, and learn how these proteins and pathways are mutated.  One huge open problem in this domain is protein folding, but there are all sorts of interesting problems from molecular dynamics to mutational effects.
  5. Once we’ve found a set of proteins and mutations to those proteins responsible for a disease, suggest molecules to target these proteins and suppress the mutations/change the mutated protein’s function in some way.

Of course, this is a highly reductive view of computational biology, and many problems fall into multiple buckets, or don’t quite fit into any particular one, but it still provides a good framework for understanding the tasks of computational biologists.  Integrating all these pieces together is extraordinarily difficult, and each piece requires its own domain knowledge, which perhaps explains why we haven’t significantly shortened the drug discovery pipeline.

I think we’ve gotten fairly good at (1), (2), and to some extent, (3).  (4) and (5) seem to remain elusive, although they are becoming hotter areas of research.  My own research roughly fits into (4). 

First, some biological background relevant to my project.  Proteins are responsible for carrying out various functions in the cell; each is built as a chain of hundreds of amino acids folded together into a large molecule.  Each amino acid is encoded by three DNA bases, so mutations to the exome can result in structural changes to the protein.  At a high level, we believe that only a small set of proteins, out of the universe of 20,000+ proteins, is responsible (through mutations) for diseases like diabetes.  Within this small set, we further believe there are dense “clusters” of mutations that make diabetes more likely (i.e. the diabetes-associated mutations aren’t just randomly scattered; they’re localized to some part of the protein).

To test this, we have a large dataset of patients’ sequenced exomes (the genome minus the regions that don’t code for proteins), some from patients with diabetes and some without.  We then observe mutations — or SNPs — in the exome and assign each observed SNP a “directionality” (beta): positive if the mutation is observed more often in diabetic patients, negative if it’s observed more often in controls (the exact beta and its standard error are estimated using GWAS).  Many people end their analysis here and just identify the corresponding genes that are mutated, but we want to take the analysis further and find corresponding structural changes in proteins.

In particular, we’re developing an algorithm to test whether the amino acid changes corresponding to genetic (or exomic) mutations tend to cluster together in a small set of proteins at some significance level (we don’t want to observe clustering in a large number of proteins, as that’d imply many proteins are highly responsible for diabetes, which seems unlikely).  We’re hoping to combine some recent developments in deep generative models, which help capture higher-order dependencies in protein sequences, with data and knowledge we have about phenotypes to quantify a mutated sequence’s impact on a particular phenotype. 
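To give a flavor of what “testing for clustering” could look like, here’s a toy permutation test.  To be clear, this is not our actual algorithm (which leans on the generative models mentioned above); it simply asks whether a made-up set of mutated positions sits closer together along the protein sequence than randomly placed positions would, using linear distance as a crude stand-in for structural proximity.

```python
# A toy permutation test for clustering of mutation positions on a protein
# (illustrative only; the positions and protein length are made up).
import random
from itertools import combinations

def mean_pairwise_distance(positions):
    pairs = list(combinations(positions, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def clustering_p_value(mutations, protein_length, n_perm=10_000, seed=0):
    rng = random.Random(seed)
    observed = mean_pairwise_distance(mutations)
    hits = 0
    for _ in range(n_perm):
        null = rng.sample(range(protein_length), len(mutations))
        if mean_pairwise_distance(null) <= observed:   # at least as clustered
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Six mutations packed into one region of a 400-residue protein: a small p-value.
print(clustering_p_value([101, 105, 110, 112, 118, 120], 400))
```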

Current models either (1) consider mutations independently (i.e. don’t consider that mutations may have correlated effects) or (2) model mutation correlations with latent variables, but don’t extend this framework to effects on a particular disease/phenotype. Integrating these two models would be the gold standard, but even some framework for understanding the relation between these methods would be helpful.

If we can indeed find, say, 10 proteins and “pockets” of mutations, we could use strategies like virtual screening to suggest potential drugs for diabetes.  While we’re applying our technique to diabetes, such an algorithm could be applicable to many diseases.

There is a lot of promising research in all of computational biology, and I hope this post gave you a flavor of the types of questions computational biologists like to ask.

On Voting Paradoxes

My post last week sent me down a Wikipedia/Google rabbit hole on voting systems and paradoxes.  I’d heard in passing about voting paradoxes but had never taken the time to explore them.  In this post, I’ll discuss how we can use graphs to model elections and then explore the ideas behind voting paradoxes.

I. A Graph Reduction for Elections

To start, let’s define some notation (note: a lot of the ideas here are from this paper, so check it out if you’re interested!).  Suppose we have m candidates, who we’ll call c_1, \ldots, c_m, and n voters, with each voter casting a ballot ranking some or all of the candidates.  Each ballot is an ordered list B_i = (c_{(1)}, \ldots , c_{(m)}) corresponding to the voter’s preferences.  We can then collect everyone’s ballots into C = \{B_1, \ldots, B_n\}, which we’ll call the election profile.  Our goal is to determine a “fair” method to select a winning candidate.  If we choose to represent the election in matrix form, then let M_{i,j} = \sum_{B_k \in C} \left[ I(B_k \text{ ranks } i \text{ over } j) - I(B_k \text{ ranks } j \text{ over } i) \right] (intuitively, it’s just the margin of voters who prefer i to j).  Note that this matrix is antisymmetric (M^T = -M) with a zero diagonal.

Alternatively, we can model this election as a graph, with vertices corresponding to candidates and directed edges weighted by the margin of votes between each pair of candidates.  In particular, if k of the n ballots rank candidate A over B (assume for simplicity that the remaining n-k rank B over A) and k > n-k, then there is an edge from A to B with weight k - (n-k).  Conversely, if n-k > k, then there is an edge from B to A with weight (n-k) - k (and if n-k = k, no edge exists).

(Feel free to skip this section if you’re comfortable so far)  To see the graph and matrix representations of the problem more concretely, let’s imagine an election with four candidates, {A, B, C, D}, and 100 voters.  The ballots break down as follows:

# voters | Pref. 1 | Pref. 2 | Pref. 3 | Pref. 4 | Ballot ID
33       | A       | B       | C       | D       | 1
34       | B       | C       | A       | D       | 2
32       | C       | A       | B       | D       | 3
1        | A       | D       | B       | C       | 4

Table 1: A paradoxical election (the Ballot ID is just a label for each distinct ballot ordering)

To determine the edge weight and direction between B and D, for example, note that 99 ballots prefer B to D and 1 prefers D to B, so we will have an edge from B to D with weight 99-1=98 in our graph.  In our matrix, M_{B,D} = 98 = -M_{D,B}.  The graph representation is shown in Figure 1.
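If you’d like to verify these numbers yourself, here’s a small snippet (with the ballots hard-coded from Table 1) that builds the full margin matrix M:

```python
# Reproduce the margin matrix M for the Table 1 election.  M[a][b] is the net
# number of voters who rank a above b, so M is antisymmetric with zero diagonal.
from itertools import combinations

ballots = [
    (33, "ABCD"),  # Ballot ID 1
    (34, "BCAD"),  # Ballot ID 2
    (32, "CABD"),  # Ballot ID 3
    (1,  "ADBC"),  # Ballot ID 4
]
candidates = "ABCD"
M = {a: {b: 0 for b in candidates} for a in candidates}

for count, ranking in ballots:
    for a, b in combinations(ranking, 2):   # a appears before b on this ballot
        M[a][b] += count
        M[b][a] -= count

print(M["B"]["D"])   # 98, as above
print(M["A"]["B"])   # 32, the A -> B edge weight in Figure 1
```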

Figure 1: An example paradoxical election. For the edge from A to B, 66 voters prefer A to B and 34 B to A, so the weight is 66-34=32.

There are some paradoxical results here: if we choose A as the winner, note that ballots of type 2 and 3 both prefer C to A, and they account for 66 voters, so a majority prefer C to A.  However, 68 voters prefer B to C, and 66 voters prefer A to B, so the majorities form a cycle, yielding a paradox.  We could break this cycle by counting who is most preferred over D, but that would make the last ballot a “dictator” ballot (in that one person’s vote is used to break a tie in favor of A, even though C is preferred to A).

This reduction (or framework) can actually provide good intuition about elections, as we can translate graph problems and their corresponding algorithms to voting systems, and vice versa.  For example, suppose we wish to determine the smallest set of candidates S such that every candidate in S would beat every candidate not in S in a head-to-head matchup (social choice theorists call this set the Smith Set).  Equivalently, on the directed graph over all the candidates, we wish to find the smallest set of vertices S with no incoming edges from outside S and at least one path from S to every vertex outside of S.  But this set is just a source strongly connected component: a set of vertices that can all reach each other, with no incoming edges from the rest of the graph.

Tarjan’s algorithm can find us a source SCC in linear time, and we can check whether every vertex outside the set is reachable from every vertex in the set (equivalently, whether every other candidate outside the Smith Set is beatable) by running a Depth First Search (DFS) from one of the vertices in the source SCC (to prove this, start by noting that an edge exists between any two candidates except in the case of a tie, and use SCC properties from there).  Figure 2 shows an example.  In fact, any graph ordering algorithm, such as topological sorting, can be interpreted as a method for ordering candidates in some way (for every pair of candidates A, B with A before B in the topological sort, A must not lose to B).

Figure 2: Tarjan’s algorithm gives an ordering of SCCs. {A,B,C} is the Smith Set.

In practice, finding the Smith Set can be used to winnow down a large pool of candidates to a smaller pool for a “runoff” election such that every candidate in the runoff would beat every candidate not in the runoff in a head-to-head election.
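As a sketch of how directly the graph framing turns into code, here is one way to compute the Smith Set from the margin matrix, assuming networkx is available and that there are no pairwise ties (so the condensation has exactly one source SCC).  The hard-coded margins come from the Table 1 election.

```python
# Compute the Smith Set as the source SCC of the "beats" graph (a sketch).
import networkx as nx

def smith_set(candidates, M):
    """M[a][b] > 0 means a beats b head-to-head."""
    G = nx.DiGraph()
    G.add_nodes_from(candidates)
    for a in candidates:
        for b in candidates:
            if a != b and M[a][b] > 0:
                G.add_edge(a, b)              # edge a -> b: a beats b
    dag = nx.condensation(G)                  # DAG whose nodes are the SCCs
    source = next(n for n in dag if dag.in_degree(n) == 0)
    return set(dag.nodes[source]["members"])

# Head-to-head margins from the Table 1 election.
M = {"A": {"B":   32, "C": -32, "D": 100},
     "B": {"A":  -32, "C":  36, "D":  98},
     "C": {"A":   32, "B": -36, "D":  98},
     "D": {"A": -100, "B": -98, "C": -98}}
print(smith_set("ABCD", M))                   # {'A', 'B', 'C'}
```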

It turns out that any deterministic method to select a winner (either by assigning weights to votes, or using some other method) can be broken with some paradoxical election.  The paper I linked earlier proposes a randomized solution, which I discuss in the second half of the post.

II. A Randomized Solution for Paradoxes

To evaluate voting systems, we need some method for comparing two systems (i.e. some notion of voting system X, like first past the post, being better than voting system Y, like ranked-choice).  First, a voting system is merely a function which takes an election profile C of all ballots cast as input and outputs a single winning candidate.  The profile C is drawn from a distribution D_C, which assigns a probability to every possible election profile (this is natural; we tend to think of our own elections as having some level of noise).  If there were three candidates and 10 voters, our distribution could assign a probability of 0.1 to the profile in which 9 voters cast the ballot {A, B, C} and 1 voter casts {C, B, A}, for example.

Let P and Q be specific voting systems, with P(C) = x, Q(C) = y.  That is, on a given profile C, P chooses x as the winner, and Q chooses y.  The relative advantage of a voting system is just:

\text{Adv}_{D_C}(P,Q) = E_{C \sim D_C}\left[ \frac{M_{x,y}}{|C|} \right], \quad \text{where } x = P(C),\ y = Q(C)

It’s important to keep track of what’s fixed and what’s random here.  The quantity inside the expectation says: “given a particular profile C, what is the margin by which voters prefer P’s winner to Q’s winner, normalized by the total number of ballots.”  If P’s winner is less preferred than Q’s, then M_{x,y} < 0, and vice versa.  We then average this quantity over all profiles by taking an expectation with respect to the distribution of profiles.  More simply: for a particular profile C, P and Q give winners x and y, and we can compute both the margin M_{x,y} by which voters prefer x to y and the probability that C is the actual profile, so we can take the expected margin between voting systems P and Q.  Intuitively, the relative advantage represents how many voters prefer P’s winner to Q’s winner in an average election.

Then, P is as good as or better than Q if Adv_{D_C}(P,Q) \geq 0, and P is optimal iff for every other voting system Q, Adv_{D_C}(P,Q) \geq 0.
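To make the expectation concrete, here is a small Monte Carlo sketch of my own (not the paper’s simulation) that estimates Adv(P, Q) with P = Borda count and Q = plurality, under the toy assumption that every ranking is equally likely (“impartial culture”):

```python
# Estimate Adv(P, Q) by sampling election profiles and averaging the normalized
# margin between P's winner and Q's winner (a sketch with made-up settings).
import random

CANDS = "ABCD"

def margin(profile, x, y):
    """Net number of ballots ranking x above y (0 if x == y)."""
    if x == y:
        return 0
    return sum(1 if b.index(x) < b.index(y) else -1 for b in profile)

def plurality(profile):
    return max(CANDS, key=lambda c: sum(b[0] == c for b in profile))

def borda(profile):
    return max(CANDS, key=lambda c: sum(len(b) - b.index(c) for b in profile))

def advantage(P, Q, n_voters=99, n_profiles=2_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_profiles):
        profile = ["".join(rng.sample(CANDS, len(CANDS))) for _ in range(n_voters)]
        total += margin(profile, P(profile), Q(profile)) / n_voters
    return total / n_profiles

print(advantage(borda, plurality))   # an estimate of Adv(Borda, plurality)
```

Swapping in other rules or other profile distributions is a one-line change, which is part of what makes this definition convenient to work with.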

Given this method for comparing voting systems, we can now try to construct the optimal voting system.  To do so, we’ll borrow some ideas from game theory.  In particular, let the margin matrix M be a payoff matrix and let voting systems P and Q be players in a two-player zero-sum game on this matrix.  So, if P selects x as the winner and Q selects y, Q “pays” M_{x,y} to P.  An optimal voting system P_{OPT} would have non-negative expected payoff no matter what strategy is used by a competing voting system.  If P picks candidate x with probability p_x and Q picks y with probability q_y, the expected payoff from Q to P is:

\sum_{x,y} p_x q_y M_{x,y}

If Q is a deterministic voting system (i.e. q_y = 1 for one candidate, 0 for all others), then P can always just select the candidate which maximizes this payoff, and always do at least as well as Q (i.e. never have more voters prefer y over x  — in the worst case, P and Q just choose the same candidate).

To see an illustration of this, see Figure 3.

Figure 3: If Q chooses a column deterministically, P can maximize payout within that column by selecting a row and always at least break even with Q.

But M is antisymmetric, so the game is symmetric: the column player and row player (or P and Q) are interchangeable, and the value of the game is zero.  The optimal strategy is a mixed strategy, in which P assigns each candidate a probability of being chosen as the winner and selects a winner according to these probabilities (for two-player zero-sum games, this pair of minimax strategies is a Nash equilibrium).  This optimal mixed strategy can be calculated using linear programming, and no mixed strategy can beat it in expectation (this follows from the minimax theorem).
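For the curious, here is roughly what that linear program looks like, a minimal sketch assuming numpy and scipy: we maximize the guaranteed expected margin v of a lottery over candidates against any opposing pure choice.  On the Table 1 margins it puts weight only on the Smith Set, roughly (0.36, 0.32, 0.32) on A, B, C and nothing on D.

```python
# Maximin LP for an optimal lottery over candidates (a sketch, not the paper's code).
import numpy as np
from scipy.optimize import linprog

def optimal_lottery(M):
    """M is the antisymmetric margin matrix; returns win probabilities."""
    m = M.shape[0]
    # Variables: p_1..p_m (the lottery) and v (guaranteed payoff); minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opposing pure choice y:  v - sum_x p_x M[x, y] <= 0
    A_ub = np.hstack([-M.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]               # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m]

# Margin matrix for the Table 1 election (candidates A, B, C, D).
M = np.array([[   0,  32, -32, 100],
              [ -32,   0,  36,  98],
              [  32, -36,   0,  98],
              [-100, -98, -98,   0]], dtype=float)
print(np.round(optimal_lottery(M), 3))   # ~[0.36, 0.32, 0.32, 0.]
```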

That was a lot, and if you want a better explanation, I’d encourage you to look at the paper, which goes into much more depth than I do here.  The authors also run a number of simulations comparing the randomized system to other voting systems for various election profile distributions.  The basic idea, though, is simple: no voting system can beat this optimal randomized voting system in expectation.

There are some interesting extensions of this study worth exploring, such as how the optimal strategy changes for different distributions of election profiles (in reality, we probably wouldn’t expect a uniform distribution over profiles), under what constraints a deterministic solution is best, and the characteristics of bad voting systems  — i.e. which voting systems most often lose to random?

It feels as though there’s something inherently undemocratic here, as democracy shouldn’t leave election winners to chance.  I’d agree, and I think there are practical benefits to a well-designed, (possibly) paradoxical/unfair system that everyone understands and trusts.  Maybe that system is a two-party system, maybe it’s ranked-choice as I discussed in the last post, or maybe it’s something else, but this framework for understanding voting can help guide our decisions.

That being said, I think it’d be interesting to implement randomized voting systems in certain circles (such as selecting leaders of scientific communities?) as test runs.

Democracy and the Central Limit Theorem

In the spirit of trying to reframe legal and philosophical questions through statistics and computer science frameworks, I thought I’d write about some parallels I see between democracy (or lack thereof) and probability.  The law of large numbers and central limit theorem are ubiquitous, so it should be no surprise that we leverage them in so many algorithms and applications (like variational inference, maximum likelihood, etc.).  At its core, democracy is an application of the central limit theorem: averaging together the many (possibly extreme) preferences of people guarantees some “average” preference which is perhaps not optimal, but certainly not terrible.

Democracy is an average of preferences, but, in keeping with the previous post, let’s consider it an average of ethical frameworks as well (people’s preferences are guided by their ethical views, so this is a reasonable assumption).

Suppose we had some method for scoring people’s “goodness” and “badness.”  You and I may have different notions of what it means to be good, but our scores will almost certainly be correlated; we both think, hopefully, that it’s bad to murder and good to be kind to others.  For the sake of this post, let’s stick to one scoring method, which we’ll call S (every person i may have their own method S_i, but the expected correlation between scoring methods is quite high).

Let’s now define a distribution D_S of people over these scores.  That is, given a random person i in the world, what’s the probability that they are at least as good as some score x?  This is just P(p_i \geq x) = 1 - F_{D_S}(x), where p_i is person i’s score under S.

Note that a single random draw from this distribution can easily land on an extreme value, but, as the law of large numbers and central limit theorem tell us, as we draw more and more samples, the average converges to the distribution’s true mean (and the variance of that average shrinks toward zero, at a rate of 1/n).

This provides us with a basis for understanding democracy; we average the (possibly extreme) preferences of many, many people with:

  1. the assumption that extreme preferences are more likely to be worse for society than average preferences
  2. a belief that the average human being is good

I’d argue that these are two reasonable assumptions.  First, note that one bad apple can spoil the lot.  That is, it only takes one person with extreme preferences to ruin the lives of thousands of people with average preferences.  A great example of this would be North Korea, where Kim Jong-Un single-handedly has the power to enact any law he’d like (and unfortunately exercises this power often).  In everyday life, mass shootings carried out by one person can ruin the lives of hundreds.  More mathematically, suppose 1 out of every 100 people is a psychopath and scores extremely low on the goodness rating (so the probability that a random person is this bad is 1%!).

Dictatorships, which are analogous to one random draw from D_S, would give a 1% chance of selecting someone this bad, and that’s not even considering the lack of accountability that may further corrupt the person’s morals!  Democracies, on the other hand, would almost surely prevent this possibility.  So, in any functional society, it’s necessary to protect the masses against the tendencies of an extreme few.
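A quick simulation (a toy model with made-up numbers) makes the point sharper than prose: 1% of the population scores catastrophically low, a “dictatorship” is a single draw from the population, and a “democracy” is an average over many draws.

```python
# Dictatorship = one draw from the score distribution; democracy = an average.
import numpy as np

rng = np.random.default_rng(0)
pop_size, n_voters, n_trials = 1_000_000, 1_000, 2_000

scores = rng.normal(0, 1, pop_size)
scores[: pop_size // 100] = -10.0    # the worst 1%: extreme "bad" outliers

dictators = rng.choice(scores, n_trials)                              # single draws
democracies = rng.choice(scores, (n_trials, n_voters)).mean(axis=1)   # averages

print((dictators < -5).mean())    # ~0.01: about a 1% chance of a catastrophic leader
print((democracies < -5).mean())  # 0.0: averaging washes out the extreme tail
```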

Why, then, do democracies fail to elect good leaders sometimes?  A few theories may help explain:

  1. CLT only holds under IID assumptions.  That is, each voter must be drawn from the same distribution.  This distribution is not static; it can be influenced by media, family, friends, etc., but on the day of the election, each voter is an IID draw from the distribution.  In reality, this is sometimes not the case.  Voter suppression (such as not letting certain minorities vote), election rigging, and misinformation can skew this distribution and/or violate the independence assumption.
  2. Suppressing others’ votes is probably correlated with psychopathic tendencies (or scoring “bad” on the scale), so the “bad” people are oversampled relative to the “good” people.

It’s no surprise that these ideas are not new; in fact this paper by two philosophers explores applications of Condorcet’s Jury Theorem.  The theorem is simple and follows from basic probability (some binomial sums):

“Suppose voters have to decide between one incorrect and one correct option (imagine a jury, where there is a true verdict).  If p, the probability of voting correctly, is greater than ½, then in the limit, the probability that the voters choose the correct decision is 1.  If p is less than ½, then in the limit, the probability that they vote correctly is 0.”

There are some interesting consequences of this theorem.  For example, with 100 million voters and a 50.1 percent chance of each voter voting correctly (suppose “yes” for proposition A), a 51% vote share for A is already overwhelming statistical evidence against a 50-50 null hypothesis (the p-value is far below 0.01), so in that sense there’s practically no measurable difference between 51% of voters choosing A and 70% choosing A.  This seems particularly bizarre, because we traditionally think that a larger vote share gives the winning candidate a stronger mandate.  If we think a 70-30 margin is a landslide, then what’s to say a 51-49 margin is not?  After all, in elections, one could argue that our null hypothesis is the two candidates being equally popular and that the election is testing that hypothesis (what’s the point of an election if the null is that one candidate is preferred?).
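To put rough numbers on the theorem itself, here’s a quick calculation (using the normal approximation to the binomial) of the probability that a majority of n voters is correct when each voter is correct with probability p = 0.501:

```python
# Normal approximation to P(majority of n voters is correct | each correct w.p. p).
from math import erf, sqrt

def p_majority_correct(p, n):
    z = (p - 0.5) * sqrt(n) / sqrt(p * (1 - p))
    return 0.5 * (1 + erf(z / sqrt(2)))      # standard normal CDF at z

for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(p_majority_correct(0.501, n), 4))
# 100 -> ~0.508, 10_000 -> ~0.579, 1_000_000 -> ~0.977, 100_000_000 -> ~1.0
```

The jump from a near coin flip at small n to near certainty at 100 million voters is the whole content of the theorem.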

The upshot of all this is that in order for democracies to work, we need voting systems in which each individual vote actually reflects the voter’s preferences.  Right now, we assume that a Bernoulli draw for each voter between two candidates (probability p of choosing A; n independent draws) is the correct model, but there are probably better systems out there.  We’re forcing people to approximate their ethical frameworks and policy preferences on a {0,1} support!

One possibility could be a ranked-choice voting model, which would expand the support of the distribution.  That is, if we have, say, 5 candidates, with 5 points to the first choice, 4 points to the second choice, and so on, we’d have a much larger support (5! = 120 possible orderings of the candidates).  Each voter could then express their preferences as one of 120 possible ballots, which would capture more information about preferences from each vote.  See the diagrams in the appendix for a more rigorous explanation.

Appendix:

Diagram for Ranked Choice Voting

I’ve included a couple diagrams below to clarify how ranked choice voting better approximates the average preferences.  Suppose for simplicity that there are only two axes, healthcare and economy, along which everyone aligns (say more positive is more privatization, more negative is more centralization).  

Note that in forcing voters to choose between two candidates, we constrain the national vote to a line drawn between these two candidates (see Figure 1).  The election therefore becomes a projection of the true national preference, V_{avg}, onto this line between candidates A and B.

With ranked choice voting and more candidates (suppose three, in Figure 2), each vote can be a weighted sum of all these candidates’ positions on healthcare and economy, so the space of possible solutions (or final vote tallies) is less constrained.  The national vote is then constrained to a triangle defined by these three points, and the election is a projection of V_{avg} onto this triangle.  Each voter V_i’s vote is a weighted average of the three candidates, and the election is an average of all V_i.

Figure 1: Two-candidate single voter system is a projection onto a line.
Figure 2: Ranked choice voting offers greater range of potential preference combinations.
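For those who prefer code to pictures, here’s a small numpy sketch of the same idea, with made-up candidate positions and average preference V_avg: with two candidates, V_avg can only be represented by its projection onto the A-B segment, while a point inside the three-candidate triangle can be represented exactly as a convex combination of the candidates.

```python
# Two candidates: project V_avg onto a segment (information is lost).
# Three candidates: represent V_avg exactly as a convex combination (if inside).
import numpy as np

A, B, C = np.array([3.0, 1.0]), np.array([-2.0, -1.0]), np.array([0.0, 3.0])
v_avg = np.array([0.5, 0.8])                  # hypothetical average preference

# Projection of v_avg onto the segment from A to B.
t = np.clip(np.dot(v_avg - A, B - A) / np.dot(B - A, B - A), 0.0, 1.0)
proj_two = A + t * (B - A)

# Barycentric weights on A, B, C; if all lie in [0, 1], v_avg is inside the triangle.
T = np.column_stack([A - C, B - C])
w_ab = np.linalg.solve(T, v_avg - C)
weights = np.append(w_ab, 1.0 - w_ab.sum())   # weights on (A, B, C)
proj_three = weights @ np.vstack([A, B, C])

print(np.linalg.norm(v_avg - proj_two))       # > 0: something is lost on the line
print(np.round(weights, 3), np.linalg.norm(v_avg - proj_three))   # weights, ~0 error
```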

Discussion of Dictator Model

Here we made the assumption that a dictator is akin to one random draw from the national distribution of preferences, but relaxing this assumption actually yields even worse results for dictatorships.  There’s some negative selection here: dictators are more likely to have the narcissistic and psychopathic tendencies needed to reach their positions in the first place, so sampling a dictator from a population is probably skewed even further toward the undesirable portions of the distribution.

Other Assumptions

Several people have pointed out that if the average voter has “bad” preferences, democracies will also fail. This is certainly true, and I thought about including it, but the focus of this article was a theoretical justification for democracy assuming the average voter has “good” preferences. To me, these are two separate (but both very important!) questions, as democracy hinges on two assumptions: (1) voters are “good” and (2) given that voters are “good,” the voting system approximates their preferences well.

There are a variety of problems that can complicate the assumptions in (1), including, but not limited to, the average voter being “bad,” herd mentality, tyranny of the majority, and misinformation.

There are also some issues with the ranked choice model’s feasibility in the United States, given the existing two-party structure (and all sorts of other issues). Again, this is not an endorsement for ranked choice voting’s implementation but rather a (more) mathematical exploration of the differences between voting systems.

A Statistical Analogy for Ethics

I’ve grown a little frustrated with how morality and moral law are often misused in public debate.

To start, let’s consider what a moral framework entails.  In some sense, moral philosophy is merely a model to determine whether an action is right or wrong.  Different frameworks specify exact models for understanding right and wrong; some consider the output a spectrum; others consider it a binary; others even say it’s impossible to interpret any output.

It’s important to note that the model itself is distinct from the political legitimacy of the model (i.e. how well a model is accepted as legitimate by the government). Our federal laws, for example, realize a philosophical model with strong political legitimacy, but federal laws are amended and changed when that model does not coincide with some other, “truer” philosophical framework.

This notion may seem excruciatingly obvious, but you’ll frequently hear accusations of setting “arbitrary moral guidelines” or appeals to the constitution because it was “willed by the people.”  These attacks only undermine the political legitimacy of the chosen moral framework, not its internal validity.

It’s equally frustrating when critics of a certain moral framework find tiny, irrelevant exceptions as conclusive disproofs.  Just as we wouldn’t discard a machine learning model because of occasional errors, we shouldn’t discard philosophical frameworks because we can find exceptions to them.  

Instead, a more convincing critique takes issue with the assumptions made by the model, similar to how we’d criticize a poorly applied machine learning method.

Moreover, any philosophical framework necessarily makes simplifying assumptions and prescribes categorical laws.  Utilitarians may shake their fists at me and argue that utilitarianism maximizes good, but I would argue that any framework attempts to maximize good, so this naive utilitarianism fails to prescribe action.  Without an exact set of utilities ascribed to actions, I cannot guide my actions to “maximize good.”  And once you give me this set of weights, it’s almost certainly possible to find a counterexample.

Reaching moral skepticism from these observations is even more preposterous: that would be equivalent to saying “because physics is sometimes contradictory, we should discard our entire conception of the universe.”

To see how all of these pieces fit together, let’s imagine morality as a sort of mathematical problem.  Imagine we have a set of (many, many) axes which contextualize each action we take (e.g. the time of the action, who is there, their positions in society, etc.—every possible degree-of-freedom of a given action).  Each action we take is a dot in this large grid, and we wish to classify these dots as moral (“1”) or immoral (“0”) decisions (or a regression problem if you believe morality is a spectrum).  There exists a true classification (although this is debatable), and we want to find a model that best approximates this true classification.

We can, therefore, think of categorical laws such as “never steal” as analogous to simpler models like linear regression.  We can add conditions to increase model complexity, such as “never steal unless the person is six feet tall and has seven kids.”  More generally, model complexity, bias, and variance in machine learning can be analogized to moral frameworks:

  1. As we increase model complexity, we lose interpretability, which is especially concerning in moral philosophy (where frameworks serve as interpretable guides to action), so we should penalize convoluted moral frameworks.
  2. Taken to an extreme, we can have vague moral models which, while perhaps technically “correct,” are so “noisy” and uninterpretable that they paralyze action.  Many formulations of utilitarianism fall under this umbrella: someone blindly asserts that the best thing to do is to do the most good and then quantifies some metric like total happiness to maximize.  Absent a concrete method to achieve this ideal, utilitarianism can be used to justify almost any policy with some set of weights.  Think about it: when is the last time a politician made an argument and didn’t invoke some form of “everyone will be better off”?  Instead, if we trade some model complexity for categorical laws (which international law aims to do, for example), we may often achieve better results, even by the utilitarian’s own metrics, similar to how we aim for the optimal tradeoff between bias and variance when selecting machine learning models.  Perhaps going down this path may lead one to rule utilitarianism (https://en.wikipedia.org/wiki/Rule_utilitarianism).
  3. While simple moral models may have high bias (i.e. inaccurate in many scenarios), complex moral models will have higher variance as they paralyze action.  For example, many religious texts have complex moral guidelines, so we defer the interpretation to our denomination and priests, yielding higher model variance.  On the other hand, simple moral guidelines like the Ten Commandments might incur significant contradictions, but are far more actionable.

While small and relatively trivial, this example illustrates how we can cross-apply other fields to guide understanding of ethical systems.

It’s not a secret that different fields often employ their own jargon, which makes their ideas (slightly) inaccessible to other fields, but connecting ideas across fields (e.g. bias-variance tradeoff and moral philosophy) can illuminate how the same logic often underlies different results.