Are Neutral Models Useful? [theory]

Everything is made for the best purpose. Our noses were made to carry spectacles, so we have spectacles. Legs were clearly intended for breeches, and we wear them.

  • Dr. Pangloss (Voltaire’s Candide)

Many scientific questions begin with why.

Take evolutionary biology: why do some biological traits persist while others disappear? Why is there so much biodiversity?

Similar questions can be asked of the products of cultural evolution, such as language. Why do human languages look the way that they do? Why does a lexicon have the words that it does, as opposed to some other set of words?

In both cases, a satisfying explanation might invoke the notion of selection pressures. In biology, for example, some traits confer fitness (i.e., helping an organism reproduce), while others hinder it (i.e., causing an organism to die before reproducing). And in language, some words or expressions might be easier to learn (or produce, remember, etc.) than others, and thus may be more likely to be propagated across generations or communicative interactions.

It’s tempting, therefore, to attribute any observed “trait”––be it biological or linguistic––to some kind of positive selection pressure. After all, if a trait hasn’t died out, shouldn’t that mean it improves fitness? And if some trait isn’t observed, shouldn’t that mean it impairs fitness?

Panglossian Paradigms

The line of reasoning described above is compelling, but we have to be careful with teleological explanations.

One of the famous arguments against teleology in evolutionary explanations comes from Gould & Lewontin (1979): The Spandrels of San Marco and the Panglossian Paradigm. Among other things, the paper argued that so-called “adaptationists” of the time were overly prone to invoking direct selection as an explanation for particular biological traits, as opposed to asking how those traits might have emerged as a byproduct of other selection pressures. In doing so, the authors extended the architectural term spandrel and gave it a biological meaning.

A “spandrel”, in architecture, is the triangular space between the tops of two adjacent arches (or between an arch and a rectangular frame), which tends to be decorated with elaborate paintings or engravings. Given such elaborate design, Gould & Lewontin suggest that one is tempted to view it as the “starting point of any analysis, as the cause in some sense of the surrounding architecture”—even though such a view would clearly be false. Analogously, having observed a particular biological trait (Trait X) subserving some particular function (Function X), we are tempted to infer that: 1) the trait was directly selected for; and 2) the trait was directly selected for that particular function.

But there are a couple problems with this inference.


First, Trait X might have originally evolved to subserve a different function, e.g., Function Y, and it has now been co-opted or exapted for Function X. There are plenty of examples of exaptation: bird feathers, for example, initially helped with temperature regulation, and were later exapted (and then further adapted) for flight. In some of these cases, the interpretive constraint is pretty subtle: it’s not necessarily that Trait X wasn’t adapted at all for Function X—i.e., once bird feathers were exapted for flight, they underwent additional selection pressures for that function—it’s just that the original locus of selection was for Function Y.

Similarly, one might endorse Chomsky’s view that the use of human language for communication represents exaptation, not adaptation (Hauser et al, 2013), but still believe that subsequently, human language has been shaped by communicative pressures.

Hitch-hiking traits

It’s also possible that Trait X emerged because of a selection pressure for Trait Y—i.e., Trait X is the byproduct of selection for Trait Y.

Though similar to exaptation, there’s an important mechanistic difference. With exaptation, a trait that originally served one function is repurposed for another. With “hitch-hiking”, a trait isn’t directly selected at all, but is the natural consequence of selection for a different trait.

There are all sorts of subtleties here, and I want to be careful not to overstate the case for this kind of phenomenon. But broadly, it’s important to remember that neither genes nor traits are atomic, independent entities: changing one gene can have cascading effects elsewhere in the genome; or, as is the case in pleiotropy, can affect multiple traits simultaneously.

Fuzzy boundaries

What exactly is a “trait”? What are its definitional boundaries? How do we distinguish one trait from another?

The problem here is a philosophical (and linguistic) one. As Gould & Lewontin (1979) note, “organisms are integrated entities, not collections of discrete objects…If we regard the chin as a ‘thing’, rather than as a product of interaction between two growth fields (alveolar and mandibular), then we are led to an interpretation of its origin (recapitulatory) exactly opposite to the one now generally favored (neotenic)” (pg. 585).

I’m certainly not an expert in chin morphology. But the more general point here is that what we call “traits” are often human abstractions—things we consistently notice and refer to. But naming something doesn’t mean it’s “real” in the sense that it was directly selected for; we might be noticing the wrong thing.

Again, the argument is not that identifying traits and their genetic underpinnings is a hopeless endeavor. It’s just complicated, and we ought to have epistemic humility about our ability to carve the world—or the chin—at its joints.

Random genetic drift

Finally, the prevalence of a particular trait in a population may increase (or decrease) due to genetic drift. In this case, the trait need not confer a selective advantage. Rather, a particular allele may become fixed simply as a byproduct of random sampling.

The Wikipedia page on genetic drift gives a great explanation using the analogy of marbles in a jar. The basic setup is: imagine you have a jar with 10 red marbles and 10 blue marbles; these represent the 20 “organisms” in a population. Now, to create the next generation, randomly sample a marble from the original jar (let’s say it’s blue), deposit a marble of that same color in a new jar (representing the newest generation), and return the sampled marble to the original jar. Repeat this until the new jar also holds 20 marbles. This new jar will have a mixture of red and blue marbles, but the precise breakdown will depend on the outcome of random sampling from that initial jar. For example, maybe we sampled 12 blue marbles and only 8 red marbles. Crucially, we then apply the same process to the second jar to create the third generation. Now, blue marbles are more prevalent than red ones (a 12:8 ratio), which means that they’re more likely to be sampled (60% vs. 40%), which in turn means that the third generation might have an even higher proportion of blue marbles! The proportions keep wandering like this from one generation to the next, and sooner or later one color reaches fixation: e.g., no more red marbles. This isn’t because the blue marbles are more fit than the red marbles. If we’d instead sampled 12 red marbles and only 8 blue ones in the second generation, then red marbles would have been the more likely color to reach fixation.

A crucial component of this mechanism is that each “sample” is dependent on the current population structure—and in turn, each successive generation can change that population structure. So originally, the two alleles were evenly distributed across the population; any variance in which allele we sampled more often was purely a product of random chance. But that variance results in an uneven distribution of those alleles among successive generations, which has a compounding effect.
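The marble analogy is easy to simulate. Below is a minimal sketch, assuming sampling with replacement and a fixed jar size of 20; the function names, the seed handling, and the `max_generations` cap are illustrative choices, not part of the original analogy.

```python
import random


def next_generation(jar, rng):
    """Fill a new jar by drawing len(jar) marbles from the current jar, with replacement."""
    return [rng.choice(jar) for _ in jar]


def simulate_drift(n_red=10, n_blue=10, max_generations=10_000, seed=0):
    """Resample the jar until one color is fixed.

    Returns (winning color, number of generations, blue count per generation).
    The winner is None if no color fixed within max_generations.
    """
    rng = random.Random(seed)
    jar = ["red"] * n_red + ["blue"] * n_blue
    history = [jar.count("blue")]
    for generation in range(1, max_generations + 1):
        jar = next_generation(jar, rng)
        history.append(jar.count("blue"))
        if len(set(jar)) == 1:  # only one color left: fixation
            return jar[0], generation, history
    return None, max_generations, history


if __name__ == "__main__":
    # Neither color is "fitter", so across many runs each color should win about half the time.
    winners = [simulate_drift(seed=s)[0] for s in range(200)]
    print("blue fixed:", winners.count("blue"), "| red fixed:", winners.count("red"))
```

Printing `history` for a single run shows the blue count wandering up and down before it gets absorbed at 0 or 20, which is the compounding effect described above.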

In contrast, if every generation were the product of sampling from that original generation, we’d expect the blue:red ratios to follow a binomial distribution (which, per the Central Limit Theorem, looks roughly normal), centered around 10:10, with some jars having more red marbles and some having more blue marbles. But in the case of evolution, each generation is dependent on the generation that came before.
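For comparison, here is a sketch of the counterfactual: every new jar is filled by sampling from the original 10:10 jar, so nothing compounds. Strictly speaking the resulting blue counts are binomial, which is approximately bell-shaped; the number of jars and the text histogram are illustrative choices.

```python
import random
from collections import Counter


def blue_counts_from_original(n_jars=10_000, jar_size=20, p_blue=0.5, seed=0):
    """Fill many jars, each time sampling from the ORIGINAL jar (p_blue never changes)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_jars):
        blue = sum(rng.random() < p_blue for _ in range(jar_size))
        counts[blue] += 1
    return counts


if __name__ == "__main__":
    counts = blue_counts_from_original()
    for blue in range(21):
        # One '#' per 50 jars: a rough text histogram centered on 10 blue marbles.
        print(f"{blue:2d} blue: {'#' * (counts[blue] // 50)}")
```

Here all-red or all-blue jars are vanishingly rare, whereas in the generation-by-generation version one of those two outcomes is where every run ends up.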

The key takeaway here is that we can’t simply take the prevalence of a particular trait as evidence that the trait was selected for. Even in the absence of the “hitch-hiking” or exaptation mechanisms I described above, alleles can reach fixation purely as a byproduct of random sampling.

This is where neutral theory comes in.

The Rise of Neutral Theory

The core idea of “neutral theory”, as originally formulated (Kimura, 1979), is as follows:

at the molecular level most evolutionary changes are caused by the “random drift” of selectively equivalent mutant genes. (pg. 98)

Note the phrase at the beginning: “at the molecular level”. Kimura (1979) distinguishes between the mechanisms underlying phenotypic evolution, on the one hand, and molecular evolution on the other:

Even if Darwin’s principle of natural selection prevails in determining evolution at the phenotypic level, down at the level of the internal structure of the genetic material a great deal of evolutionary change is propelled by random drift. (pg. 126)

The argument, then, is not that observable phenotypic variation is primarily due to random drift. Rather, Kimura is arguing that a huge number of genetic mutations are fundamentally “neutral”, in that they are not directly advantageous or disadvantageous for an organism’s survival and reproduction. Many (or perhaps most) of these mutations will simply disappear from the population over time, but a handful will reach fixation eventually:

Neutralists, on the other hand, contend that some mutants can spread through a population on their own without having any selective advantage. In the course of this random drift the overwhelming majority of mutant alleles are lost by chance, but a remaining minority of them eventually become fixed in the population. (pg. 100)

There are some important consequences to accepting this premise. Notably, if alleles can reach fixation not through selective advantage but through random drift––and if, as Kimura suggests, this is actually quite common––then simply observing that an allele has reached fixation does not entail adaptation per se.

Moreover, Kimura points out that simple, quantitative models rooted in this premise of neutrality can do a good job of explaining evolutionary changes within and across populations. If parsimony is a goal of our scientific theories, then it would seem we ought to prefer these simple, “neutral” models over others that invoke more complicated, sometimes circuitous selection pressures. But more on that in a bit.

What’s the alternative?

Here, I think it’s important to emphasize what Kimura (1979) is not saying, and also reiterate the position he’s arguing against.

First, Kimura (1979) points out that these neutral alleles are not necessarily functionless:

The neutral theory, I should make clear, does not assume that neutral genes are functionless but only that various alleles may be equally effective in promoting the survival and reproduction of the individual. (pg. 100)

For example, there’s considerable redundancy in the genetic code: different genes, or different alleles of the same gene, can code for the same protein. As long as these variants don’t produce downstream consequences that differentially affect an organism’s fitness, they are equivalent, i.e., neutral.

Second, according to Kimura, the dominant view at the time was one in which virtually all biological traits could be interpreted through the light of adaptive evolution. In this adaptationist view, few (if any) genes were truly “neutral”; even if we didn’t understand the function of a particular mutation, the assumption was that if one allele “won out” over another, it was more adaptive in some way.

Competing theories

I’m obviously an outsider to evolutionary biology. But looking in, it seems like there are at least two significant differences between the neutralist and adaptationist views. (I should emphasize that this is purely my own interpretation of the differences, having read some of the relevant literature. If anyone reading this knows otherwise, don’t hesitate to let me know.)

The first difference regards the presumed prevalence of neutral genes. Neutralists like Kimura (1979) argue that a large number of genetic mutations are neutral, while adaptationists think that very few are truly neutral (1).

This leads us to the second difference: the two theories have different views of what the appropriate null hypothesis ought to be. That is, given some particular mutant allele with an unknown function, but which appears to be reaching fixation in a population, what should our default assumption be about whether that allele confers a selective advantage? Because adaptationists believe there are few truly neutral mutations, they might argue that we should assume there is some advantage to that mutation, even if we don’t yet know what it is. Then, the scientific work would lie in formulating various theories of how exactly that mutation is more beneficial than competing variants. In contrast, neutralists think that a large number of mutations are selectively neutral; thus, they might argue that we should assume this newly observed mutation is also neutral, unless presented with evidence that it leads to phenotypic variation that would plausibly confer a selective advantage.

So should we assume selection or neutrality, given some empirical observation? It’s hard to say. But importantly, neutral models have helped serve as baselines––i.e., a null hypothesis––in studies of molecular evolution. For example, Hey (1999) writes that even though certain assumptions of neutral theory were disproven in the fruit fly, neutral theory played (and continues to play) an essential role as a null model:

Indeed, neutrality continues to be the baseline limiting case for virtually all evolutionary genetic theory, and Kimura’s theoretical discoveries are continuously drawn upon by evolutionary geneticists for all manner of applied and basic research questions. (pg. 37)

I think this question of baselines is a fascinating one, and I’ll return to it later on in the essay. But first, I want to talk about the topic that inspired this post in the first place: models of cultural evolution.

Neutral models in cultural evolution

So far, this post has focused primarily on biological evolution. But in Cognitive Science and Linguistics, many researchers are also interested in so-called cultural evolution: the mechanisms by which cultural structures or products (like language) change across generations.

Primer on cultural evolution

The study of cultural evolution imports a number of concepts from evolutionary biology.

Let’s focus on the example of a language, or particular features of a language. We might consider these features to be tantamount to “organisms” or “genes”, embedded in an “ecosystem” of human speakers (and of course the actual environment in which those human speakers live and interact).

Additionally, we have some set of mechanisms for introducing variation into that system––i.e., for producing mutations. For example, perhaps children make errors while learning a language. Over time, some of those errors make it into the language itself, changing the pool of features that future children must learn (Niyogi & Berwick, 1997; though see Blythe & Croft, 2021 for an argument that the role of errors in development is overstated).

And most crucially for our purposes, some of those features are more advantageous than others. Fitness is likely the product of a constellation of selective pressures, but many theorists focus on learnability and communicative utility (Dingemanse et al, 2015). That is, features of a language need to be learnable––especially by children––and also serve the communicative needs of speakers of that language.

Of course, a paradigm that emphasizes the role of adaptation might lead one to conclude that all regularity in language structure is itself the product of adaptation to some particular selective pressure. But that’s not necessarily the case.

Neutral baselines

As in evolutionary biology (Kimura, 1979; Hey, 1999), the use of neutral models has become more popular in studies of cultural evolution as well.

These neutral models have been put to several different purposes.

Sometimes they serve as a reflection of a particular baseline or null hypothesis (I’ll call this the Baseline Approach). These may not really correspond to a strict definition of “neutral theory” per se; they’re just models that allow us to get a sense of what some system would look like in the absence of a given evolutionary pressure. (One might argue that these shouldn’t be called “neutral models” at all, and that the term should be reserved for models that do fit a stricter criterion, which is described below; I don’t have an opinion on this.)

And some models do fit this stricter criterion: more than just serving as a baseline of some process, they’re directly inspired by Kimura’s model of evolutionary change through genetic drift (I’ll call this the Drift Approach).

I’ll summarize some examples of both approaches below.

The Baseline Approach: Simulated Lexica

Why do we have the words that we do? And why do some wordforms have more meanings than other wordforms?

These questions are about the lexica of human languages: what design principles determine the wordforms that “survive” and the ones that don’t––and what principles govern the distribution of meanings across those wordforms? The intuition behind functionalist theories of the lexicon is that these design principles can be located in the underlying constraints (and affordances) of human cognition and communication. That is, we have the words we have because they’re efficient solutions to the problems of learning, memory, and communication (Richie, 2016).

This is a broad answer to the question, of course––and there are many more specific questions one could ask. You could take practically any fact about human lexica and ask: how does this serve a communicative or cognitive function?

Is the lexicon clumpier than expected?

Consider the phenomenon of phonological neighbors (sometimes called minimal pairs): wordforms differing only in one phoneme (e.g., “cat” vs. “mat” vs. “bat”). Human lexica are clumpy: there are plenty of phonotactically possible wordforms we don’t have, and other areas of the possible phonotactic space are very densely populated.

From the perspective of clarity, this is surprising: perceptually similar wordforms are more confusable, which should impair communication. On the other hand, there’s some evidence that clumpiness actually facilitates learning and speech production (see Dautriche et al, 2017 for a review).

This raises the question: is the lexicon a product of the first pressure (for less clumpiness) or the second (for more clumpiness)? And most importantly: how would we know? After all, the lexica of human languages are exactly as clumpy as they are––how much clumpiness should we expect in the absence of either such pressure?

Dautriche et al (2017) address this question by simulating lexica that obey the phonotactics of a given language. So for four languages (English, Dutch, German, and French), they train up a Markov Model that learns which sounds are most likely to start and end a word, which sounds occur in which order, and so on––and then they generate a bunch of words that sound like words from the target language. That is, they generate a number of artificial lexica, which can serve as baselines for how much clumpiness one expects from the phonotactics of a language alone.
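To make the idea concrete, here is a minimal sketch of a phonotactic baseline. It is not the model Dautriche et al (2017) actually used: it assumes a simple bigram (first-order Markov) model, uses letters as stand-ins for phonemes, and trains on a tiny made-up word list. All names and parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "<", ">"  # word-boundary symbols (assumed not to occur inside words)


def train_bigram_model(words):
    """Count how often each symbol follows each other symbol, including word boundaries."""
    transitions = defaultdict(Counter)
    for word in words:
        symbols = [START] + list(word) + [END]
        for prev, nxt in zip(symbols, symbols[1:]):
            transitions[prev][nxt] += 1
    return transitions


def generate_word(transitions, rng, max_len=12):
    """Sample one wordform by walking the chain from START until END (or max_len)."""
    word, prev = [], START
    while len(word) < max_len:
        options = transitions[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        word.append(nxt)
        prev = nxt
    return "".join(word)


def generate_lexicon(transitions, size, seed=0, max_attempts=100_000):
    """Generate up to `size` distinct wordforms (distinctness is a simplifying assumption)."""
    rng = random.Random(seed)
    lexicon = set()
    for _ in range(max_attempts):
        if len(lexicon) == size:
            break
        word = generate_word(transitions, rng)
        if word:
            lexicon.add(word)
    return sorted(lexicon)


if __name__ == "__main__":
    toy_words = ["cat", "mat", "bat", "can", "man", "tab", "cab", "stab", "scan", "bats"]
    model = train_bigram_model(toy_words)
    print(generate_lexicon(model, size=10, seed=1))
```

The generated forms obey the sound-sequence statistics of the training words but are otherwise arbitrary, which is what makes them usable as a baseline.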

This allows them to ask: are real lexica more or less clumpy than these baselines?
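One simple way to quantify clumpiness is the average number of phonological neighbors per wordform. The sketch below assumes the narrow definition used above (same length, exactly one differing segment) and again treats letters as phonemes; the two toy lexica are invented for illustration, not real or simulated data from the paper.

```python
from itertools import combinations


def are_neighbors(a, b):
    """True if two same-length wordforms differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def mean_neighborhood_size(lexicon):
    """Average number of neighbors per distinct wordform in the lexicon."""
    words = sorted(set(lexicon))
    neighbors = {word: 0 for word in words}
    for a, b in combinations(words, 2):
        if are_neighbors(a, b):
            neighbors[a] += 1
            neighbors[b] += 1
    return sum(neighbors.values()) / len(words)


if __name__ == "__main__":
    clumpy = ["cat", "mat", "bat", "can"]  # hypothetical "real" lexicon
    sparse = ["cat", "dog", "sun", "pig"]  # hypothetical baseline lexicon
    print("clumpy:", mean_neighborhood_size(clumpy))
    print("sparse:", mean_neighborhood_size(sparse))
```

In an actual comparison, the same statistic would be computed for the real lexicon and for each simulated lexicon, and the real value would be located within the distribution of simulated values.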

As it turns out, real lexica are clumpier. Phonological neighborhoods are larger on average (and more densely connected) in real human languages than they are in the baselines (Dautriche et al, 2017). This suggests that the pressure for clumpiness outweighs its potential costs, and is thus stronger than any pressure for dispersion. But we can only know that by consulting some neutral baseline.

Is homophony efficiently distributed?

Another, related question is why lexica have so much ambiguity.

One way of answering this is to identify which wordforms have the most meanings––i.e., which are the most homophonous. Famously, meanings are disproportionately concentrated among short, frequent, and phonotactically probable wordforms (Zipf, 1949; Piantadosi et al, 2012). Under one account, this reflects a pressure for efficiency. The logic is as follows: if homophones can be disambiguated by context, then some degree of ambiguity is tolerated and even encouraged to avoid redundancy; and further, language users ought to prefer a lexicon in which the more “optimal” wordforms are recycled for more meanings––given a choice between reusing a short word and long word, you should go with the short word. And this is consistent with the empirical evidence: after all, short words are more ambiguous.

But this interpretation doesn’t account for a baseline. One might expect short words to be more ambiguous simply because they’re shorter, even in the absence of a pressure to reuse short words: a random word generator is more likely to produce the same short word twice than the same long word. Similarly, words that are more phonotactically probable should (by definition) be more likely to be generated multiple times, simply by random chance.
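A toy simulation illustrates why length alone produces this pattern. The sketch assumes a deliberately crude generator (a two-letter alphabet, uniform sound probabilities, and a fixed stopping probability); these are hypothetical settings chosen to make collisions visible, not a model of any real language.

```python
import random
from collections import Counter


def random_wordforms(n_meanings, alphabet="ab", p_stop=0.3, seed=0):
    """Assign each meaning a random wordform; the generator has no preference for reuse."""
    rng = random.Random(seed)
    forms = []
    for _ in range(n_meanings):
        word = rng.choice(alphabet)
        while rng.random() > p_stop:  # keep adding sounds until the stopping event
            word += rng.choice(alphabet)
        forms.append(word)
    return forms


def meanings_per_form_by_length(forms):
    """Average number of meanings per distinct wordform, grouped by wordform length."""
    per_form = Counter(forms)
    total_meanings, n_forms = Counter(), Counter()
    for form, k in per_form.items():
        total_meanings[len(form)] += k
        n_forms[len(form)] += 1
    return {length: total_meanings[length] / n_forms[length] for length in sorted(total_meanings)}


if __name__ == "__main__":
    averages = meanings_per_form_by_length(random_wordforms(5_000))
    for length, avg in averages.items():
        print(f"length {length}: {avg:.1f} meanings per wordform")
```

Short forms end up with far more meanings per form than long ones even though nothing in the generator rewards reusing short words, which is the baseline effect any efficiency account has to beat.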

Thus, we need to ask: how many meanings would you expect short, phonotactically probable wordforms to have in the absence of a pressure for or against homophones? This question was taken up in a pair of recent papers, including one by our lab (Trott & Bergen, 2020; Caplan et al, 2020); I discussed them both in more detail in a recent post, but I’ll summarize them briefly here.

Basically, we followed the same procedure that Dautriche et al (2017) used for evaluating clumpiness. But we found that real lexica have fewer homophones than their phonotactics predict. Crucially, real lexica also distribute their homophones less efficiently: short wordforms in real human lexica have fewer homophones than you’d expect on account of their phonotactics.

Once again, the use of a baseline helped resolve a question about whether some empirical observation could be attributed to a selection pressure. But in this case, the answer was no: an explanation of how homophones distribute in human lexica does not need to posit a positive selection pressure for reusing short, phonotactically probable homophones. In fact, given that the baselines had more homophones than real lexica, an explanation might need to posit a pressure against homophony. In this way, the conclusion mirrors one of the tenets (if not the methodology) of neutral theory: Kimura (1979) acknowledges that many mutations are deleterious and are thus selected against.

The Drift Approach

There’s also a growing number of papers that are more directly inspired by Kimura’s (1979) notion of a neutral model. That is, beyond simply using any baseline, they try to explain a given empirical observation by using an evolutionary neutral model in which each “generation” inherits certain properties from a previous generation, and in which the ultimate outcome can be explained via genetic drift.

This is summarized by Bentley et al (2004, pg. 1443):

In the neutral model, there are N individuals, each characterized by a behavioural or stylistic variant, such as a first name…At each time step, each of N new individuals copies its variant from an individual randomly selected from the previous time step. To this very simple process we add innovation (analogous to genetic mutation) as the continuous introduction of new unique variants over time….Having defined this simple model, we can predict the effect of random drift on the statistics of the variants, simply by knowing the size of a population, N, and the innovation (mutation) rate, mu, or even just their product…
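The process in this quote can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: it assumes every individual starts with a unique variant, that innovation happens independently per individual with probability mu, and that variants are just integer labels.

```python
import random
from collections import Counter
from itertools import count


def random_copying(n_individuals=500, mu=0.01, n_steps=500, seed=0):
    """Neutral random-copying model: copy a random member of the previous step, or innovate.

    Returns a Counter mapping each surviving variant to its number of carriers.
    """
    rng = random.Random(seed)
    new_variant = count()  # generator of fresh, never-before-seen variant labels
    population = [next(new_variant) for _ in range(n_individuals)]
    for _ in range(n_steps):
        # The comprehension reads the OLD population; the name is rebound only afterwards.
        population = [
            next(new_variant) if rng.random() < mu else rng.choice(population)
            for _ in range(n_individuals)
        ]
    return Counter(population)


if __name__ == "__main__":
    frequencies = random_copying()
    print("distinct variants:", len(frequencies))
    print("five most common:", frequencies.most_common(5))
```

No variant is preferred over any other here, yet the output typically shows a few very common variants alongside many rare ones.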

Comparing the approaches

There’s an important distinction between this approach and the Baseline Approach described before. Here, there’s special attention paid to the process underlying the model, not just the outcome. The model itself is evolutionary in the sense that there are generations with particular properties, a rate at which new properties are introduced into the population, and a mechanism by which new generations inherit those properties; the papers I described earlier (e.g., Trott & Bergen, 2020) don’t model anything like generational change.

I’m not arguing that one approach is better than the other––simply trying to point out that there is a distinction. I’m just (perhaps unfairly) lumping both under the broader category of “using simulations as a baseline to test an evolutionary claim”.

Using the Drift Approach

Bentley et al (2004) describe how this approach has been used to explain several different cultural properties: 1) the frequency distribution of baby names; 2) the citations of various patents; and 3) pottery motifs in the sixth millennium BC. Importantly, all three of these properties follow a power-law distribution:

A power-law distribution of variant frequencies in a population means that there are many uncommon variants and a very few popular variants that are thousands of times more popular than the majority. (Bentley et al, 2004, pg. 1443)

Often power-laws are interpreted as obeying a “rich get richer” phenomenon (sometimes called preferential attachment). For example, if highly cited papers are more likely to be cited again, this creates a scenario where citation rates will follow something like a power law: the most cited papers continue to be cited even more, and then a large number of other papers are cited much less frequently.
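A bare-bones version of this “rich get richer” dynamic can be simulated as follows. The sketch assumes a simple urn-style scheme in which each new citation either goes to a brand-new paper (with probability `p_new`) or to an existing paper chosen in proportion to its current citation count; the parameter values are arbitrary.

```python
import random
from collections import Counter


def rich_get_richer(n_citations=20_000, p_new=0.1, seed=0):
    """Simulate citations where already-cited papers attract proportionally more citations."""
    rng = random.Random(seed)
    citations = [0]  # one entry per citation, storing the id of the cited paper
    n_papers = 1
    for _ in range(n_citations - 1):
        if rng.random() < p_new:
            citations.append(n_papers)  # a new paper receives its first citation
            n_papers += 1
        else:
            # Picking a random past citation selects a paper in proportion to its count.
            citations.append(rng.choice(citations))
    return Counter(citations)


if __name__ == "__main__":
    counts = rich_get_richer()
    ranked = sorted(counts.values(), reverse=True)
    print("top 5 papers:", ranked[:5])
    print("median paper:", ranked[len(ranked) // 2])
```

The gap between the top few papers and the median paper is the heavy-tailed signature described above.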

Let’s consider the case of baby names specifically. It turns out that baby names follow a power-law distribution (Hahn & Bentley, 2003):

This distribution shows that there are a very few names that are highly popular (in frequencies approaching 10%), whereas there are many names present at very low frequencies (at or below 0.01%). (Hahn & Bentley, 2003, pg. S120)

As an explanation for this pattern, the authors propose a model that amounts to cultural drift:

To explain the stable distribution of name frequencies despite the change in baby name usage over time, we suggest that baby names are value-neutral cultural traits chosen proportionally from the population of existing names, created by ‘mutation’ and lost through sampling. (Hahn & Bentley, 2003, pg. S121).

And later, in Bentley et al (2004), the authors show that a simulation of a neutral model ends up fitting the baby name data quite well. Specifically, given just two parameters, the population size (N) and the mutation rate (mu), the model is able to produce a distribution that looks a lot like the real distribution of baby name frequencies.

Challenges to Neutral Theory

So far, it seems like neutral models––construed either broadly, in terms of “use a baseline to understand an evolutionary pressure”, or more narrowly, in terms of “model evolutionary change using random genetic drift”––provide a useful and even essential lens to adjudicate between competing theories of cultural change.

But at least in evolutionary biology (and ecology), neutral theory has proven a very contentious topic (Hey, 1999), with some (Kern et al, 2018) arguing that it should be rejected (at least in most cases). And some (Leroi et al, 2020) have made similar critiques of the use of neutral models in cultural evolution as well.

So what are these challenges?

In the sections below, I’m going to focus on critiques made of neutral models used in cultural evolution specifically. As I hope I’ve made clear, I’m not a molecular biologist––so my understanding of the debate about neutral theory in evolutionary biology is pretty limited, especially because it seems to hinge on specific predictions about things like the rate of recombination at particular sites in the genome, and levels of polymorphism at those sites. I might try to summarize it in another post, but that might also be better left to a domain expert. If you’re interested, these two recent papers present contrasting views on the utility of neutral theory in molecular biology:

Kern, A. D., & Hahn, M. W. (2018). The neutral theory in light of natural selection. Molecular Biology and Evolution, 35(6), 1366–1371.

Jensen, J. D., Payseur, B. A., Stephan, W., Aquadro, C. F., Lynch, M., Charlesworth, D., & Charlesworth, B. (2019). The importance of the Neutral Theory in 1968 and 50 years on: A response to Kern and Hahn 2018. Evolution, 73(1), 111–114.

Challenge 1: Neutral models lack plausibility

In an article called “Neutral Syndrome”, Leroi et al (2020) argue that the explanatory power of neutral models is overstated. The core argument is aptly summarized here:

We may agree that a model of neutral evolution generates a particular kind of abundance distribution, but the existence of such distributions in nature does not prove that the world is neutral.

In other words: proponents of neutralist explanations assume that because their neutral model fits the data well, it’s an adequate explanation for the data––that is, the real data were generated by a neutral process. But this isn’t necessarily true. The ability of a model to fit the data does not entail that the model is correct.

The authors write:

It is certainly true that many parents bestow existing names on their offspring, but it cannot be, as the Wright–Fisher neutral model supposes, that American parents literally choose names from those given to the previous cohort in proportion to their relative abundances, for if they did then many Christians would call their sons ‘Muhammed’. Nor can it be, as preferential attachment models suppose, that scientists cite papers purely in proportion to how often they’ve already been cited, for if they did, their papers would be incoherent.

This is perhaps a little tongue-in-cheek, but the argument boils down to the observation that copying purely in proportion to frequency (whether through neutral drift or preferential attachment) is simply an implausible mechanism for how parents choose names for their babies, or for how scientists choose which papers to cite. Frequency likely plays some role (e.g., a scientist might cite a paper not because it’s entirely relevant, but because it’s frequently cited for this subject), but clearly it’s not the only determining factor.

Piantadosi et al (2013) make a similar argument:

It is not informative to show that other assumptions could also lead to the observed behavior, if those other assumptions are demonstrably not at play (pg. 6)

In their case, they’re arguing that the ability of a random word generator to approximate a well-known empirical observation (Zipf’s frequency law) does not mean that it ought to have a privileged status as a baseline, let alone as an explanation of the phenomenon. After all, speakers don’t produce words by just randomly sampling sounds from their phonemic inventory.

In both cases, then, the argument is that we shouldn’t evaluate the quality of an explanatory theory purely on the basis of how well it fits the data. We should also consider its plausibility and how well it fits with other things we know about the phenomenon. Put another way, we should consider the prior probability of our theories, not just the likelihood of the data under each theory. Note, however, that Leroi et al (2020) are discussing “truly” neutral models (in the sense of models of drift), while Piantadosi et al (2013) are referring to a statistical baseline (which is “neutral” in the sense that it doesn’t include the posited selection pressure).
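The prior-versus-likelihood point can be made with a line of arithmetic. The numbers below are invented purely for illustration: suppose the neutral model fits the data somewhat better than a selectionist model, but we consider its mechanism much less plausible beforehand.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a set of competing hypotheses: posterior is prior times likelihood, normalized."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical values: how plausible each mechanism seems before seeing the data...
    priors = {"neutral": 0.1, "selection": 0.9}
    # ...and how well each model fits the observed distribution.
    likelihoods = {"neutral": 0.8, "selection": 0.6}
    for hypothesis, prob in posterior(priors, likelihoods).items():
        print(f"{hypothesis}: {prob:.3f}")
```

With these made-up numbers, the better-fitting neutral model still ends up with the lower posterior probability, because its fit advantage is too small to overcome its implausibility.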

Leroi et al (2020) do actually acknowledge the utility of neutral models as baselines:

Viewing neutral models as nulls is more defensible…

But they argue that a more fruitful approach would be one of parameter estimation:

…but we believe that the best way to understand the causes of diversity is to view it as a problem in parameter estimation.

In other words, we shouldn’t just assume it’s a question of drift vs. selection. Instead, include each of these factors as parameters in a model, and treat it as an exercise in parameter estimation. What’s the relative magnitude of the drift parameter vs. the selection parameter?

This suggestion seems fair to me, and could maybe be interpreted as a more pluralistic view in which both neutralist and selectionist accounts hold some weight.
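As a rough illustration of what "parameter estimation" could look like, here is a toy Python sketch. It is not the procedure used by Leroi et al (2020). I assume two competing variants, a Wright–Fisher population whose size `n` is known (this fixes the strength of drift), and a single selection coefficient `s` that is estimated by maximum likelihood over a grid. All names and numbers are hypothetical.

```python
import math
import random

def expected_freq(p, s):
    """Expected frequency of variant A after selection with coefficient s."""
    return p * (1 + s) / (p * (1 + s) + (1 - p))

def simulate_counts(n, s, p0, generations, rng):
    """Wright-Fisher trajectory: binomial resampling (drift) around a
    selection-biased expected frequency."""
    counts = [round(p0 * n)]
    for _ in range(generations):
        p = expected_freq(counts[-1] / n, s)
        counts.append(sum(rng.random() < p for _ in range(n)))
    return counts

def log_likelihood(counts, n, s):
    """Log-probability of the observed trajectory given s (n assumed known)."""
    total = 0.0
    for prev, cur in zip(counts, counts[1:]):
        p = expected_freq(prev / n, s)
        if p <= 0.0 or p >= 1.0:  # variant lost or fixed: nothing left to learn
            break
        total += (math.lgamma(n + 1) - math.lgamma(cur + 1)
                  - math.lgamma(n - cur + 1)
                  + cur * math.log(p) + (n - cur) * math.log(1 - p))
    return total

def estimate_s(counts, n, grid):
    """Grid-search maximum-likelihood estimate of the selection coefficient."""
    return max(grid, key=lambda s: log_likelihood(counts, n, s))

rng = random.Random(1)
n = 2000
counts = simulate_counts(n, s=0.05, p0=0.2, generations=40, rng=rng)
grid = [i / 200 for i in range(-40, 41)]  # s from -0.2 to 0.2
s_hat = estimate_s(counts, n, grid)
print(s_hat)
```

The design choice worth noting is that `s = 0` is just one point on the grid. Pure drift is then not a separate hypothesis to be accepted or rejected, but a special case of a model in which the size of the selection parameter is an empirical question. A fuller treatment would presumably estimate the drift-related parameter as well, rather than assuming `n`.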

Challenge 2: There’s no such thing as neutrality

I’ll add another critique, which isn’t discussed by Leroi et al (2020), but which could be seen as a critique of baselines more generally: none of these models are really “neutral”.

Take the simulated lexica in Trott & Bergen (2020). They were generated using two parameters: the phonotactics of the target language, and the distribution of word lengths in the target language. So in what sense are those lexica “neutral”?

Here’s my counterargument: it’s true that no model is neutral. All models make assumptions in the form of which parameters they include, and which they don’t. But importantly, those simulated lexica are neutral with respect to a particular pressure: i.e., they don’t assume a pressure for or against homophones. So regardless of whether we want to call them “neutral” or not, I think it’s fair to say that they’re agnostic to the selection pressure in question, and thus serve an important role as baselines. But more on that below.
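Here is a toy Python sketch of what such a baseline might look like. It is not the pipeline from Trott & Bergen (2020): I'm assuming a simple bigram model over symbols as a stand-in for "phonotactics", treating letters as phonemes, and using a tiny made-up lexicon. The important property is that nothing in the generator rewards or penalizes reusing a wordform.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(words):
    """Collect next-symbol options for each symbol; '^' marks the word start."""
    transitions = defaultdict(list)
    for word in words:
        prev = "^"
        for symbol in word:
            transitions[prev].append(symbol)
            prev = symbol
    return transitions

def sample_word(transitions, length, rng):
    """Sample a form of a fixed length; return None if the chain dead-ends."""
    prev, out = "^", []
    for _ in range(length):
        options = transitions.get(prev)
        if not options:
            return None
        prev = rng.choice(options)
        out.append(prev)
    return "".join(out)

def sample_lexicon(real_words, rng):
    """One artificial lexicon with the same size and word lengths as the real one."""
    transitions = train_bigrams(real_words)
    lexicon = []
    for word in real_words:
        form = None
        while form is None:  # retry on dead ends
            form = sample_word(transitions, len(word), rng)
        lexicon.append(form)
    return lexicon

def count_homophones(lexicon):
    """Number of entries that share their form with an earlier entry."""
    return sum(c - 1 for c in Counter(lexicon).values())

rng = random.Random(0)
# Hypothetical toy lexicon; repeated forms stand for homophones.
toy_lexicon = ["kat", "bat", "tab", "kab", "at", "ta",
               "bak", "tak", "ba", "ka", "bat", "at"]
baseline = [count_homophones(sample_lexicon(toy_lexicon, rng)) for _ in range(200)]
print(count_homophones(toy_lexicon), sum(baseline) / len(baseline))
```

The generator clearly makes assumptions (bigram phonotactics, matched lengths), so it isn't "neutral" in any absolute sense. But whatever homophony it produces arises without any pressure for or against homophones, which is what makes it usable as a point of comparison.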

Is neutral theory useful?

So where does all this leave us? Is neutral theory useful, or should we abandon it?


I’ll start with a quick recap:

In the first section, I argued that we should be careful about attributing some trait/function pairing to a positive selection pressure. Traits can be exapted for new purposes, hitch-hike on other genes that were selected for, and undergo genetic drift. All these factors point to the difficulty in determining whether a given trait was positively selected for.

In the second section, I described a particular theoretical framework, called neutral theory, that explicitly holds that most mutations at the molecular level are selectively neutral, many are deleterious (and selected against), and relatively few are positive (and selected for).

In the third section, I described the application of this framework to cultural evolution. I also distinguished between the use of “neutral” baselines more generally––which may or may not be committed to a particular process model of generational change––and neutral models that rely on things like drift and preferential attachment as an explanation for cultural trends.

In the fourth section, I described two challenges to neutral theory. First, many neutral models may not actually be plausible models of the phenomenon at hand (Piantadosi et al, 2013; Leroi et al, 2020); thus, we shouldn’t give them special weight just because they happen to explain the data well. And second, no model is truly “neutral”––all models make assumptions.

Moving Forward

Now, we turn to the question posed by the post’s title: given these challenges, is neutral theory useful?

My position is: yes, it is useful––as a baseline.

First, note that I actually agree with part of the “plausibility critique” (Piantadosi et al, 2013; Leroi et al, 2020): the ability of a model to explain a set of data doesn’t necessarily mean that this model is the true account of how those data were generated.

But importantly, this is also true of non-neutral models! Which takes us back to the original point from the first section: lots of empirical observations are consistent with positive selection pressures, but that doesn’t automatically confirm those selection pressures. We have to ask: what other theories would predict those same empirical observations? Can they be explained by a theory that doesn’t posit those selection pressures (2), i.e., a theory that’s “neutral” with respect to those pressures?

Of course, this “neutral” theory needs to be plausible as well. But to the extent that it is plausible, and to the extent that it explains empirical data just as well as a model that does posit selection pressures, I think it makes sense to view these neutral models as baselines.

Under one view, these baselines are akin to null hypotheses. In null hypothesis significance testing (NHST), we never accept the null hypothesis––we either reject it, in favor of the alternative hypothesis, or we fail to reject the null. Similarly, if a neutral baseline explains the data as well as (or even better than) a model that includes the posited selection pressure, we should fail to reject that neutral baseline as a plausible account of the data. In other words, we’re not licensed to posit the selection pressure; the data can be explained equally well without it. On the other hand, not everyone agrees with the logic of NHST. So under another view, these neutral baselines might be seen as actually competing theories of the data––not just a null hypothesis.
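Under the null-hypothesis reading, the comparison can be sketched as a simple Monte Carlo test: compute some statistic on the real data, compute the same statistic on many datasets generated by the neutral baseline, and ask how often the baseline produces a value at least as extreme. The function below is a minimal sketch of that logic, and the numbers are invented purely for illustration.

```python
def empirical_p_value(observed, baseline_values, tail="greater"):
    """Share of baseline values at least as extreme as the observed statistic.

    Uses the common +1 correction so the estimate is never exactly zero.
    """
    if tail == "greater":
        extreme = sum(v >= observed for v in baseline_values)
    elif tail == "less":
        extreme = sum(v <= observed for v in baseline_values)
    else:
        raise ValueError("tail must be 'greater' or 'less'")
    return (extreme + 1) / (len(baseline_values) + 1)

# Hypothetical statistic values from ten baseline simulations.
baseline_values = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]

p_far = empirical_p_value(20, baseline_values)      # outside the baseline range
p_typical = empirical_p_value(13, baseline_values)  # well inside it
print(p_far, p_typical)
```

An observed value of 20 is larger than anything the baseline produced, so the baseline looks like a poor account of it. An observed value of 13 is entirely typical of the baseline, so there is no license to posit an extra pressure. In practice one would want far more than ten simulations.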

Either way, my view is that building out a neutral model––building a baseline––should be an important part of making any evolutionary claim, if only as a kind of disciplining exercise (3). Just-so stories are satisfying to construct (and to read), and that makes them all the more tempting. Building a baseline makes our task a little harder, but I think that’s a good thing.



Alonso, D., Etienne, R. S., & McKane, A. J. (2006). The merits of neutral theory. Trends in Ecology & Evolution, 21(8), 451–457.

Bentley, R. A., Hahn, M. W., & Shennan, S. J. (2004). Random drift and culture change. Proceedings of the Royal Society B: Biological Sciences, 271(1547), 1443–1450.

Blythe, R. A., & Croft, W. (2021). How individuals change language. PLOS ONE, 16(6), e0252582.

Caplan, S., Kodner, J., & Yang, C. (2019). Miller’s Monkey Updated: Communicative Efficiency and the Statistics of Words in Natural Language. Cognition, 1–19.

Dautriche, I., Mahowald, K., Gibson, E., Christophe, A., & Piantadosi, S. T. (2017). Words cluster phonetically beyond phonotactic regularities. Cognition, 163, 128–145.

Dingemanse, M., Blasi, D. E., Lupyan, G., Christiansen, M. H., & Monaghan, P. (2015). Arbitrariness, Iconicity, and Systematicity in Language. Trends in Cognitive Sciences, 19(10), 603–615.

Gould, S. J., & Lewontin, R. C. (1979). The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proceedings of the Royal Society of London - Biological Sciences, 205(1161), 581–598.

Hahn, M. W., & Bentley, R. A. (2003). Drift as a mechanism for cultural change: An example from baby names. Proceedings of the Royal Society B: Biological Sciences, 270(SUPPL. 1), 120–123.

Hauser, M., Chomsky, N., & Fitch, W. T. (2013). The Faculty of Language: What Is It, Who Has It, and How Did It Evolve? Science, 298(02), 124–129.

Kern, A. D., & Hahn, M. W. (2018). The neutral theory in light of natural selection. Molecular Biology and Evolution, 35(6), 1366–1371.

Kimura, M. (1968). Evolutionary rate at the molecular level. Nature, 217(5129), 624–626.

Kimura, M. (1979). The neutral theory of molecular evolution. Scientific American, 241(5), 98–129.

Kreitman, M. (1996). The neutral theory is dead. Long live the neutral theory. BioEssays, 18(8), 678–683.

Leroi, A. M., Lambert, B., Rosindell, J., Zhang, X., & Kokkoris, G. D. (2020). Neutral syndrome. Nature Human Behaviour, 4(8), 780–790.

Niyogi, P., & Berwick, R. C. (1997). A dynamical systems model for language change. Complex Systems, 11(3), 161–204.

Ohta, T., & Gillespie, J. H. (1996). Development of neutral and nearly neutral theories. Theoretical Population Biology, 49(2), 128–142.

Piantadosi, S. T., Tily, H., & Gibson, E. (2012). The communicative function of ambiguity in language. Cognition, 122(3), 280–291.

Piantadosi, S. T., Tily, H., & Gibson, E. (2013). Information content versus word length in natural language: A reply to Ferrer-i-Cancho and Moscoso del Prado Martin [arXiv:1209.1751]. arXiv preprint arXiv:1307.6726.

Richie, R. (2016). Functionalism in the lexicon. The Mental Lexicon, 11(3), 429–466.

Trott, S., & Bergen, B. (2020). Why do human languages have homophones? Cognition, 205(August), 104449.

Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley Press.


  1. As a side note, I think a number of “big debates” boil down to a question of magnitude. In my corner of Cognitive Science, people debate about questions like “are cognitive processes modular” or “does language influence thought”. But as far as I can tell, most people are willing to cede a little ground (e.g., even the strongest modularist might concede that the language one speaks can influence aspects of cognition, just in small ways around the margins). This means the value of any particular scientific finding kind of depends on one’s null hypothesis––which takes us back to the idea of neutral models. 

  2. And of course, this isn’t just true of questions about evolution and adaptation: it’s always critical to consider alternative accounts of how a given set of data were generated. 

  3. I think this view is actually consistent with the critique presented by Leroi et al (2020). They argue that a model of preferential attachment is a poor explanation of how baby names are chosen. And that might be true: but critically, in the baby name case, no alternative was presented––the neutral model was not being used as a baseline, but rather as a stand-alone explanation. Moving forward, perhaps that question might be fruitfully addressed by developing such alternative theoretical explanations––ideally linking individual decision-making by parents with large-scale trends. And importantly, the insights from Bentley et al (2004) might actually inform these theories. For instance, I agree with Leroi et al (2020) that it’s unlikely that parents consciously sample from the frequency distribution of baby names; on the other hand, the frequency of a given name––in a particular social group––likely influences the salience of the name, and perhaps even the extent to which parents favor it. Alternatively, maybe it’s not just about raw frequency, but whether parents have positive or negative valence towards other individuals with that name. Researchers could formalize these accounts––the Raw Frequency Account, as well as the Frequency + Valence Account––and ask whether one is a better explanation of the resulting distribution of baby names. 

Written on June 25, 2021