Thursday, April 30, 2015

The tail that doesn't wag: Why?

We know of countless traits that are due to mutations in single genes.  The same allele may not always confer exactly the same disease or other trait, because other genes or environmental factors may contribute, but for most purposes this additional variation is unimportant or at least tractable.  These traits appear in families in roughly Mendelian proportions, as has long been known.

However, most traits, including most of the common diseases, do not segregate in families. Instead, it is clear that many different genetic factors are contributing, and most of these effects are individually small.  Further, lifestyle factors are usually even more important, in aggregate.

GWAS and other gene-searching methods have shown that this is likely to be the case for many important diseases, as we’ve written about here many times (and many others have written about elsewhere).  The state of such traits can be depicted schematically as in this figure, where just 3 different genes, each with two states, are shown.  The different combinations of variants are shown with their average associated trait value.  Here, the capital letter allele at each gene confers greater trait value.  This is highly simplified but represents our usual reconciliation of Mendelian genetic effects and the complex traits we observe.  The simplification, as far as it goes, doesn’t affect the main point.  In essence the causal model is that a huge (in many applications, essentially infinite) number of contributing genes is involved, each making a minor contribution.

Schematic distribution of stature and contributing genotypes

The causal complexity so often observed is a problem for the understandable yearning for simple causation, and for the promise of ‘precision’ or highly predictive medicine based on genotype.  We won’t here yet again belabor the reasons this is a culpable fantasy being perpetrated on the paying public, because we’re after a different point.

If most individuals are in the middle of the population’s trait distribution (e.g., are of roughly average height in the figure), you can see that there are many different genotypes that confer the same trait value.  The sample size is large for the middle group, the bulk of the population, but no single genotype stands out as ‘causing’ average height.  But this is a general model of equal effects, and perhaps there are ways to go beyond such population averages and mine the data for those individual variants that do have identifiably strong effects.

In this figure, as in most textbook illustrations of the point, only a few genes (and those with only two variants each) are shown. But really we know there are tens, hundreds, or even thousands of genome regions that may contribute. This might suggest some strategies. Perhaps the extremes—the tails—can be used to inform us about what is going on in the whole distribution. We can let the tail wag the genetic causal dog in a few ways, perhaps.

The idea of tractability
The search for causal relationships is necessarily reductionist and naturally leads to the search for study design or analysis 'tricks' to turn complexity into simplicity, or at least tractability by some meaningful standard.  Can some approaches show that things are not so complex after all, with simpler genomic causation for at least some segments of the population?  The following are instances in which this is thought possibly to be so.

1.  Rare variants in families.  Rare variants with strong effect can sometimes be found in close relatives with the same trait.  This may mean a clear-cut and hence typically also rare trait--something far from the norm.  Buried in a population sample, they could simply not be frequent enough to generate a statistical signal.  But close relatives share big chunks of their genome as well as environments, so one must have some criterion for assigning causation to a genetic variant. A variant that creates a stop codon in a physiologically relevant ('candidate'?) gene would be one such.  Even in the huge general population samples that are being collected, families will be identifiable, so a once-old-now-new family-based approach may be able to find some important variants.

2.  Multiple rare variants in the same gene, but different variants in different affected persons, especially when found in a gene known by more common variants or for some other physiological or functional reason to be a plausible causal factor.  The figure only shows two variants for each gene (A, B, C). But different people may have different variants in the same gene.  They aren't likely to be found in simple whole-population studies, at least not initially.  But if in some way a strong variant identifies the gene as a possible candidate, and then an examination of population data shows other people with similar traits having other variants in the same gene, the gene gains causal plausibility, even if these variants don't seem to be sufficient on their own to be detected in association studies.

3.  Tail-wagging.  If concentrations of individually rare variants are found in individuals with phenotypes at the extremes of a trait distribution, they may be suspected as being causal.  People in the tail might share similar multi-site genotypes, even if the individual variants are generally rare. If the variants consistently contribute in non-trivial ways to the trait, then maybe if we look at those individuals who clearly would have collections of such variants, we might find them.  The tail of the distribution will show us the genes and then we can search for their effects in individuals with less extreme trait values.

In the figure this is clear.  All those individuals with very low or with very high trait values have the same or nearly the same genotype (e.g., AaBBCC, AABbCC, AABBCc, and AABBCC for the larger trait values).  The same thing goes for the lower tail (arrows in the figure).  The role of the 'capital letter' variants in these genes would be clear.  Of course, this classroom figure only shows 3 different genes but one can easily imagine the same sort of thing if there were 10 or hundreds of such contributors.  Environmental effects will of course make this less clear, but the tail might still wag the causal dog for us.  So what has been found?
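The counting behind the figure can be reproduced in a few lines. This is only a minimal sketch under the figure's own toy assumptions: three genes (A, B, C), two alleles each, and every capital-letter allele adding one equal unit to the trait. The function names are ours, invented for illustration, and the tally counts distinct genotypes without weighting them by how common they would be in a real population.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy model: three genes (A, B, C), two alleles each, and every
# capital-letter allele adds one unit to the trait value.
GENES = "ABC"


def locus_genotypes(gene):
    """Return the three unordered genotypes at one gene, e.g. AA, Aa, aa."""
    upper, lower = gene.upper(), gene.lower()
    return [upper + upper, upper + lower, lower + lower]


def genotypes_by_trait_value(genes=GENES):
    """Group all multi-gene genotypes by their count of capital-letter alleles."""
    groups = defaultdict(list)
    for combo in product(*(locus_genotypes(g) for g in genes)):
        genotype = "".join(combo)
        value = sum(ch.isupper() for ch in genotype)
        groups[value].append(genotype)
    return dict(groups)


if __name__ == "__main__":
    # One line per trait value: value, number of genotypes, the genotypes.
    for value, members in sorted(genotypes_by_trait_value().items()):
        print(value, len(members), " ".join(members))
```

Under these assumptions the two highest trait values are reached by only the four genotypes named above, while the middle value can be reached by seven different genotypes.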

Checking for the wag
Unfortunately, studies of extreme phenotypes have not yielded much, except for those already long-known because they are basically single-gene traits (CF, PKU, Tay-Sachs, MS, ...), which mainly didn't require GWAS etc. to find.  Once the major genes were identified, other variants of lesser effect have indeed been found, and though the story is more complex than that, for our purposes the single-gene effects did their job.

More importantly, for common traits one might hope to find clearer causation in very-high phenotype individuals.  However, where this has been looked for, such as in studies to map the genetic effects on traits like intelligence and stature, investigators have not found much tail-vs-middle difference, as far as we are aware.  A new study of supercentenarians, people in the extreme of the longevity distribution, did not find anything that explained their long survivorship.  The tail is not wagging the dog!

How can this be?  Is environment obscuring things even in the extremes?  Is it the reason those few people are in the extremes?  Or are we making some other mistaken assumption?  How can the extremes not be causally simpler than in the bulk of the population?  This seems a conundrum.

If we believe the evidence, there seem to be as many ways to be in the tails as to be in the middle of the trait distribution, with each person being genotypically unique.  The tails are not wagging the causal dog.  But why not?

What might this mean?  
This is curious because if there are finite numbers of contributing genes, with a distribution of allelic effects, the normal (unimodal, or central tendency) trait distribution, with most people in the middle, would suggest that there are more ways to get there than there are to be in the extremes.  This should also be true of mixtures of rare variants, shouldn't it?  Maybe not!

The figure is a simplified representation of the classical model for polygenic traits, due to RA Fisher, which essentially assumed an infinite number of sites, each contributing an infinitesimal amount.  In the limit, there are infinitely many ways to be in any part of the distribution. The model is powerful in its applications to various areas of genetics and seems to be basically right, but perhaps there are some key problems with the infinities and infinitesimals underlying the model.
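A rough way to see both sides of this is to count genotypes as the number of contributing genes grows. The sketch below keeps the figure's toy assumptions (two alleles per gene, equal effects, no environment) and counts distinct genotypes, not their population frequencies. The function names and the 5% tail cutoff are our own illustrative choices, not part of Fisher's model.

```python
def ways_per_trait_value(n_genes):
    """Count distinct genotypes for each number of capital-letter alleles.

    Each gene contributes 0, 1, or 2 capital alleles (aa, Aa, AA), so the
    counts are the coefficients of (1 + x + x**2) ** n_genes.
    """
    counts = [1]
    for _ in range(n_genes):
        nxt = [0] * (len(counts) + 2)
        for i, c in enumerate(counts):
            for add in (0, 1, 2):
                nxt[i + add] += c
        counts = nxt
    return counts


def tail_summary(n_genes, tail_fraction=0.05):
    """Return (ways at the modal trait value, ways in the upper tail).

    The upper tail is the set of highest trait values that together hold at
    most `tail_fraction` of all genotypes; an illustrative cutoff only.
    """
    counts = ways_per_trait_value(n_genes)
    total = sum(counts)
    tail = 0
    for c in reversed(counts):
        if tail + c > tail_fraction * total:
            break
        tail += c
    return max(counts), tail


if __name__ == "__main__":
    for n in (3, 10, 20, 50):
        mode_ways, tail_ways = tail_summary(n)
        print(n, mode_ways, tail_ways)
```

In this sketch the modal value always has more genotypes than the tail, yet the absolute number of ways to land in the tail also keeps growing as genes are added. That is the sense in which 'more ways to be in the middle' and 'a great many ways to be in the tail' can both hold.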

Historically these 'infinities' played a major role in reconciling discrete Mendelian inheritance with quantitative traits and their inheritance, which was an important factor in the 'modern evolutionary synthesis' in the 1930s.  The general theory has served evolutionary biologists and experimental breeders very well for nearly a century.  But is it correct, or have we now found a problem area?

I raise this question because in the limit we must be reaching different levels of 'infinity' if our notion is that the reason for the central tendency is that there are more ways to get there than there are to be in the tails--just the assumption we are testing.  But in the limit, to get a smooth distribution and its properties, we essentially assume a greater infinity of ways to be modal than the infinity of ways to be in the tails, and this may be an approximation that makes little sense in the genomic enumeration era--or else that tells us something we need to know.  Infinities are approximations, but maybe the idea of very many contributors runs into practical issues in the kinds of data that GWAS and other studies are using, even their enormous samples.

Could the lack of 'wag' be that we are not dealing with what mathematicians or physicists would call 'well-posed' questions? Maybe stature, obesity, diabetes, or heart disease are not biologically unitary traits.  Then, if they are instead complexes of multiple partly independent (and separately evolved) traits, maybe being simple in the tail is not what we should expect.  Maybe what we are calling a trait is not what evolution 'called' any such thing.

Or it could be that 'infinity' here just means a great many, so that the gist of things is that even in the tails, there really is an essentially unlimited number of ways to inherit few or many small (left tail) or large-effect (right tail) alleles.  And since in any case the number of different combinations is large, and the frequency of any specific variant small, statistical methods can't enumerate them very well.  One may get into the tails because his/her huge collection of rare or even unique variants has in aggregate more 'large' effect than the collections of those in the middle of the distribution, but each person is unique and the extra individual effects are trivially small.  Our methods are not suited to detect this.  Or maybe the same variants are found across the distribution, but they are individually slightly more common in those with traits in the tail.

Maybe these differences, and/or even the specific variants involved, are so small or the variants uniquely rare, that aggregate, statistical, probabilistic or distribution-based kinds of approaches just won't find them, or just can't find them.  If that is the case, then neither our questions nor our methods are well-posed.
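To get a feel for how small such differences might be, here is a small simulation sketch. It assumes a purely additive toy model that we made up for illustration: every locus has the same allele frequency, each capital-letter allele adds one unit to the trait, and there is no environmental effect at all. The function name, parameters and default values are hypothetical.

```python
import random


def simulate_tail_enrichment(n_loci=200, n_people=2000, freq=0.5,
                             tail_fraction=0.05, seed=1):
    """Compare per-locus capital-allele frequency in the upper tail and overall.

    Assumes a purely additive toy model: every locus has the same allele
    frequency and each capital allele adds one unit to the trait.
    Returns (mean frequency in the tail, mean frequency in everyone).
    """
    rng = random.Random(seed)
    # Each person is a list of capital-allele counts (0, 1, or 2) per locus.
    people = [
        [(rng.random() < freq) + (rng.random() < freq) for _ in range(n_loci)]
        for _ in range(n_people)
    ]
    people.sort(key=sum, reverse=True)  # highest trait values first
    n_tail = max(1, int(tail_fraction * n_people))

    def mean_allele_freq(group):
        return sum(sum(p) for p in group) / (2 * n_loci * len(group))

    return mean_allele_freq(people[:n_tail]), mean_allele_freq(people)


if __name__ == "__main__":
    tail_freq, overall_freq = simulate_tail_enrichment()
    print(tail_freq, overall_freq)
```

Even with no environmental noise, the people in the upper tail carry each capital-letter allele only modestly more often than everyone else, and that shift is spread across all the loci. Adding more loci, rarer variants or environmental effects would be expected to shrink the per-variant difference further.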

It is perhaps relevant that for many traits we really do have a simplified tail in the population: the very rare, very pathological, usually early onset and severe instances of traits that do turn out to be largely single-gene in their causation.  But they are not a sufficient part of the samples being studied by most mapping efforts.  At the same time, it is all too easy to forget, or conveniently ignore, the massive effects lifestyle factors can have on achieved traits, be they physical or behavioral.  No wonder, even in the tails of the distribution, we don't get a clear genetic wag!

Here, at least, there are things to think about.  These are real questions.  Technical statisticians (which we're not!) may have explanations--but if so, they haven't led to much in the way of clear causal tractability or these general issues about mapping would not be of such widespread concern. How can the tail not be notably simpler in its causation?  Has our explanation here missed something important, or are geneticists as a community missing something?

The questions have both empirical and theoretical meaning.  Whatever one's view about GWAS and massive whole genome sequencing with the goal of predictive medicine, at least the issues raised are perhaps things we all could agree about.

2 comments:

Peter said...

Epistasis and dominance effects will prove to be the key in my opinion.

The vast majority of studies can only detect the additive effects of a given allele on a trait. If an allele promotes phenotype P on one genetic background, but not on another, then the study will be blind to it.

We could have predicted this right back at the start for any phenotype that has significant effects on reproductive success. Natural selection will eliminate any alleles that have significant additive detrimental effects on fitness. It will by definition leave only those alleles where the causal chain is too obscure/indirect to allow selection to operate. These will thus also be too obscure/indirect to be detected in our studies, and too contingent on other factors to be useful for prediction.

Short version: If you can use a given gene variant to reliably diagnose or predict a disease, then evolution can, will, and already has gotten rid of it.

Ken Weiss said...

Peter,
As far as you go, one would agree that this is the general gist of population genetics views of things. But the jury is still out as to the degree to which missing heritability is due to interactions, and the traits being studied are largely post-reproductive and not affecting fitness. Even just additive contributions should be less diverse in the tails of the distribution by the usual, if informal, model assumptions. If very large, individual-specific collections of very rare, mainly very weak, factors are responsible, our typical models will not be finding them (and they aren't). Also, things are similar for non-disease traits. I do think that selection by nature and prediction by Francis Collins are similar in terms of allelic or genotypic effects.

Interactions need not be simple pairwise ones, however. An allele can interact in a distribution of ways depending on context. Some may put the result in the tail of the trait distribution. Our models just aren't looking for things like that, in my view, and I don't think anything I've said here disagrees with your comments.